# Introduction

The table1c package is a light wrapper around the package table1, with some customizations for the convenience of Certara IDD.

This vignette serves as a user guide for the package. We run through an example using a simulated dataset, Xyz-pk.csv, which pools data from three hypothetical studies and is in the style of a NONMEM PopPK dataset. Click here to download it (note: due to a Chrome bug the file may download with a .xls extension, which is incorrect; locate the downloaded file and change the extension to .csv).

(Note: The dataset has column names in all lower case letters, versus the more traditional upper case used in NONMEM. This is deliberate: we end up typing these names a lot, and avoiding the multi-key combinations needed for capital letters is not only faster, but also decreases the risk of repetitive strain injury.)

# Using a Data Specification File

One of the benefits of this package is the ability to separate meta-data (data about data) from scripting logic by segregating the meta-data into a central location, the data specification file. The result is scripts that are simpler, more generic and more re-usable.

The data specification is written in YAML (see below), a language well suited to encoding data or meta-data clearly and concisely.

## What is YAML?

YAML is a language for encoding structured data, similar to XML or JSON, but geared more towards human readability (you may already be familiar with YAML since it is used in the header of R Markdown documents). It has the advantage of being very easy for humans to read and write, while also being machine parseable (although it looks natural, it has strict syntactic rules), with support in many popular languages, including R. You can edit YAML files in RStudio (with syntax highlighting).

## The data specification file

The way YAML works is best illustrated with an example. The current directory contains the file data_spec.yaml which contains the following:

```yaml
dataset:     Xyz-pk.csv

labels:
  sex:       Sex
  race:      Race
  ethnic:    Ethnicity
  hv:        Health Status
  age:       Age (y)
  agecat:    Age Group
  wt:        Body Weight (kg)
  ht:        Height (cm)
  bmi:       BMI (kg/m²)
  bsa:       BSA (m²)
  alb:       Albumin (g/L)
  alp:       ALP (U/L)
  alt:       ALT (U/L)
  ast:       AST (U/L)
  bili:      Bilirubin (µmol/L)
  crcl:      CrCL (mL/min)
  fdarenal:  Renal Impairment, FDA Classification
  form:      Formulation
  fasted:    Fasting Status

categoricals:
  study:
    - 1: Xyz-hv-01
    - 2: Xyz-ri-02
    - 7: Xyz-ph3-07
  sex:
    - 0: Male
    - 1: Female
  race:
    - 1: White
    - 2: Black or African American
    - 3: Asian
    - 4: American Indian or Alaskan Native
    - 5: Native Hawaiian or Other Pacific Islander
    - 6: Multiple or Other
  ethnic:
    - 1: Not Hispanic or Latino
    - 2: Hispanic or Latino
    - -99: Not reported
  agecat:
    - 0: < 65 years
    - 1: "\u2265 65 years"
  hv:
    - 1: Healthy Subject
    - 0: Patient
  fdarenal:
    - 0: Normal
    - 1: Mild
    - 2: Moderate
    - 3: Severe
  form:
    - 1: Capsule
    - 2: Tablet
  fasted:
    - 0: Fed
    - 1: Fasted
    - -99: Unknown
```

The meaning of the file contents is intuitively clear. Indentation denotes hierarchical structure, and line breaks separate items from each other. Spaces are used for indentation; other than that, spaces are basically ignored (except inside strings).

Warning: Make sure you are not using tabs instead of spaces; it can be hard to tell, and YAML is sensitive to this difference. If you have an error when reading the file, this is something to check. Most editor programs have a setting that will cause the tab key to insert a number of spaces instead of a tab character (RStudio does under ‘Tools>Global Options>Code>Editing>General’). Some also have a feature that allows you to “see” the whitespace characters (in RStudio it’s in ‘Tools>Global Options>Code>Display>General’) which can help to debug the problem.

Data structures come in two forms: sequential (i.e. lists) and named (i.e. dictionaries). For sequential data, each element is preceded by a dash and whitespace (don’t use tabs) (e.g. - item); thus, it looks the way one would write a list in a plain-text e-mail, for instance. Named data consists of key-value pairs, where a colon and whitespace (don’t use tabs) separate the key from the value (e.g. key: value); if the value is itself a nested structure, it can appear indented starting on the next line (same for list items). Primitive types (numbers, strings) are written the way one would write them naturally. In most cases, strings do not need to be quoted (but they can be); there are some exceptions though. Strings can contain Unicode symbols. For more details on the syntax, see the YAML documentation.

In the example above, the whole file encodes a named structure, with 3 top-level items: dataset, labels and categoricals. The dataset item contains a single string, the name of a .csv file that contains the data to which this meta-data is associated. The labels item contains another named structure: key-value pairs of column names and associated labels. The last item, categoricals, contains information on the coding of certain variables (i.e., variables that are really categorical but have been assigned numeric codes in the dataset). When the data is presented in a table, these variables should be translated back to their original descriptive identifiers. Nested within the categoricals item is another named structure. Here, the names correspond to columns in the dataset, and the values are lists, whereby each list item relates a (numeric) code to its (string) identifier.

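To get a feel for how YAML maps onto R objects, a structure like the one above can be parsed directly with the yaml package (a small illustration only; read_from_spec() takes care of this for you):

```r
library(yaml)

# A miniature version of the spec above, parsed into nested R lists
spec <- yaml.load("
dataset: Xyz-pk.csv
labels:
  sex: Sex
categoricals:
  sex:
    - 0: Male
    - 1: Female
")

spec$dataset                   # "Xyz-pk.csv"
spec$labels$sex                # "Sex"
length(spec$categoricals$sex)  # 2: one list element per code/label pair
```

Named YAML structures become named R lists, and sequences become unnamed lists, which is what makes the spec file straightforward to process programmatically.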
(Note: currently this file needs to be written by hand, but in the future its generation could be partially or fully automated.)

## Reading a dataset from its specification

With the data_spec.yaml file above, we can use the read_from_spec() function to read the data and have it augmented with the meta-data from the spec file:

```r
# Read in the data from its 'spec'
dat <- read_from_spec("data_spec.yaml")
```

Note that we did not need to include the name of the data file in our script, since it is contained in the spec.

Before proceeding to describe the baseline characteristics of our study subjects, we need to make sure that each individual is only counted once. There is a convenience function for that:

```r
# Filter the data, one row per ID
dat <- one_row_per_id(dat, "id")
```

Only columns that are invariant (and hence unambiguous) within each ID level are retained.

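As a mental model, one can think of one_row_per_id() as doing something like the following (a simplified sketch, not the package's actual implementation):

```r
# Sketch: keep the first row per ID, retaining only those columns
# whose value is constant within every ID (hypothetical helper)
one_row_per_id_sketch <- function(dat, idcol) {
  # A column is invariant if it has exactly one unique value per ID
  invariant <- vapply(names(dat), function(col) {
    all(tapply(dat[[col]], dat[[idcol]], function(x) length(unique(x))) == 1)
  }, logical(1))
  invariant[idcol] <- TRUE  # always keep the ID column itself
  dat[!duplicated(dat[[idcol]]), invariant, drop=FALSE]
}

d <- data.frame(id=c(1, 1, 2), time=c(0, 1, 0), wt=c(70, 70, 80))
one_row_per_id_sketch(d, "id")  # 2 rows; 'time' is dropped (it varies)
```

Time-varying columns like observation times or concentrations are thus excluded, which is exactly what one wants for a baseline characteristics table.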
Here are six random rows of the resulting dataset:

```r
dat[sample(1:nrow(dat), 6),]
##      id      study    sex              race                 ethnic
## 928 159 Xyz-ph3-07 Female             White     Hispanic or Latino
## 118  14  Xyz-hv-01   Male             White Not Hispanic or Latino
## 442  50  Xyz-ri-02 Female Multiple or Other Not Hispanic or Latino
## 764 118 Xyz-ph3-07 Female             White Not Hispanic or Latino
## 379  43  Xyz-ri-02 Female             Asian           Not reported
## 968 169 Xyz-ph3-07 Female             White Not Hispanic or Latino
##                  hv    form  fasted age     agecat       wt       ht
## 928         Patient  Tablet Unknown  40 < 65 years 69.11780 162.6218
## 118 Healthy Subject Capsule     Fed  32 < 65 years 94.09640 178.6365
## 442         Patient  Tablet  Fasted  79 ≥ 65 years 46.29503 154.5257
## 764         Patient  Tablet Unknown  37 < 65 years 63.69094 168.2729
## 379         Patient  Tablet  Fasted  67 ≥ 65 years 75.17084 156.0623
## 968         Patient  Tablet Unknown  69 ≥ 65 years 51.20350 149.6104
##          bmi      bsa      alb       alp      alt      ast      bili     creat
## 928 26.13560 1.742997 41.39164  59.16552 20.20104 29.25946  5.831341  59.48539
## 118 29.48715 2.127224 42.39993 102.36809 29.20991 29.82011 15.139345 106.61792
## 442 19.38798 1.416590 37.22390 140.36920 11.99238 29.56350  8.078682 159.12966
## 764 22.49309 1.725678 41.66294  66.00227 10.89596 58.22654  6.864295  46.61096
## 379 30.86407 1.753188 36.73265 131.09500 17.25876 15.40761 10.866950 183.66983
## 968 22.87580 1.444328 35.25484  60.00767 23.62474 16.59708  8.345510  65.85870
##          crcl fdarenal
## 928 121.26020   Normal
## 118 117.02707   Normal
## 442  18.52046   Severe
## 764 146.88099   Normal
## 379  31.17979 Moderate
## 968  57.60809 Moderate
```

Note that the categorical variables (which were numeric in the original .csv file) have been translated to factors, with the appropriate textual labels, and in the desired order. Compared to the corresponding R code that would be needed to achieve this, the YAML specification is much cleaner and more concise. Similarly for the label attributes.

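For a sense of what the spec saves you from writing, here is the equivalent base R code for a single variable, illustrated on a toy data frame (the spec handles every variable automatically):

```r
# Base R equivalent of the spec, for just one variable (toy example)
raw <- data.frame(sex = c(0, 1, 1))   # numeric codes, as in the raw .csv

raw$sex <- factor(raw$sex, levels=c(0, 1), labels=c("Male", "Female"))
attr(raw$sex, "label") <- "Sex"
# ...and so on, repeated for every categorical variable and every label
```

Multiplied over the full set of variables, this boilerplate quickly dwarfs the YAML file.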
Note on preserving label attributes: in most cases, subsetting a data.frame results in the label attributes being stripped away. The function subsetp() (‘p’ for preserve) can be used to avoid this. (It is used internally in one_row_per_id(), for instance.)

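To see why this matters, note what base R subsetting does to a label attribute, and how it can be preserved (a sketch of the idea, not the actual implementation of subsetp()):

```r
df <- data.frame(age = c(40, 32, 67))
attr(df$age, "label") <- "Age (y)"

sub <- df[df$age > 35, , drop=FALSE]
attr(sub$age, "label")    # NULL: the label was stripped

# Sketch: re-attach each column's attributes after subsetting
subsetp_sketch <- function(df, i) {
  out <- df[i, , drop=FALSE]
  for (col in names(out))
    attributes(out[[col]]) <- attributes(df[[col]])
  out
}

sub2 <- subsetp_sketch(df, df$age > 35)
attr(sub2$age, "label")   # "Age (y)"
```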
# Abbreviations in a footnote

By convention, it is required that all abbreviations appearing in a table (or figure) be spelled out in full in a footnote. The package contains a mechanism for generating such footnotes in a convenient, semi-automated way. It uses a higher-order function (i.e., a function that returns a new function) called make_abbrev_footnote(). It again uses a YAML file (or a simple list), in this case to specify a complete list of abbreviations that can be drawn from.

In this example, the current directory contains the file abbrevs.yaml, the contents of which are as follows:

```yaml
ALP  : alkaline phosphatase
ALT  : alanine aminotransferase
AST  : aspartate aminotransferase
BMI  : body mass index
BSA  : body surface area
CrCL : creatinine clearance
SD   : standard deviation
CV   : coefficient of variation
Max  : maximum
Min  : minimum
"N"  : number of subjects
FDA  : Food and Drug Administration
```

The meaning of this file is pretty self-explanatory.

To use this file, we pass its name to the function make_abbrev_footnote(), which returns a new function that knows how to expand the abbreviations listed in the YAML file, generating a string that can be passed to the footnote argument of table1().

```r
# Set up function for abbreviation footnotes
abbrev_footnote <- make_abbrev_footnote("abbrevs.yaml")
```

We can test it out:

```r
abbrev_footnote("N", "FDA")
## [1] "FDA=Food and Drug Administration; N=number of subjects."
```

Note that in the result (by default), the abbreviations are sorted alphabetically. Thus, the idea is to identify the abbreviations that appear in the table, and pass them as string arguments (in any order) to the newly created abbrev_footnote function.

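The higher-order pattern itself is simple: the outer function captures the table of abbreviations in a closure, and the returned function only needs the keys. A simplified sketch (not the package's actual implementation):

```r
# Simplified sketch of the closure pattern behind make_abbrev_footnote()
make_abbrev_footnote_sketch <- function(abbrevs) {
  function(...) {
    keys <- sort(c(...))   # abbreviations are sorted alphabetically
    paste0(paste(keys, unlist(abbrevs[keys]), sep="=", collapse="; "), ".")
  }
}

fn <- make_abbrev_footnote_sketch(list(
  N   = "number of subjects",
  FDA = "Food and Drug Administration"))

fn("N", "FDA")
## [1] "FDA=Food and Drug Administration; N=number of subjects."
```

Because the abbreviation table lives in the closure, each table only needs to name the abbreviations it uses.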
# Tables of Baseline Characteristics

To recap, so far our script has 3 lines of code:

```r
# Read in the data from its 'spec'
dat <- read_from_spec("data_spec.yaml")

# Filter the data, one row per ID
dat <- one_row_per_id(dat, "id")

# Set up function for abbreviation footnotes
abbrev_footnote <- make_abbrev_footnote("abbrevs.yaml")
```

That is all we need to be able to start creating our tables!

The variables to include in the table are specified using a one-sided formula, with stratification denoted by conditioning (i.e., the name of the stratification variable appears to the right of a vertical bar).

In this case, the data contains three studies, and the descriptive statistics are presented stratified by study, and overall. In general, if there are multiple studies it makes sense to stratify by study, and if there is a single study, there is usually some other variable that it makes sense to stratify on, like treatment arm or cohort.

In this example, I have split the baseline characteristics into two tables by logical grouping, simply because there are too many of them to fit comfortably in a single table. The logical groups I have used are:

• Demographics
• Laboratory tests

And here are the results:

## Summary of Baseline Characteristics in the PK Population – Demographic

```r
table1(~ sex + race + ethnic + hv + age + agecat + wt + ht + bmi + bsa | study, data=dat,
       footnote=abbrev_footnote("BMI", "BSA", "SD", "Min", "Max", "N"))
```

|                                           | Xyz-hv-01 (N=16) | Xyz-ri-02 (N=43) | Xyz-ph3-07 (N=111) | Overall (N=170) |
|-------------------------------------------|------------------|------------------|--------------------|-----------------|
| **Sex**                                   |                  |                  |                    |                 |
| Male                                      | 16 (100%)        | 22 (51.2%)       | 50 (45.0%)         | 88 (51.8%)      |
| Female                                    | 0 (0%)           | 21 (48.8%)       | 61 (55.0%)         | 82 (48.2%)      |
| **Race**                                  |                  |                  |                    |                 |
| White                                     | 16 (100%)        | 19 (44.2%)       | 62 (55.9%)         | 97 (57.1%)      |
| Black or African American                 | 0 (0%)           | 4 (9.3%)         | 12 (10.8%)         | 16 (9.4%)       |
| Asian                                     | 0 (0%)           | 10 (23.3%)       | 18 (16.2%)         | 28 (16.5%)      |
| American Indian or Alaskan Native         | 0 (0%)           | 4 (9.3%)         | 5 (4.5%)           | 9 (5.3%)        |
| Native Hawaiian or Other Pacific Islander | 0 (0%)           | 2 (4.7%)         | 3 (2.7%)           | 5 (2.9%)        |
| Multiple or Other                         | 0 (0%)           | 4 (9.3%)         | 11 (9.9%)          | 15 (8.8%)       |
| **Ethnicity**                             |                  |                  |                    |                 |
| Not Hispanic or Latino                    | 16 (100%)        | 36 (83.7%)       | 92 (82.9%)         | 144 (84.7%)     |
| Hispanic or Latino                        | 0 (0%)           | 4 (9.3%)         | 10 (9.0%)          | 14 (8.2%)       |
| Not reported                              | 0 (0%)           | 3 (7.0%)         | 9 (8.1%)           | 12 (7.1%)       |
| **Health Status**                         |                  |                  |                    |                 |
| Healthy Subject                           | 16 (100%)        | 0 (0%)           | 0 (0%)             | 16 (9.4%)       |
| Patient                                   | 0 (0%)           | 43 (100%)        | 111 (100%)         | 154 (90.6%)     |
| **Age (y)**                               |                  |                  |                    |                 |
| Mean (SD)                                 | 35.4 (6.09)      | 48.7 (18.1)      | 47.4 (17.1)        | 46.6 (17.0)     |
| Median (CV%)                              | 34.0 (17.2)      | 45.0 (37.1)      | 47.0 (36.0)        | 43.5 (36.4)     |
| [Min, Max]                                | [28.0, 45.0]     | [19.0, 79.0]     | [20.0, 80.0]       | [19.0, 80.0]    |
| **Age Group**                             |                  |                  |                    |                 |
| < 65 years                                | 16 (100%)        | 34 (79.1%)       | 88 (79.3%)         | 138 (81.2%)     |
| ≥ 65 years                                | 0 (0%)           | 9 (20.9%)        | 23 (20.7%)         | 32 (18.8%)      |
| **Body Weight (kg)**                      |                  |                  |                    |                 |
| Mean (SD)                                 | 76.7 (15.6)      | 73.0 (15.6)      | 69.7 (15.1)        | 71.2 (15.3)     |
| Median (CV%)                              | 72.5 (20.3)      | 74.2 (21.4)      | 69.0 (21.7)        | 71.4 (21.6)     |
| [Min, Max]                                | [53.1, 108]      | [45.0, 102]      | [35.9, 119]        | [35.9, 119]     |
| **Height (cm)**                           |                  |                  |                    |                 |
| Mean (SD)                                 | 178 (7.91)       | 169 (12.0)       | 169 (10.0)         | 170 (10.7)      |
| Median (CV%)                              | 178 (4.5)        | 169 (7.1)        | 168 (5.9)          | 170 (6.3)       |
| [Min, Max]                                | [165, 196]       | [142, 192]       | [146, 192]         | [142, 196]      |
| **BMI (kg/m²)**                           |                  |                  |                    |                 |
| Mean (SD)                                 | 24.2 (3.94)      | 25.8 (5.26)      | 24.3 (4.09)        | 24.7 (4.42)     |
| Median (CV%)                              | 25.2 (16.3)      | 25.4 (20.4)      | 24.0 (16.8)        | 24.3 (17.9)     |
| [Min, Max]                                | [17.4, 29.5]     | [16.1, 39.5]     | [15.6, 38.8]       | [15.6, 39.5]    |
| **BSA (m²)**                              |                  |                  |                    |                 |
| Mean (SD)                                 | 1.94 (0.212)     | 1.82 (0.224)     | 1.79 (0.223)       | 1.81 (0.225)    |
| Median (CV%)                              | 1.87 (10.9)      | 1.79 (12.3)      | 1.77 (12.4)        | 1.79 (12.4)     |
| [Min, Max]                                | [1.58, 2.41]     | [1.32, 2.24]     | [1.25, 2.32]       | [1.25, 2.41]    |

BMI=body mass index; BSA=body surface area; Max=maximum; Min=minimum; N=number of subjects; SD=standard deviation.

## Summary of Baseline Characteristics in the PK Population – Laboratory Tests

```r
table1(~ alb + alp + alt + ast + bili + crcl + fdarenal | study, data=dat,
       footnote=abbrev_footnote("ALP", "ALT", "AST", "CrCL", "FDA", "SD", "Min", "Max", "N"))
```

|                                          | Xyz-hv-01 (N=16) | Xyz-ri-02 (N=43) | Xyz-ph3-07 (N=111) | Overall (N=170) |
|------------------------------------------|------------------|------------------|--------------------|-----------------|
| **Albumin (g/L)**                        |                  |                  |                    |                 |
| Mean (SD)                                | 46.2 (4.74)      | 41.4 (4.27)      | 41.8 (4.18)        | 42.1 (4.44)     |
| Median (CV%)                             | 45.6 (10.2)      | 41.6 (10.3)      | 41.6 (10.0)        | 42.0 (10.5)     |
| [Min, Max]                               | [39.1, 55.9]     | [33.2, 49.4]     | [32.7, 51.3]       | [32.7, 55.9]    |
| **ALP (U/L)**                            |                  |                  |                    |                 |
| Mean (SD)                                | 74.0 (39.4)      | 82.6 (35.9)      | 82.4 (33.1)        | 81.7 (34.3)     |
| Median (CV%)                             | 55.5 (53.3)      | 73.2 (43.5)      | 75.6 (40.1)        | 74.2 (42.0)     |
| [Min, Max]                               | [29.8, 157]      | [31.5, 177]      | [21.2, 186]        | [21.2, 186]     |
| **ALT (U/L)**                            |                  |                  |                    |                 |
| Mean (SD)                                | 26.4 (16.3)      | 20.1 (11.7)      | 21.9 (15.4)        | 21.9 (14.7)     |
| Median (CV%)                             | 23.2 (61.6)      | 17.7 (58.3)      | 18.8 (70.4)        | 19.0 (67.1)     |
| [Min, Max]                               | [10.8, 79.0]     | [6.36, 73.7]     | [4.43, 104]        | [4.43, 104]     |
| **AST (U/L)**                            |                  |                  |                    |                 |
| Mean (SD)                                | 28.5 (16.0)      | 25.1 (12.9)      | 24.5 (12.0)        | 25.0 (12.6)     |
| Median (CV%)                             | 25.9 (56.2)      | 22.7 (51.4)      | 21.0 (48.9)        | 21.7 (50.3)     |
| [Min, Max]                               | [8.11, 62.1]     | [6.24, 65.4]     | [6.89, 69.3]       | [6.24, 69.3]    |
| **Bilirubin (µmol/L)**                   |                  |                  |                    |                 |
| Mean (SD)                                | 10.0 (3.98)      | 8.49 (3.72)      | 10.0 (5.13)        | 9.63 (4.74)     |
| Median (CV%)                             | 9.68 (39.7)      | 8.24 (43.8)      | 8.66 (51.3)        | 8.64 (49.2)     |
| [Min, Max]                               | [4.20, 15.3]     | [3.07, 18.3]     | [2.44, 32.6]       | [2.44, 32.6]    |
| **CrCL (mL/min)**                        |                  |                  |                    |                 |
| Mean (SD)                                | 131 (31.1)       | 49.0 (16.1)      | 112 (42.3)         | 97.7 (46.4)     |
| Median (CV%)                             | 120 (23.7)       | 50.1 (32.9)      | 103 (37.8)         | 94.3 (47.5)     |
| [Min, Max]                               | [96.1, 221]      | [18.5, 95.9]     | [38.2, 233]        | [18.5, 233]     |
| **Renal Impairment, FDA Classification** |                  |                  |                    |                 |
| Normal                                   | 16 (100%)        | 1 (2.3%)         | 76 (68.5%)         | 93 (54.7%)      |
| Mild                                     | 0 (0%)           | 8 (18.6%)        | 25 (22.5%)         | 33 (19.4%)      |
| Moderate                                 | 0 (0%)           | 30 (69.8%)       | 10 (9.0%)          | 40 (23.5%)      |
| Severe                                   | 0 (0%)           | 4 (9.3%)         | 0 (0%)             | 4 (2.4%)        |

ALP=alkaline phosphatase; ALT=alanine aminotransferase; AST=aspartate aminotransferase; CrCL=creatinine clearance; FDA=Food and Drug Administration; Max=maximum; Min=minimum; N=number of subjects; SD=standard deviation.

For continuous variables, the arithmetic mean, standard deviation, median, coefficient of variation, minimum and maximum are displayed on three lines. For categorical variables, the frequency and percentage (within each column) are shown, one line per category. Missing values, if any, are also shown as a count and percentage. The following rounding rules are applied:

• All percentages are rounded to 1 decimal place, with the exception of 0% and 100% which are simply shown as is.
• All other numbers are rounded to 3 significant digits. This includes numbers bigger than 1,000 (so 123,456 gets rounded to 123,000 for instance).
• A ‘5’ digit is always rounded up. This is NOT the usual convention in R (which follows the IEC 60559 standard), but is the standard in SAS and Excel, for instance.

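The difference in the handling of a trailing ‘5’ can be seen directly in R (round_half_up() below is an illustrative sketch, not a function from the package):

```r
# R's default rounding follows IEC 60559 ("round half to even"):
round(0.5)   # 0
round(1.5)   # 2
round(2.5)   # 2

# "Always round half up", as in SAS and Excel (illustrative sketch):
round_half_up <- function(x, digits = 0) {
  f <- 10^digits
  floor(x * f + 0.5) / f
}
round_half_up(0.5)   # 1
round_half_up(2.5)   # 3

# Rounding to 3 significant digits also applies to large numbers:
signif(123456, 3)    # 123000
```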
Continuous and categorical variables can be mixed in the same table (note that age and age category are next to each other).

The above tables (including headings and footnotes) can be copied from the Chrome browser directly to a Word document (or Excel sheet). All formatting will be preserved (the one exception I have found is that in certain cases superscripts or subscripts may be too small, so it might be necessary to reset the font size for the whole table to 10 pt). Pasting into PowerPoint works too, but preserves the formatting less well, so it may be better to paste into a Word document first, and then copy that to PowerPoint (use the “Keep Source Formatting” option when pasting).

The complete example can be found on GitHub.

# R session information

```r
sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] table1c_0.1     table1_1.2      rmarkdown_2.2   nvimcom_0.9-102
##
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.1  magrittr_1.5    htmltools_0.5.0 tools_4.0.1
##  [5] yaml_2.2.1      stringi_1.4.6   knitr_1.28      Formula_1.2-3
##  [9] stringr_1.4.0   xfun_0.14       digest_0.6.25   rlang_0.4.6
## [13] evaluate_0.14
```