The table1c package is a light wrapper around the package table1, with some customizations for the convenience of Certara IDD.
This vignette serves as the User Guide for the package. We run through an example using a simulated dataset, Xyz-pk.csv, with data pooled from three hypothetical studies. This dataset is in the style of a NONMEM PopPK dataset. Click here to download it (note: due to a Chrome bug the file may download with a .xls extension, which is incorrect; find the file where it was saved and change the extension to .csv).
(Note: The dataset has column names in all lower case letters, versus the more traditional upper case used in NONMEM. This is preferred because we end up typing these names a lot, and by avoiding the strain of multi-key combinations needed for capital letters it is not only faster to type, but also decreases the risk of repetitive strain injury.)
One of the benefits of this package is the ability to separate meta-data (data about data) from scripting logic, by segregating meta-data into a central location (the data specification file), which results in scripts that are simpler, more generic and re-usable.
The data specification is written in YAML (see below), a suitable language for encoding data or meta-data, which allows it to be clear and concise. (Note: currently this file needs to be written by hand, but in the future its generation may be partially or fully automated.)
YAML is a markup language for encoding structured data, similar to XML or JSON, but more geared towards human readability (you may already be familiar with YAML since it is used in the header of R markdown documents). It has the advantage of being both very easy for humans to read and write, as well as machine parseable (because although it looks natural, it actually has strict syntactic rules), with support in many popular languages, including R. You can edit YAML files in RStudio (with syntax highlighting).
The way YAML works is best illustrated with an example. The current directory contains the file data_spec.yaml, which contains the following:
dataset: Xyz-pk.csv
labels:
  sex: Sex
  race: Race
  ethnic: Ethnicity
  hv: Health Status
  age: Age (y)
  agecat: Age Group
  wt: Body Weight (kg)
  ht: Height (cm)
  bmi: BMI (kg/m²)
  bsa: BSA (m²)
  alb: Albumin (g/L)
  alp: ALP (U/L)
  alt: ALT (U/L)
  ast: AST (U/L)
  bili: Bilirubin (µmol/L)
  crcl: CrCL (mL/min)
  fdarenal: Renal Impairment, FDA Classification
  form: Formulation
  fasted: Fasting Status
categoricals:
  study:
    - 1: Xyz-hv-01
    - 2: Xyz-ri-02
    - 7: Xyz-ph3-07
  sex:
    - 0: Male
    - 1: Female
  race:
    - 1: White
    - 2: Black or African American
    - 3: Asian
    - 4: American Indian or Alaskan Native
    - 5: Native Hawaiian or Other Pacific Islander
    - 6: Multiple or Other
  ethnic:
    - 1: Not Hispanic or Latino
    - 2: Hispanic or Latino
    - -99: Not reported
  agecat:
    - 0: < 65 years
    - 1: "\u2265 65 years"
  hv:
    - 1: Healthy Subject
    - 0: Patient
  fdarenal:
    - 0: Normal
    - 1: Mild
    - 2: Moderate
    - 3: Severe
  form:
    - 1: Capsule
    - 2: Tablet
  fasted:
    - 0: Fed
    - 1: Fasted
    - -99: Unknown
The meaning of the file contents is intuitively clear. Indentation is used to denote hierarchical structure. Line breaks separate items from each other. Spaces are used to indent things, and other than that spaces are basically ignored (except inside strings).
Warning: Make sure you are not using tabs instead of spaces; it can be hard to tell, and YAML is sensitive to this difference. If you have an error when reading the file, this is something to check. Most editor programs have a setting that will cause the tab key to insert a number of spaces instead of a tab character (RStudio does under ‘Tools>Global Options>Code>Editing>General’). Some also have a feature that allows you to “see” the whitespace characters (in RStudio it’s in ‘Tools>Global Options>Code>Display>General’) which can help to debug the problem.
Data structures come in 2 forms: sequential (i.e. lists) and named (i.e. dictionaries). For sequential data, each element is preceded by a dash and whitespace (don’t use tabs) (e.g. - item); thus, it looks the way one would write a list in a plain-text e-mail, for instance. Named data consists of key-value pairs, where a colon and whitespace (don’t use tabs) separate the key from the value (e.g. key: value); if the value is itself a nested structure, it can appear indented starting on the next line (same for list items). Primitive types (numbers, strings) are written the way one would write them naturally. In most cases, strings do not need to be quoted (but they can be); there are some exceptions though. Strings can contain Unicode symbols. For more details on the syntax, see the YAML documentation.
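To see how these two forms map onto R objects, here is a minimal sketch that parses a small YAML string with the yaml package (an assumption: it needs to be installed). The snippet and its key names are made up for illustration and are not part of the spec file.

```r
library(yaml)  # assumption: the CRAN package 'yaml' is installed

# A made-up snippet mixing a plain value, a sequence, and a sequence of key-value pairs
snippet <- "
name: example
values:
  - 1
  - 2
codes:
  - 0: Male
  - 1: Female
"

parsed <- yaml.load(snippet)

# Named data becomes a named list; sequences become vectors or lists
str(parsed)
```

Inspecting the result with str() is a quick way to check that the indentation in a hand-written file produced the nesting that was intended.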
In the example above, the whole file encodes a named structure, with 3 top-level items: dataset, labels and categoricals. The dataset item contains a single string, the name of a .csv file that contains the data to which this meta-data is associated. The labels item contains another named structure: key-value pairs of column names and associated labels. The last item, categoricals, contains information on the coding of certain variables (i.e., variables that are really categorical but have been assigned numeric codes in the dataset). When the data is presented in a table, these variables should be translated back to their original descriptive identifiers. Nested within the categoricals item is another named structure. Here, the names correspond to columns in the dataset, and the values are lists, whereby each list item relates a (numeric) code to its (string) identifier.
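The translation from numeric codes to descriptive identifiers can be sketched in base R as follows. This is only an illustration of the idea, not the package's implementation; the coding list is hand-built to mirror the fdarenal entry of the spec, and the raw values are made up.

```r
# Hand-built stand-in for one entry under 'categoricals' (mirrors 'fdarenal')
coding <- list(list(`0` = "Normal"), list(`1` = "Mild"),
               list(`2` = "Moderate"), list(`3` = "Severe"))

# Flatten the list of single-pair items into parallel vectors of codes and labels
codes  <- as.numeric(vapply(coding, names, character(1)))
labels <- vapply(coding, function(item) item[[1]], character(1))

# Made-up numeric values, as they might appear in the .csv file
raw <- c(0, 2, 3, 0, 1)

# The order of the list items determines the order of the factor levels
decoded <- factor(raw, levels = codes, labels = labels)
print(decoded)
```

Because the levels follow the order of the list items, the order in which codes are written in the spec controls the order in which categories later appear in a table.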
With the data_spec.yaml file above, we can use the read_from_spec() function to read the data and have it augmented with the meta-data from the spec file:
# Read in the data from its 'spec'
dat <- read_from_spec("data_spec.yaml")
Note that we did not need to include the name of the data file in our script, since it is contained in the spec.
Before proceeding to describe the baseline characteristics of our study subjects, we need to make sure that each individual is only counted once. There is a convenience function for that:
# Filter the data, one row per ID
dat <- one_row_per_id(dat, "id")
Only columns that are invariant (and hence unambiguous) within each ID level are retained.
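What "invariant within each ID level" means can be illustrated with a base R sketch on a toy dataset. The data frame and its column names below are made up, and this is not how one_row_per_id() is implemented; it only shows the selection rule.

```r
# Toy long-format data: 'wt' is constant within id, 'time' and 'conc' are not
long <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  wt   = c(70, 70, 70, 82, 82),
  time = c(0, 1, 2, 0, 1),
  conc = c(0, 5.1, 3.2, 0, 4.4)
)

# A column is invariant if it takes a single distinct value within every id
is_invariant <- vapply(long, function(col) {
  all(tapply(col, long$id, function(v) length(unique(v)) == 1))
}, logical(1))

# Keep the invariant columns and the first row of each id
baseline <- long[!duplicated(long$id), is_invariant, drop = FALSE]
print(baseline)
```

Time-varying columns such as time and conc are dropped because no single value could represent them unambiguously for a subject.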
Here are six random rows of the resulting dataset:
## id study sex race ethnic
## 928 159 Xyz-ph3-07 Female White Hispanic or Latino
## 118 14 Xyz-hv-01 Male White Not Hispanic or Latino
## 442 50 Xyz-ri-02 Female Multiple or Other Not Hispanic or Latino
## 764 118 Xyz-ph3-07 Female White Not Hispanic or Latino
## 379 43 Xyz-ri-02 Female Asian Not reported
## 968 169 Xyz-ph3-07 Female White Not Hispanic or Latino
## hv form fasted age agecat wt ht
## 928 Patient Tablet Unknown 40 < 65 years 69.11780 162.6218
## 118 Healthy Subject Capsule Fed 32 < 65 years 94.09640 178.6365
## 442 Patient Tablet Fasted 79 = 65 years 46.29503 154.5257
## 764 Patient Tablet Unknown 37 < 65 years 63.69094 168.2729
## 379 Patient Tablet Fasted 67 = 65 years 75.17084 156.0623
## 968 Patient Tablet Unknown 69 = 65 years 51.20350 149.6104
## bmi bsa alb alp alt ast bili creat
## 928 26.13560 1.742997 41.39164 59.16552 20.20104 29.25946 5.831341 59.48539
## 118 29.48715 2.127224 42.39993 102.36809 29.20991 29.82011 15.139345 106.61792
## 442 19.38798 1.416590 37.22390 140.36920 11.99238 29.56350 8.078682 159.12966
## 764 22.49309 1.725678 41.66294 66.00227 10.89596 58.22654 6.864295 46.61096
## 379 30.86407 1.753188 36.73265 131.09500 17.25876 15.40761 10.866950 183.66983
## 968 22.87580 1.444328 35.25484 60.00767 23.62474 16.59708 8.345510 65.85870
## crcl fdarenal
## 928 121.26020 Normal
## 118 117.02707 Normal
## 442 18.52046 Severe
## 764 146.88099 Normal
## 379 31.17979 Moderate
## 968 57.60809 Moderate
Note that the categorical variables (which were numeric in the original .csv file) have been translated to factors, with the appropriate textual labels, and in the desired order. Compared to the corresponding R code that would be needed to achieve this, the YAML specification is much cleaner and more concise. Similarly for the label attributes.
Note on preserving label attributes: in most cases, subsetting a data.frame results in the label attributes being stripped away. The function subsetp() (‘p’ for preserve) can be used to avoid this. (It is used internally in one_row_per_id(), for instance.)
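The stripping behaviour is easy to reproduce in base R. The sketch below uses a made-up one-column data frame and then restores the label by hand; the manual loop illustrates the idea of preserving attributes and is not a description of how subsetp() works internally.

```r
# Made-up data frame with a 'label' attribute on its only column
df <- data.frame(wt = c(69.1, 94.1, 46.3))
attr(df$wt, "label") <- "Body Weight (kg)"

# Ordinary row subsetting drops the attribute from the column
sub <- df[df$wt > 50, , drop = FALSE]
label_after_subset <- attr(sub$wt, "label")
print(label_after_subset)

# Copy the labels back from the original columns, one column at a time
for (nm in names(sub)) {
  attr(sub[[nm]], "label") <- attr(df[[nm]], "label")
}
print(attr(sub$wt, "label"))
```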
By convention, it is required that all abbreviations appearing in a table (or figure) be spelled out in full in a footnote. The package contains a mechanism for generating such footnotes in a convenient, semi-automated way. It uses a higher-order function (i.e., a function that returns a new function) called make_abbrev_footnote(). It again uses a YAML file (or a simple list), in this case to specify a complete list of abbreviations that can be drawn from.
In this example, the current directory contains the file abbrevs.yaml, the contents of which are as follows:
ALP : alkaline phosphatase
ALT : alanine aminotransferase
AST : aspartate aminotransferase
BMI : body mass index
BSA : body surface area
CrCL : creatinine clearance
SD : standard deviation
CV : coefficient of variation
Max : maximum
Min : minimum
"N" : number of subjects
FDA : Food and Drug Administration
The meaning of this file is pretty self-explanatory.
To use this file, we pass its name to the function make_abbrev_footnote(), which returns a new function that now knows how to expand the abbreviations in the YAML file to generate a string that can be passed to the footnote argument of table1().
# Set up function for abbreviation footnotes
abbrev_footnote <- make_abbrev_footnote("abbrevs.yaml")
We can test it out:
## [1] "FDA=Food and Drug Administration; N=number of subjects."
Note that in the result (by default), the abbreviations are sorted alphabetically. Thus, the idea is to identify the abbreviations that appear in the table, and pass them as string arguments (in any order) to the newly created abbrev_footnote function.
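For readers unfamiliar with higher-order functions, the pattern can be sketched in a few lines of base R. The function make_footnote_fn() below is hypothetical and works from a plain named list; it mimics the behaviour described above but is not the package's make_abbrev_footnote().

```r
# Hypothetical stand-in: takes a named list of abbreviations, returns a function
make_footnote_fn <- function(abbrevs) {
  function(...) {
    keys <- sort(c(...))                       # sort the requested abbreviations
    stopifnot(all(keys %in% names(abbrevs)))   # every abbreviation must be defined
    pairs <- paste0(keys, "=", unlist(abbrevs[keys]))
    paste0(paste(pairs, collapse = "; "), ".")
  }
}

# The returned function remembers the list it was created with (a closure)
footnote_fn <- make_footnote_fn(list(
  FDA = "Food and Drug Administration",
  N   = "number of subjects",
  SD  = "standard deviation"
))

result <- footnote_fn("N", "FDA")
print(result)
```

The benefit of this design is that the full list of abbreviations is loaded once, while each table only names the few abbreviations it actually uses.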
To recap, so far our script has 3 lines of code:
# Read in the data from its 'spec'
dat <- read_from_spec("data_spec.yaml")
# Filter the data, one row per ID
dat <- one_row_per_id(dat, "id")
# Set up function for abbreviation footnotes
abbrev_footnote <- make_abbrev_footnote("abbrevs.yaml")
That is all we need to be able to start creating our tables!
The variables to include in the table are specified using a one-sided formula, with stratification denoted by conditioning (i.e., the name of the stratification variable appears to the right of a vertical bar).
In this case, the data contains three studies, and the descriptive statistics are presented stratified by study, and overall. In general, if there are multiple studies it makes sense to stratify by study, and if there is a single study, there is usually some other variable that it makes sense to stratify on, like treatment arm or cohort.
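The structure of such a formula can be inspected in base R without any package. The sketch below uses made-up variable names and only shows how the pieces on either side of the vertical bar can be pulled apart; it makes no claim about how table1() processes formulas internally.

```r
# One-sided formula: variables to summarize on the left of '|', stratifier on the right
f <- ~ sex + age | study

# A one-sided formula has two elements: the '~' operator and its right-hand side
rhs <- f[[2]]

# The right-hand side is a call to '|' with two arguments
operator   <- as.character(rhs[[1]])
row_vars   <- all.vars(rhs[[2]])   # variables that become table rows
strat_vars <- all.vars(rhs[[3]])   # variable that defines the columns

print(row_vars)
print(strat_vars)
```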
In this example, I have split the baseline characteristics into two tables by logical grouping, simply because there are too many of them to fit comfortably in a single table. The first table covers demographics and body size; the second covers laboratory values and renal function.
And here are the results:
table1(~ sex + race + ethnic + hv + age + agecat + wt + ht + bmi + bsa | study, data=dat,
    footnote=abbrev_footnote("BMI", "BSA", "SD", "Min", "Max", "N"))
 | Xyz-hv-01 (N=16) | Xyz-ri-02 (N=43) | Xyz-ph3-07 (N=111) | Overall (N=170) |
---|---|---|---|---|
Sex | ||||
Male | 16 (100%) | 22 (51.2%) | 50 (45.0%) | 88 (51.8%) |
Female | 0 (0%) | 21 (48.8%) | 61 (55.0%) | 82 (48.2%) |
Race | ||||
White | 16 (100%) | 19 (44.2%) | 62 (55.9%) | 97 (57.1%) |
Black or African American | 0 (0%) | 4 (9.3%) | 12 (10.8%) | 16 (9.4%) |
Asian | 0 (0%) | 10 (23.3%) | 18 (16.2%) | 28 (16.5%) |
American Indian or Alaskan Native | 0 (0%) | 4 (9.3%) | 5 (4.5%) | 9 (5.3%) |
Native Hawaiian or Other Pacific Islander | 0 (0%) | 2 (4.7%) | 3 (2.7%) | 5 (2.9%) |
Multiple or Other | 0 (0%) | 4 (9.3%) | 11 (9.9%) | 15 (8.8%) |
Ethnicity | ||||
Not Hispanic or Latino | 16 (100%) | 36 (83.7%) | 92 (82.9%) | 144 (84.7%) |
Hispanic or Latino | 0 (0%) | 4 (9.3%) | 10 (9.0%) | 14 (8.2%) |
Not reported | 0 (0%) | 3 (7.0%) | 9 (8.1%) | 12 (7.1%) |
Health Status | ||||
Healthy Subject | 16 (100%) | 0 (0%) | 0 (0%) | 16 (9.4%) |
Patient | 0 (0%) | 43 (100%) | 111 (100%) | 154 (90.6%) |
Age (y) | ||||
Mean (SD) | 35.4 (6.09) | 48.7 (18.1) | 47.4 (17.1) | 46.6 (17.0) |
Median (CV%) | 34.0 (17.2) | 45.0 (37.1) | 47.0 (36.0) | 43.5 (36.4) |
[Min, Max] | [28.0, 45.0] | [19.0, 79.0] | [20.0, 80.0] | [19.0, 80.0] |
Age Group | ||||
< 65 years | 16 (100%) | 34 (79.1%) | 88 (79.3%) | 138 (81.2%) |
≥ 65 years | 0 (0%) | 9 (20.9%) | 23 (20.7%) | 32 (18.8%) |
Body Weight (kg) | ||||
Mean (SD) | 76.7 (15.6) | 73.0 (15.6) | 69.7 (15.1) | 71.2 (15.3) |
Median (CV%) | 72.5 (20.3) | 74.2 (21.4) | 69.0 (21.7) | 71.4 (21.6) |
[Min, Max] | [53.1, 108] | [45.0, 102] | [35.9, 119] | [35.9, 119] |
Height (cm) | ||||
Mean (SD) | 178 (7.91) | 169 (12.0) | 169 (10.0) | 170 (10.7) |
Median (CV%) | 178 (4.5) | 169 (7.1) | 168 (5.9) | 170 (6.3) |
[Min, Max] | [165, 196] | [142, 192] | [146, 192] | [142, 196] |
BMI (kg/m²) | ||||
Mean (SD) | 24.2 (3.94) | 25.8 (5.26) | 24.3 (4.09) | 24.7 (4.42) |
Median (CV%) | 25.2 (16.3) | 25.4 (20.4) | 24.0 (16.8) | 24.3 (17.9) |
[Min, Max] | [17.4, 29.5] | [16.1, 39.5] | [15.6, 38.8] | [15.6, 39.5] |
BSA (m²) | ||||
Mean (SD) | 1.94 (0.212) | 1.82 (0.224) | 1.79 (0.223) | 1.81 (0.225) |
Median (CV%) | 1.87 (10.9) | 1.79 (12.3) | 1.77 (12.4) | 1.79 (12.4) |
[Min, Max] | [1.58, 2.41] | [1.32, 2.24] | [1.25, 2.32] | [1.25, 2.41] |
BMI=body mass index; BSA=body surface area; Max=maximum; Min=minimum; N=number of subjects; SD=standard deviation.
table1(~ alb + alp + alt + ast + bili + crcl + fdarenal | study, data=dat,
    footnote=abbrev_footnote("ALP", "ALT", "AST", "CrCL", "FDA", "SD", "Min", "Max", "N"))
 | Xyz-hv-01 (N=16) | Xyz-ri-02 (N=43) | Xyz-ph3-07 (N=111) | Overall (N=170) |
---|---|---|---|---|
Albumin (g/L) | ||||
Mean (SD) | 46.2 (4.74) | 41.4 (4.27) | 41.8 (4.18) | 42.1 (4.44) |
Median (CV%) | 45.6 (10.2) | 41.6 (10.3) | 41.6 (10.0) | 42.0 (10.5) |
[Min, Max] | [39.1, 55.9] | [33.2, 49.4] | [32.7, 51.3] | [32.7, 55.9] |
ALP (U/L) | ||||
Mean (SD) | 74.0 (39.4) | 82.6 (35.9) | 82.4 (33.1) | 81.7 (34.3) |
Median (CV%) | 55.5 (53.3) | 73.2 (43.5) | 75.6 (40.1) | 74.2 (42.0) |
[Min, Max] | [29.8, 157] | [31.5, 177] | [21.2, 186] | [21.2, 186] |
ALT (U/L) | ||||
Mean (SD) | 26.4 (16.3) | 20.1 (11.7) | 21.9 (15.4) | 21.9 (14.7) |
Median (CV%) | 23.2 (61.6) | 17.7 (58.3) | 18.8 (70.4) | 19.0 (67.1) |
[Min, Max] | [10.8, 79.0] | [6.36, 73.7] | [4.43, 104] | [4.43, 104] |
AST (U/L) | ||||
Mean (SD) | 28.5 (16.0) | 25.1 (12.9) | 24.5 (12.0) | 25.0 (12.6) |
Median (CV%) | 25.9 (56.2) | 22.7 (51.4) | 21.0 (48.9) | 21.7 (50.3) |
[Min, Max] | [8.11, 62.1] | [6.24, 65.4] | [6.89, 69.3] | [6.24, 69.3] |
Bilirubin (µmol/L) | ||||
Mean (SD) | 10.0 (3.98) | 8.49 (3.72) | 10.0 (5.13) | 9.63 (4.74) |
Median (CV%) | 9.68 (39.7) | 8.24 (43.8) | 8.66 (51.3) | 8.64 (49.2) |
[Min, Max] | [4.20, 15.3] | [3.07, 18.3] | [2.44, 32.6] | [2.44, 32.6] |
CrCL (mL/min) | ||||
Mean (SD) | 131 (31.1) | 49.0 (16.1) | 112 (42.3) | 97.7 (46.4) |
Median (CV%) | 120 (23.7) | 50.1 (32.9) | 103 (37.8) | 94.3 (47.5) |
[Min, Max] | [96.1, 221] | [18.5, 95.9] | [38.2, 233] | [18.5, 233] |
Renal Impairment, FDA Classification | ||||
Normal | 16 (100%) | 1 (2.3%) | 76 (68.5%) | 93 (54.7%) |
Mild | 0 (0%) | 8 (18.6%) | 25 (22.5%) | 33 (19.4%) |
Moderate | 0 (0%) | 30 (69.8%) | 10 (9.0%) | 40 (23.5%) |
Severe | 0 (0%) | 4 (9.3%) | 0 (0%) | 4 (2.4%) |
ALP=alkaline phosphatase; ALT=alanine aminotransferase; AST=aspartate aminotransferase; CrCL=creatinine clearance; FDA=Food and Drug Administration; Max=maximum; Min=minimum; N=number of subjects; SD=standard deviation.
For continuous variables, the arithmetic mean, standard deviation, median, coefficient of variation, minimum and maximum are displayed on 3 lines. For categorical variables, the frequency and percentage (within each column) are shown, one line per category. Missing values, if any, will also be shown as count and percent. The following rounding rules are applied:
Continuous and categorical variables can be mixed in the same table (note that age and age category are next to each other).
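The three-line summary for a continuous variable can be reproduced by hand in base R, which is useful for spot-checking a table. The sketch below uses made-up values and computes the coefficient of variation as 100 × SD / mean, which is consistent with the figures in the tables above. The rounding to 3 significant digits is a placeholder chosen for the illustration, not a statement of the package's rounding rules.

```r
# Made-up ages for a handful of subjects
x <- c(28, 31, 34, 36, 45)

# Coefficient of variation, in percent
cv_pct <- 100 * sd(x) / mean(x)

# Placeholder rounding for display only (an assumption, not the package's rules)
fmt <- function(v) format(signif(v, 3))

summary_lines <- c(
  sprintf("Mean (SD): %s (%s)", fmt(mean(x)), fmt(sd(x))),
  sprintf("Median (CV%%): %s (%s)", fmt(median(x)), fmt(cv_pct)),
  sprintf("[Min, Max]: [%s, %s]", fmt(min(x)), fmt(max(x)))
)
cat(summary_lines, sep = "\n")
```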
The above tables (including headings and footnotes) can be copied from the Chrome browser directly to a Word document (or Excel sheet). All formatting will be preserved (the one exception I have found is that in certain cases, superscripts or subscripts may be too small, so it might be necessary to reset the font size for the whole table to 10 pt). Pasting into PowerPoint works too, but does a poorer job of preserving the formatting, so it may be better to paste to a Word document first, and then copy that to PowerPoint (use the “Keep Source Formatting” option when pasting).
The complete example can be found on GitHub.
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] table1c_0.1 table1_1.2 rmarkdown_2.2 nvimcom_0.9-102
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.1 magrittr_1.5 htmltools_0.5.0 tools_4.0.1
## [5] yaml_2.2.1 stringi_1.4.6 knitr_1.28 Formula_1.2-3
## [9] stringr_1.4.0 xfun_0.14 digest_0.6.25 rlang_0.4.6
## [13] evaluate_0.14