Usage

pyDarwin may be executed locally, on Linux Grids, or as a combination of both (e.g., execute NONMEM models on grids and run the search locally).

Execution Overview

Running search on local machine

The darwin.run_search function executes the candidate search for the optimal population model.

python -m darwin.run_search <template_path> <tokens_path> <options_path>

To execute, call the darwin.run_search function and provide the paths to the following files as arguments:

  1. Template file (e.g., template.txt) - basic shell for NONMEM control files

  2. Tokens file (e.g., tokens.json) - JSON file describing the dimensions of the search space and the options in each dimension

  3. Options file (e.g., options.json) - JSON file describing the algorithm, run options, and post-run penalty code configurations.

See Required Files for additional details.

Alternatively, you may execute the darwin.run_search_in_folder function, specifying the path to the folder containing the template.txt, tokens.json, and options.json files as a single argument:

python -m darwin.run_search_in_folder <folder_path>

Note

Files must be named as template.txt, tokens.json, and options.json when using darwin.run_search_in_folder.
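
The two commands above can also be launched from a Python script. The following is a minimal sketch that only assembles the documented command lines; the helper names (search_command, folder_search_command) are hypothetical and are not part of pyDarwin.

```python
import sys
from pathlib import Path

# File names required by darwin.run_search_in_folder (see the note above).
REQUIRED = ("template.txt", "tokens.json", "options.json")

def search_command(template_path, tokens_path, options_path):
    """Command line for `python -m darwin.run_search` with the three file paths."""
    return [sys.executable, "-m", "darwin.run_search",
            str(template_path), str(tokens_path), str(options_path)]

def folder_search_command(folder_path):
    """Command line for `python -m darwin.run_search_in_folder`.

    Raises FileNotFoundError if a required file name is missing from the folder.
    """
    folder = Path(folder_path)
    missing = [name for name in REQUIRED if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {folder}: {', '.join(missing)}")
    return [sys.executable, "-m", "darwin.run_search_in_folder", str(folder)]

# Usage (hypothetical folder name):
#   import subprocess
#   subprocess.run(folder_search_command("my_project"), check=True)
```

Checking the file names before launching is a choice of this sketch; it simply surfaces a naming mistake before the search starts.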

Stopping Execution

A running search can be stopped using the following command:

python -m darwin.stop_search [-f] <project dir>|<options file>

You need to provide the path to the project folder or to the options file associated with the search you want to stop. The optional -f flag specifies that the search must be stopped immediately. If it is not set, the search will stop after the current model runs are finished.

Warning

Do not force-stop a GP search during the ask stage. Either wait for the stage to finish ("Done asking" appears in the console output and/or messages.txt) or stop without the -f flag.

Note

models.json will contain all model runs finished before interruption.
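
For scripted use, the stop command can be assembled the same way. This is a sketch under the assumption that the command is invoked exactly as documented above; stop_command is a hypothetical helper name, not a pyDarwin function.

```python
import sys

def stop_command(target, force=False):
    """Command line for `python -m darwin.stop_search [-f] <target>`.

    target: path to the project folder or to the options file of the search.
    force:  add -f to stop immediately instead of waiting for running models.
    """
    cmd = [sys.executable, "-m", "darwin.stop_search"]
    if force:
        cmd.append("-f")
    cmd.append(str(target))
    return cmd

# Usage (hypothetical path):
#   import subprocess
#   subprocess.run(stop_command("my_project"), check=True)
```

Leaving force at its default of False matches the safer behaviour described above, where running models are allowed to finish.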

Execution on Linux Grids

The following requirements should be met in order to execute pyDarwin on Linux Grids.

  • You must have access to the grid system (e.g., you are able to connect to the system via terminal session).

  • You must make pyDarwin installation available for all grid nodes.

  • Your search project must be available for all grid nodes as well.

  • You should be familiar with your grid controller commands (e.g., how to submit a job, query finished jobs, and delete jobs).

  • You should be familiar with regular expressions e.g., for usage in "submit_job_id_re" and "poll_job_id_re" fields in options.json.
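
As an illustration of the kind of regular expression involved, the sketch below extracts a job ID from the text a grid controller might print after a job is submitted. The sample output line and the pattern are assumptions chosen for illustration only; check the actual output of your own grid controller before writing a value for "submit_job_id_re".

```python
import re

# Hypothetical text printed by a grid submit command (an assumption, not pyDarwin output).
submit_output = 'Your job 12345 ("NM_2_3") has been submitted'

# Candidate pattern: the first capture group holds the job ID.
submit_job_id_re = r"Your job (\d+) \("

match = re.search(submit_job_id_re, submit_output)
job_id = match.group(1) if match else None
print(job_id)
```

Testing a pattern against real controller output in this way, before starting a long search, is a cheap way to catch a regular expression that never matches.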

Note

If all grid nodes share the same file system, you can simply deploy pyDarwin in your home directory (always use virtual environment!).

There are two ways to utilize grids for search in pyDarwin:

  1. Run search locally, submit individual model runs to the grid (local search, grid model runs).

  2. Submit search to the grid, as well as all the model runs (grid search, grid model runs).

In both cases you need to set up the grid settings in your options.json.

In either case, you can stop the search using darwin.stop_search. Just keep in mind that in the second case, it may not be very responsive (due to load/IO latency/grid deployment details), so be patient.

Note

Although it’s possible to submit a “local search with local model runs” to the grid, this is not suggested.

Search Info

python -m darwin.search_info <folder_path>

This command loads the search folder and shows the summary that looks like this:

[01:11:10] Changing directory to c:\workspace\fruitfly\examples\NONMEM\user\Example2
[01:11:10] Options file found at options.json
[01:11:10] Loading system options: c:\workspace\fruitfly\examples\user\options.json
[01:11:10] Template file found at template.txt
[01:11:10] Tokens file found at tokens.json
[01:11:10] Algorithm: GP
[01:11:10] Engine: NONMEM
[01:11:10] random_seed: 11
[01:11:10] Project dir: c:\workspace\fruitfly\examples\NONMEM\user\Example2
[01:11:10] Data dir: c:\workspace\fruitfly\examples\NONMEM\user\Example2
[01:11:10] Project working dir: C:\Users\jcook\pydarwin\Example2
[01:11:10] Project temp dir: C:\Users\jcook\pydarwin\Example2\temp
[01:11:10] Project output dir: C:\Users\jcook\pydarwin\Example2\output
[01:11:10] Key models dir: C:\Users\jcook\pydarwin\Example2\key_models
[01:11:10] Search space size: 12960
[01:11:10] Estimated number of models to run: 454

Required Files

The same 3 files are required for any search, whether EX, GA, GP, RF, GBRT, or PSO. Which algorithm is used is defined in the options file. The template file serves as a framework and looks similar to a NONMEM/NMTRAN control file. The tokens file specifies the range of “features” to be searched, and the options file specifies the algorithm, the fitness function, any R or Python code to be executed after the NONMEM execution, and other options related to execution. See Options List.

Template File

The template file is a plain ASCII text file. This file is the framework for the construction of the NONMEM control files. Typically, the structure will be quite similar to a NONMEM control file, with the usual blocks, e.g., $PROB, $INPUT, $DATA, $SUBS, $PK, $ERROR, $THETA, $OMEGA, $SIGMA, $EST. However, this format is completely flexible and entire blocks may be missing from the template file (to be provided from the tokens file).

Note

NONMEM does not allow the data set path ($DATA) to be longer than 80 characters and the path must be in quotes if it contains spaces (see Data Directory).
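
The two constraints in the note can be checked before a search is started. The helper below is a minimal sketch; nonmem_data_path is a hypothetical name, and whether the surrounding quotes count toward the 80-character limit is not stated here, so the sketch assumes they do not.

```python
def nonmem_data_path(path, max_len=80):
    """Return the path as it could be written after $DATA.

    Quotes are added when the path contains spaces. Assumption: only the
    unquoted path length is compared with the limit.
    """
    if len(path) > max_len:
        raise ValueError(f"data path is {len(path)} characters, limit is {max_len}")
    return f'"{path}"' if " " in path else path
```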

The difference between a standard NONMEM control file and the template file is that the user will define code segments in the template file that will be replaced by other text. These code segments are referred to as “token keys”. Token keys come in sets, and in most cases, several token keys will need to be replaced together to generate syntactically correct code. The syntax for a token key in the template file is:

{Token_stem[N]}

Where Token_stem is a unique identifier for that token group and N is the index of the token text (within the selected token set) that is to be substituted. An example is instructive.

Example:

Assume the user would like to consider 1 compartment (ADVAN1) or 2 compartment (ADVAN3) models as a dimension of the search. The relevant template file for this might be:

$SUBS {ADVAN[1]}
.
.
$PK
.
.
.
{ADVAN[2]}
.
.
.
$THETA
(0,1) ; Volume - fixed THETA - always appears
(0,1) ; Clearance - fixed THETA - always appears
{ADVAN[3]}

Note that tokens nearly always come in sets, because in nearly all cases several substitutions must be made to create correct syntax. For a one compartment model, the following substitutions would be made:

{ADVAN[1]} -> ADVAN1
{ADVAN[2]} -> ;; 1 compartment, no definition needed for K12 or K21
{ADVAN[3]} -> ;; 1 compartment, no initial estimate needed for K12 or K21

and for 2 compartment:

{ADVAN[1]} -> ADVAN3
{ADVAN[2]} -> K12 = THETA(ADVANA) ;; 2 compartment, need definition for K12 \n K21 = THETA(ADVANB)
{ADVAN[3]} -> (0,0.5) ;; K12 THETA(ADVANA)  \n  (0,0.5) ;; K21 THETA(ADVANB)

Where \n is the new line character. These sets of tokens are called token sets (2 token sets in this example: one for ADVAN1, one for ADVAN3). The group of token sets is called a token group. In this example, “ADVAN” is the token key. Each token group must have a unique token key. For the first set of options, the text “ADVAN1” is referred to as the token text. Each token set consists of key-text pairs: token keys (described above) and token text.

The token (consisting of “{” + token stem +[N] + “}” where N is an integer specifying which token text in the token set is to be substituted) in the template file is replaced by the token text, specified in the tokens file. Which set of token key-text pairs is substituted is determined by the search algorithm and provided in the phenotype.
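
The substitution described above can be sketched in a few lines of Python. This is an illustration of the idea only, not pyDarwin's implementation: the substitute function and the selection dictionary (which stands in for the phenotype) are hypothetical, and the template is a shortened version of the ADVAN example.

```python
import re

template = (
    "$SUBS {ADVAN[1]}\n$PK\n{ADVAN[2]}\n$THETA\n"
    "(0,1) ; Volume\n(0,1) ; Clearance\n{ADVAN[3]}\n"
)

tokens = {
    "ADVAN": [
        ["ADVAN1",
         ";; 1 compartment, no definition needed for K12 or K21",
         ";; 1 compartment, no initial estimate needed for K12 or K21"],
        ["ADVAN3",
         "K12 = THETA(ADVANA) ;; 2 compartment, need definition for K12 \n K21 = THETA(ADVANB)",
         "(0,0.5) ;; K12 THETA(ADVANA)  \n  (0,0.5) ;; K21 THETA(ADVANB)"],
    ]
}

# Matches {stem[N]} and captures the stem and the index N.
TOKEN_RE = re.compile(r"\{(\w+)\[(\d+)\]\}")

def substitute(template, tokens, selection):
    """Replace every {stem[N]} with the N-th text of the selected token set.

    selection maps each token stem to the 0-based index of the chosen token set.
    """
    def repl(match):
        stem, n = match.group(1), int(match.group(2))
        token_set = tokens[stem][selection[stem]]
        return token_set[n - 1]  # N in {stem[N]} is 1-based
    return TOKEN_RE.sub(repl, template)

# Select the second token set (2 compartments).
control = substitute(template, tokens, {"ADVAN": 1})
print(control)
```

Note that THETA(ADVANA) and THETA(ADVANB) are still text indices after this step; they are numbered only once the full control file is known.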

Note that the THETA (and ETA and EPS) indices cannot be determined until the final control file is defined, as THETAs may be included in one token set but missing in another. For this reason, all fixed initial estimates in the $THETA block MUST occur before the THETA values that are not fixed (i.e., are searched). This is so the algorithm can parse the resulting file and correctly calculate the appropriate THETA (and ETA and EPS) indices. Further, the text string index in the token (e.g., ADVANA and ADVANB) must be unique across the token groups. The most convenient way to ensure this is to use the token stem as the THETA index (e.g., THETA(ADVAN) if the token stem is ADVAN). Additional characters (e.g., ADVANA, ADVANB) can be added if multiple THETA text indices are needed. Note that the permitted syntax for residual error is EPS() or ERR().

Special notes on structure of $THETA/$OMEGA/$SIGMA:

Parameter initial estimate blocks require special treatment. A template file will typically include 2 types of initial estimates:

  1. Fixed initial estimates - Initial estimates that are not searched, but will be copied from the template into ALL control files. These are the typical $THETA estimates, e.g.: (0,1) ; THETA(1) Clearance.

  2. Searched initial estimates - Initial estimates that are specified in tokens that may or may not be in any given control file, e.g., {ALAG[2]} where the text for the ALAG[2] token key is “(0,1) ;; THETA(ALAG) Absorption lag time”

Note

Fixed initial estimates MUST be placed before searched initial estimates

pyDarwin automatically determines the correct indices for any THETA/OMEGA/SIGMA elements that are part of the search options. In order to correctly number these, it first must determine the number of “fixed” (i.e., present in all models, not searched) elements for each. For this reason, the fixed elements of THETA/OMEGA/SIGMA must come before any searched elements in the $THETA/$OMEGA/$SIGMA blocks. pyDarwin counts the number of fixed initial estimates in (for example) the $THETA block, then starts numbering the searched THETAs with the next consecutive number. As is the case for NONMEM, the indices for fixed THETAs are determined entirely by their position in the $THETA block. That is, if the $THETA block is:

$THETA
0.1
0.1
0.1

These will be THETA(1) to THETA(3), and pyDarwin will start numbering searched THETAs at index 4. Errors would occur if pyDarwin simply counted the number of THETA/ETA/EPS values in the $PK, $ERROR, $MIX, $AES, and $DES blocks if, for example, fixed THETAs were used in $THETA and they did not appear in $PK. Therefore, the sequencing of THETA/ETA/EPS indices is based on the values and positions in the initial values blocks. In addition, to correctly number the elements, pyDarwin needs a little more help finding the correct indices. Specifically, comments (text after a ‘;’) in the $THETA block and in THETA initial estimate tokens MUST be used. If this text is not present and more than one initial estimate is used in a token set (e.g., CL=THETA(VMAX)*CONC/(THETA(KM)+CONC)), pyDarwin would not know which initial estimate is to be associated with THETA(VMAX) and which with THETA(KM).

Generally, it is less confusing to have a separate line for each initial estimate, with a comment for that initial estimate. However, multiple initial estimates can be put on a single line, with multiple ‘;’ separating the defining text (please ensure that you are following naming conventions outlined below).

Specifically, there are 3 ways to define a variable name (ETA/THETA/OMEGA):

  1. ; NAME (any amount of spaces before and after name)

  2. ; any text ETA(NAME) any text (no spaces between ETA and name or around name)

  3. ; any text ETA <on|ON> NAME any text (exactly one space between the words)

Any combination of those in one line must work:

<some complex definition> ; name1 ; name2 ; also ETA(name3) ; be aware that numbers count as well: ETA(4)

Here we have 4 variables (ETAs) which can be referred to as ETA 1 to 4. Keep in mind that every ETA(name) in the model text is replaced with ETA(<number>), even inside the definition block.

If this is not what you want, you may define it using another notation, or add something to the comment:

D = ETA(D1)*ETA(C)*ETA(A)*ETA(D2)
$OMEGA
0.1 ; ETA(D1)
0.1 ; A
0.1 ; ETA ON C
0.1 ; ETA(D2) D2 or ETA ON D2 or any other way that doesn't look like another definition

Which then becomes:

D = ETA(1)*ETA(3)*ETA(2)*ETA(4)
$OMEGA
0.1 ; ETA(1)
0.1 ; A
0.1 ; ETA ON C
0.1 ; ETA(4) D2 or ETA ON D2
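
The numbering in this example can be reproduced with a short sketch. It is a simplified illustration of the three naming conventions listed above, not pyDarwin's parser; name_from_comment and number_variables are hypothetical names, and the sketch assumes that each ';'-separated comment segment defines at most one variable.

```python
import re

def name_from_comment(comment, kind="ETA"):
    """Return the variable name defined by one comment segment, or None."""
    m = re.search(rf"\b{kind}\((\w+)\)", comment)          # ; any text ETA(NAME) any text
    if m:
        return m.group(1)
    m = re.search(rf"\b{kind} (?:on|ON) (\w+)", comment)   # ; any text ETA ON NAME any text
    if m:
        return m.group(1)
    m = re.fullmatch(r"\s*(\w+)\s*", comment)              # ; NAME
    return m.group(1) if m else None

def number_variables(model_text, block_lines, kind="ETA", first_index=1):
    """Assign consecutive indices to named variables and rewrite the model text.

    first_index would be (number of fixed elements + 1) when fixed
    initial estimates precede the searched ones.
    """
    index = {}
    next_index = first_index
    for line in block_lines:
        # Everything after the first ';' is comment text, split into segments.
        for segment in line.split(";")[1:]:
            name = name_from_comment(segment, kind)
            if name is not None:
                index[name] = next_index
                next_index += 1

    def repl(match):
        name = match.group(1)
        return f"{kind}({index[name]})" if name in index else match.group(0)

    return re.sub(rf"\b{kind}\((\w+)\)", repl, model_text), index

omega_lines = [
    "0.1 ; ETA(D1)",
    "0.1 ; A",
    "0.1 ; ETA ON C",
    "0.1 ; ETA(D2) D2 or ETA ON D2",
]
new_text, index = number_variables("D = ETA(D1)*ETA(C)*ETA(A)*ETA(D2)", omega_lines)
print(new_text)
```

The order of the lines in the initial estimate block, not the order of appearance in the model text, determines the indices, which is why ETA(C) becomes ETA(3) here.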

Parentheses with (lower bound, initial value, upper bound) may also be used, as illustrated below:

$PK
D = THETA(1)*THETA(3)*THETA(2)*THETA(4)
$THETA
(0,0.1,10) ; THETA(1)
(0,0.1) ; THETA(A)
(0.1) ; THETA(3)
0.1 ; THETA(4)

Tokens File

The tokens file provides a dictionary (as a JSON file) of token key-text pairs. The highest level of the dictionary is the token group. Token groups are defined by a unique token stem, a text string that corresponds to the token key that appears in the template file. The token stem also typically serves as the key in the token key-text pairs. The second level in the tokens dictionary is the token sets. In the template file the tokens are indexed (e.g., ADVAN[1]), as typically multiple token keys will need to be replaced by text to create correct syntax. For example, if the search is for 1 compartment (ADVAN1) vs 2 compartment (ADVAN3), then for ADVAN3, definitions of K12 and K21 must be provided in the $PK block, and (typically) initial estimates must be provided in the $THETA block. Thus, a set of 3 replacements must be made: one in $SUBS, one in $PK, and one in $THETA. So, the token group for selection of the number of compartments, with 1 compartment (first option) or 2 compartments (second option), will include the following JSON code:

"ADVAN": [
            ["ADVAN1 ",
                ";; 1 compartment, no definition needed for K12 or K21 ",
                ";; 1 compartment, no initial estimate needed for K12 or K21"
            ],
            ["ADVAN3 ",
                " K12 = THETA(ADVANA) ;; 2 compartment, need definition for K12 \n K21 = THETA(ADVANB)",
                "  (0,0.5) ;; K12 THETA(ADVANA)  \n  (0,0.5) ;; K21 THETA(ADVANB) "
            ]
        ]

Note that specification of the current model as one compartment or two is done by the search algorithm and provided in the model phenotype.
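
A tokens file can be inspected with the standard json module before a search is started. The sketch below wraps the ADVAN group in an object and summarizes it; check_token_groups is a hypothetical helper, and the requirement that all token sets in a group have the same length is a consistency check chosen for this sketch rather than a rule stated by pyDarwin.

```python
import json

# The ADVAN token group from above, wrapped in a JSON object.
# A raw string keeps "\n" as a JSON escape rather than a Python newline.
tokens_json = r'''
{
    "ADVAN": [
        ["ADVAN1 ",
            ";; 1 compartment, no definition needed for K12 or K21 ",
            ";; 1 compartment, no initial estimate needed for K12 or K21"
        ],
        ["ADVAN3 ",
            " K12 = THETA(ADVANA) ;; 2 compartment, need definition for K12 \n K21 = THETA(ADVANB)",
            "  (0,0.5) ;; K12 THETA(ADVANA)  \n  (0,0.5) ;; K21 THETA(ADVANB) "
        ]
    ]
}
'''

tokens = json.loads(tokens_json)

def check_token_groups(tokens):
    """Return {stem: (number of token sets, token texts per set)}.

    Raises ValueError when the token sets of one group differ in length.
    """
    summary = {}
    for stem, token_sets in tokens.items():
        lengths = {len(token_set) for token_set in token_sets}
        if len(lengths) != 1:
            raise ValueError(f"token sets of {stem!r} differ in length: {sorted(lengths)}")
        summary[stem] = (len(token_sets), lengths.pop())
    return summary

summary = check_token_groups(tokens)
print(summary)
```

The number of token sets per group is the size of that dimension of the search space.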

A diagram of the token structure is given below

[Figure: diagram of the token structure (_images/tokens.png)]

Note the “nested token”, a token (“{K23~WT[1]}”) within a token, circled in red. Any number of levels of nested tokens is permitted (but the logic becomes very difficult with more than one). pyDarwin will first substitute the full text into the template, then scan the resulting text again. The nested token will then be found and the text from the {K23~WT[1]} token set will be substituted.
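
The substitute-then-rescan behaviour can be illustrated with a small loop. This is a sketch of the idea, not pyDarwin's code: substitute_nested, the selection dictionary, the max_passes guard, and the token texts below are all hypothetical.

```python
import re

# Stems may contain "~", as in the K23~WT token mentioned above.
TOKEN_RE = re.compile(r"\{([\w~]+)\[(\d+)\]\}")

def substitute_nested(text, tokens, selection, max_passes=10):
    """Substitute tokens repeatedly until none remain in the text.

    selection maps each token stem to the 0-based index of the chosen token set.
    max_passes guards against token texts that reference each other in a cycle.
    """
    def repl(match):
        stem, n = match.group(1), int(match.group(2))
        return tokens[stem][selection[stem]][n - 1]

    passes = 0
    while TOKEN_RE.search(text):
        if passes == max_passes:
            raise RuntimeError("tokens remain after max_passes; possible circular nesting")
        text = TOKEN_RE.sub(repl, text)
        passes += 1
    return text

# Hypothetical token texts: the second ADVAN text contains a nested token.
tokens = {
    "ADVAN": [["ADVAN4", "K23 = THETA(ADVANA){K23~WT[1]}"]],
    "K23~WT": [[""], ["*WT**THETA(K23~WT)"]],
}
selection = {"ADVAN": 0, "K23~WT": 1}
result = substitute_nested("$PK\n{ADVAN[2]}\n", tokens, selection)
print(result)
```

The first pass inserts the ADVAN text, which exposes {K23~WT[1]}; the second pass replaces it, and the loop ends when no token is left.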

Several notes:

  1. The token stem is “ADVAN” and identifies the token group. This stem must be unique in the token groups. The token stem also typically serves as the token key in the token key-value pairs. In this example, three replacements must be made in the template, in $SUBS, $PK, and $THETA. In the template file, these will be coded as {ADVAN[1]}, {ADVAN[2]}, and {ADVAN[3]}. Note the curly braces: these are required in the template, but not in the tokens file. The indices correspond to the indices of the tokens in the token set. In this case there are 3 token key-value pairs in each token set. There may be additional unused tokens (as may be the case with nested tokens), but each token in the template file must have a corresponding token key-value pair in the tokens file. There are 2 token sets in this token group, one coding for ADVAN1 and one coding for ADVAN3.

  2. New lines in JSON files are ignored. To code a new line, enter the newline escape character “\n”. Similarly, a tab is coded as “\t”.

  3. Comments are not permitted in JSON files. However, comments for the generated NMTRAN control file may be included with the usual syntax “;”.

  4. There is no dependency on the sequence of token sets in the file. Any order is acceptable, and they need not be in the same order as they appear in the template file.

  5. All other JSON rules apply.

Special note on initial estimates

In order to parse the text in the initial estimates blocks (THETA, OMEGA, and SIGMA) the user MUST include token stem text as a NONMEM/NMTRAN comment (i.e., after “;”). There is no other way to identify which initial estimates are to be associated with which THETA. For example, if a token stem has two THETAs:

Effect = THETA(EMAX) * CONC/(THETA(EC50) + CONC)

for the text in the $PK block, then code to be put into the $THETA block will be:

"  (0,100) \t; THETA(EMAX) "
"  (0,1000) \t; THETA(EC50) "

Where \t is a tab. Without THETA(EMAX) and THETA(EC50) in the comments, there would not be any way to identify which initial estimate is to be associated with which THETA. Note that NONMEM assigns THETAs by sequence of appearance in $THETA. Given that the actual indices for THETA cannot be determined until the control file is created, relying on that sequence alone would lead to ambiguity. Each initial estimate must be on a new line and include the THETA (or ETA or EPS) + parameter identifier.

Options File

The options file is a JSON file with key-value pairs specifying various options for executing pyDarwin. Some fields are mandatory, some are algorithm-specific, and others are only relevant for execution on Linux Grids.

See Options List for details.

Searching Omega Structure

In addition to specifying relations inside the template file and tokens file to define the search space, you may also search for different structures of the omega matrix given fields specified in options.json.

Note

Omega structure alone can be searched without any tokens for compartments, covariates, etc. If searching Omega submatrices, options for Omega band/block search must be additionally specified.

Omega structure is encoded by a set of separate genes: one of the genes represents the omega block pattern, another one is for the band width (only applicable to NONMEM models). The pattern is the index of a valid pattern composed by pyDarwin.

In case of independent omega search, the set is repeated as many times as the number of search blocks in the template.

Valid patterns are created based on the maximum omega search block length and maximum size of submatrices (specified through max_omega_sub_matrix, see Omega Submatrices Search for details) if applicable. For example, for search_block(A, B, C, D, E) and max_omega_sub_matrix = 4, pyDarwin will consider the following 16 patterns:

()
(A B C D E)
(A B)
(A B) (C D)
(A B) (C D E)
(A B) (D E)
(A B C)
(A B C) (D E)
(A B C D)
(B C)
(B C) (D E)
(B C D)
(B C D E)
(C D)
(C D E)
(D E)

Here the empty pattern, (), means there is no block Omega (i.e., everything is diagonal), and the variables enclosed by the parentheses are the ones whose associated covariance matrix (Omega) is block (that is, for each pattern, only those variables whose Omega matrix is block are listed). For NONMEM models without submatrix search, the empty pattern is substituted with an extra value for the band width gene (= 0).
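
The 16 patterns listed above can be reproduced by a short enumeration. The sketch below is an interpretation inferred from that list, not pyDarwin's algorithm: every variable is either left diagonal or joined with its neighbours into a contiguous block of 2 to max_omega_sub_matrix variables, and the full block is always added as one more option. The function names are hypothetical.

```python
def omega_patterns(names, max_sub):
    """Return the set of block patterns for the given variable names.

    Each pattern is a tuple of blocks; each block is a tuple of adjacent names.
    Variables not listed in any block are diagonal.
    """
    names = list(names)
    n = len(names)
    found = set()

    def extend(start, blocks):
        if start == n:
            found.add(tuple(blocks))
            return
        extend(start + 1, blocks)  # names[start] stays diagonal
        for size in range(2, min(max_sub, n - start) + 1):
            block = tuple(names[start:start + size])
            extend(start + size, blocks + [block])

    extend(0, [])
    found.add((tuple(names),))  # the full block appears in the list even when max_sub < n
    return found

def format_pattern(pattern):
    """Format a pattern in the notation used above, e.g. (A B) (D E)."""
    return " ".join("(" + " ".join(block) + ")" for block in pattern) or "()"

patterns = sorted(format_pattern(p) for p in omega_patterns("ABCDE", 4))
for text in patterns:
    print(text)
```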

The number of patterns for different combinations of max_omega_search_len (values listed in the first column) and max_omega_sub_matrix (values listed in the first row) can be found in the table below.

Number of patterns (rows: max_omega_search_len, columns: max_omega_sub_matrix)

len\sub     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
      2     2     2     2     2     2     2     2     2     2     2     2     2     2     2     2
      3     4     4     4     4     4     4     4     4     4     4     4     4     4     4     4
      4     6     8     8     8     8     8     8     8     8     8     8     8     8     8     8
      5     9    14    16    16    16    16    16    16    16    16    16    16    16    16    16
      6    14    25    30    32    32    32    32    32    32    32    32    32    32    32    32
      7    22    45    57    62    64    64    64    64    64    64    64    64    64    64    64
      8    35    82   109   121   126   128   128   128   128   128   128   128   128   128   128
      9    56   150   209   237   249   254   256   256   256   256   256   256   256   256   256
     10    90   275   402   465   493   505   510   512   512   512   512   512   512   512   512
     11   145   505   774   913   977  1005  1017  1022  1024  1024  1024  1024  1024  1024  1024
     12   234   928  1491  1794  1937  2001  2029  2041  2046  2048  2048  2048  2048  2048  2048
     13   378  1706  2873  3526  3841  3985  4049  4077  4089  4094  4096  4096  4096  4096  4096
     14   611  3137  5537  6931  7618  7937  8081  8145  8173  8185  8190  8192  8192  8192  8192
     15   988  5769 10672 13625 15110 15809 16129 16273 16337 16365 16377 16382 16384 16384 16384
     16  1598 10610 20570 26785 29971 31490 32193 32513 32657 32721 32749 32761 32766 32768 32768
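
The tabulated values are consistent with a simple counting rule, sketched below. This rule is an observation inferred from the numbers, not a description of pyDarwin's implementation: count the ways to split the search block into consecutive runs of length 1 to max_omega_sub_matrix (a run of length 1 is a diagonal element), and add one for the full block when max_omega_sub_matrix is smaller than the search length. The function name count_patterns is hypothetical.

```python
def count_patterns(search_len, max_sub):
    """Number of omega patterns for a search block of search_len variables."""
    # ways[i] = number of ways to split the first i variables into
    # consecutive runs of length 1..max_sub
    ways = [1] + [0] * search_len
    for i in range(1, search_len + 1):
        ways[i] = sum(ways[i - size] for size in range(1, min(max_sub, i) + 1))
    # The full block is offered as an extra pattern when it exceeds max_sub.
    extra = 1 if max_sub < search_len else 0
    return ways[search_len] + extra

print(count_patterns(5, 4))
```

Because the count roughly doubles with each additional variable in the search block, limiting max_omega_sub_matrix is an effective way to keep the omega part of the search space small.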

By default, pyDarwin will try to search omega structure for each search block/band individually. This is only possible if all search blocks are placed in the template. If any search block is found in the tokens, the omega search will be performed uniformly, i.e. all search blocks will have the same pattern. Individual omega search will further increase the search space size. It can be turned off by setting individual_omega_search to false.

pyDarwin Outputs

Console output

After the search command is submitted, pyDarwin first verifies that the following files and executables are available:

  1. The template file

  2. The tokens file

  3. The options file

  4. nmfe??.bat - executes NONMEM

  5. The data file(s) for the first control file that is initiated

  6. If post run R code is requested, Rscript.exe

The startup output also lists the location of:

  1. Data dir - folder where datasets are located. It is recommended that this be an absolute path

  2. Project working dir - folder where template, token and options files are located, this is not set by the user

  3. Project temp dir - root folder where model file will be found, if the option is not set to remove them

  4. Project output dir - folder where all the results files will be put, such as results.csv and Final* files

  5. Where intermediate output will be written (e.g., u:/user/example2/output/results.csv)

  6. Where models will be saved (e.g., u:/user/example2/working/models.json)

  7. NMFE??.bat (Windows) or nmfe?? (Linux) file

  8. Rscript.exe, if used

pyDarwin provides verbose output about whether individual models have executed successfully.

A typical line of output might be:

[16:22:11] Iteration = 1, Model     1,       Done,    fitness = 123.34,    message =  No important warnings

The columns in this output are:

[Time of completion] Iteration = Iteration/generation, Model     Model Number,       Final Status,    fitness = fitness/reward,    message =  Messages from NMTRAN
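
Lines of this form can be split into fields with a regular expression, for example to monitor a search from messages.txt. The sketch below is based only on the single sample line shown above; the exact spacing and the set of possible status and fitness values are assumptions, and parse_line is a hypothetical helper.

```python
import re

LINE_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+"
    r"Iteration = (?P<iteration>[^,]+),\s+"
    r"Model\s+(?P<model>\d+),\s+"
    r"(?P<status>[^,]+),\s+"
    r"fitness = (?P<fitness>[^,]+),\s+"
    r"message =\s+(?P<message>.*)"
)

def parse_line(line):
    """Return a dict of fields for a model result line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    record = m.groupdict()
    record["model"] = int(record["model"])
    # Assumes the fitness field is numeric, as in the sample line.
    record["fitness"] = float(record["fitness"])
    return record

sample = ("[16:22:11] Iteration = 1, Model     1,       Done,    "
          "fitness = 123.34,    message =  No important warnings")
record = parse_line(sample)
print(record)
```

The iteration field is kept as text because iteration labels are not necessarily plain integers.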

If there are messages from NONMEM execution, these will also be written to the console. The console also reports whether NONMEM execution failed and, if post-run R code was requested, whether R execution failed.

If remove_temp_dir is set to false, the NONMEM control file, output file, and other key files can be found in {temp_dir}/Iteration/Model Number for debugging.

File output

The file output from pyDarwin is generated in real time. That is, as soon as a model is finished, the results are written to the results.csv and models.json files. Similarly, messages (what appears on the console output) are written continuously to the messages.txt file.

messages.txt

The messages.txt file will be found in the working directory. This file’s content is the same as the console output.

models.json

The models.json will contain the key output from all models that are run. This is not a very user-friendly file, as it is fairly complex json. The primary (maybe only) use for this file is if a search is interrupted, it can be restarted, and the contents of this file read in, rather than rerunning all the models. If the goal is to make simple diagnostics of the search progress, the results.csv file is likely more useful.

results.csv

The results.csv file contains key information about all models that are run in a more user-friendly format. This file can be used to make plots to monitor progress of the search or to identify models that had unexpected results (crashes).

File Structure and Naming

NONMEM control, executable, and output file naming:

Saving NONMEM outputs

NONMEM generates a great deal of file output. For a search of perhaps up to 10,000 models, this can become an issue for disc space. By default, key NONMEM output files are retained. Most temporary files (e.g., FDATA, FCON) and the temp_dir are always removed to save disc space. In addition, the data file(s) are not copied to the run directory, but all models use the same copy of the data file(s).

Model/folder naming

A model stem is generated from the current generation/iteration and model number of the form NM_generation_model_num. For example, if this is iteration 2, model 3, the model stem would be NM_2_3. For the 1 bit downhill, the model stem is NM_generationDdownhillstep_modelnum, and for the 2 bit local search the model stem is NM_generationSdownhillstepSearchStep_modelnum. Final downhill model stem is NM_FNDDownhillStep_ModelNum. This model stem is then used to name the .exe file, the .mod file, the .lst file, etc. This results in unique names for all models in the search. Models are also frequently duplicated. Duplicated files are not rerun, and so those will not appear in the file structure.
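
The basic naming rule can be written as a one-line helper. This sketch covers only the NM_generation_model_num form described above (it does not attempt the downhill variants), and it assumes no zero padding because none is described here; model_stem and model_files are hypothetical names.

```python
def model_stem(generation, model_num):
    """Stem of the form NM_generation_model_num, e.g. NM_2_3."""
    return f"NM_{generation}_{model_num}"

def model_files(generation, model_num, extensions=(".mod", ".lst", ".exe")):
    """File names that share the model stem, one per extension."""
    stem = model_stem(generation, model_num)
    return [stem + ext for ext in extensions]

print(model_files(2, 3))
```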

Run folders are similarly named for the generation/iteration and model number. Below is a folder tree for Example 2.

[Figure: folder tree for Example 2 (_images/FileStructure.png)]