Options List

Options are saved in a JSON file. JSON supports string, numeric, and boolean (true|false) values, as well as nested objects such as the GA and penalty sections. String options must be in double quotes. See JSON for details.
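As a minimal sketch of how these JSON types look once loaded, the snippet below parses a small fragment with Python's standard json module. The fragment is a hypothetical excerpt that reuses values from the example in this section; it is not a complete options file.

```python
import json

# Hypothetical fragment of an options file (values reused from the example in this section).
options_text = """
{
    "algorithm": "GA",
    "num_parallel": 4,
    "crossover_rate": 0.95,
    "remove_temp_dir": true
}
"""

options = json.loads(options_text)

# JSON strings, numbers and booleans map to Python str, int/float and bool.
for name, value in options.items():
    print(f"{name}: {value!r} ({type(value).__name__})")
```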

Below is an example JSON file that demonstrates every possible option. Note that some settings apply only to particular algorithms or execution environments, e.g., the GA and grid options.

{
    "author": "Charles Robert Darwin",
    "project_name": "Delicious armadillos",

    "algorithm": "GA",

    "GA": {
        "elitist_num": 2,
        "crossover_rate": 0.95,
        "mutation_rate": 0.95,
        "sharing_alpha": 0.1,
        "selection": "tournament",
        "selection_size": 2,
        "crossover_operator": "cxOnePoint",
        "mutate": "flipBit",
        "attribute_mutation_probability": 0.1,
        "niche_penalty": 20
    },

    "PSO": {
        "inertia": 0.4,
        "cognitive": 0.5,
        "social": 0.5,
        "neighbor_num": 20,
        "p_norm": 2,
        "break_on_no_change": 5
    },

    "search_omega_blocks": false,
    "search_omega_bands": false,
    "max_omega_band_width": 0,
    "search_omega_sub_matrix": false,
    "max_omega_sub_matrix": 4,
    "individual_omega_search": true,
    "max_omega_search_len": 8,

    "random_seed": 11,
    "num_parallel": 4,
    "num_generations": 6,
    "population_size": 4,

    "num_opt_chains": 4,

    "exhaustive_batch_size": 100,

    "crash_value": 99999999,

    "penalty": {
        "theta": 10,
        "omega": 10,
        "sigma": 10,
        "convergence": 100,
        "covariance": 100,
        "correlation": 100,
        "condition_number": 100,
        "non_influential_tokens": 0.00001
    },

    "downhill_period": 2,
    "num_niches": 2,
    "niche_radius": 2,
    "local_2_bit_search": true,
    "final_downhill_search": true,

    "nmfe_path": "/opt/nm751/util/nmfe75",
    "model_run_timeout": 1200,
    "model_run_priority_class": "below_normal",

    "postprocess": {
        "use_r": true,
        "post_run_r_code": "{project_dir}/simplefunc.r",
        "r_timeout": 30,
        "use_python": true,
        "post_run_python_code": "{project_dir}/../simplefunc_common.py"
    },

    "use_saved_models": false,
    "saved_models_file": "{working_dir}/models0.json",
    "saved_models_readonly": false,

    "keep_key_models": false,
    "keep_best_models": true,
    "rerun_key_models": false,

    "remove_run_dir": false,
    "remove_temp_dir": true,

    "use_system_options": true,

    "model_cache": "darwin.MemoryModelCache",
    "model_run_man": "darwin.GridRunManager",
    "grid_adapter": "darwin.GenericGridAdapter",
    "engine_adapter": "nonmem",

    "rscript_path": "C:/Program Files/R/R-4.3.1/bin/Rscript.exe",
    "nlme_dir": "C:/Program Files/Certara/NLME_Engine",
    "gcc_dir": "C:/Program Files/Certara/mingw64",
    "nlme_license": "c:/workspace/lservrc",

    "working_dir": "~/darwin/Ex1",
    "data_dir": "{project_dir}/data",
    "output_dir": "{project_dir}/output",
    "temp_dir": "{working_dir}/temp",
    "key_models_dir": "{working_dir}/key_models",

    "generic_grid_adapter": {
        "python_path": "~/darwin/venv/bin/python",
        "submit_search_command": "qsub -b y -cwd -o {project_dir}/out.txt -e {project_dir}/err.txt -N '{project_name}'",
        "submit_command": "qsub -b y -o {results_dir}/{run_name}.out -e {results_dir}/{run_name}.err -N {job_name}",
        "submit_job_id_re": "Your job (\\w+) \\(\".+?\"\\) has been submitted",
        "poll_command": "qstat -s z",
        "poll_job_id_re": "^\\s+(\\w+)",
        "poll_interval": 5,
        "delete_command": "qdel {project_stem}-*"
    }
}

Description

Here is the list of all available options. Note that many of the options have default values and are not required to be specified directly in the options file.

  • author (string): The author of the project.
    Aliased as {author}.
  • project_name (string): Name of the project. By default, it is set to the name of the parent folder of the options file.
    Aliased as {project_name}. See also {project_stem}.
  • GA (JSON): Options specific to GA. Ignored for all other algorithms.

  • elitist_num (positive int): Number of best models from any generation to carry over, unchanged, to the next generation. Functions like Hall of Fame in DEAP.
    Default: 4
  • crossover_rate (real): Fraction of mating pairs that will undergo crossover (real 0.0–1.0).
    Default: 0.95
  • mutation_rate (real): Probability that at least one bit in the genome will be “flipped”, i.e., changed from 0 to 1 or from 1 to 0 (real 0.0–1.0).
    Default: 0.95
  • sharing_alpha (real): Parameter of the niche penalty calculation.
    Default: 0.1
  • selection (string): Selection algorithm for GA. Currently only “tournament” is available.
    Default: "tournament"
  • selection_size (positive int): Number of “parents” to enter in the selection. A value of 2 is highly recommended; experience with other values is very limited.
    Default: 2
  • crossover_operator (string): The algorithm for crossover. Currently only “cxOnePoint” (single-point crossover) is available.
    Default: "cxOnePoint"
  • mutate (string): The algorithm for mutation. Currently only “flipBit” is available.
    Default: "flipBit"
  • attribute_mutation_probability (real): Probability of any bit being mutated (real 0.0–1.0).
    Default: 0.1
  • niche_penalty (positive real): Used for calculation of the crowding penalty. The niche penalty is calculated by first finding the “distance matrix”, the pairwise Minkowski distance from the present model to all other models. The “crowding” quantity is then calculated as the sum of (distance/niche_radius)**sharing_alpha over all other models in the generation for which the Minkowski distance is less than the niche radius.
    Finally, the penalty is calculated as exp((crowding-1)*niche_penalty)-1. The objective of using a niche penalty is to maintain diversity of models and avoid premature convergence of the search by penalizing models that are too similar to other models in the current generation.
    Default: 20
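The niche penalty calculation described above can be sketched roughly as follows. This is an illustration of the stated formulas only, not pyDarwin's actual code: the function names are hypothetical, models are assumed to be represented as lists of 0/1 bits, and the Minkowski order p=2 is an assumption.

```python
import math

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length bit lists (p=2 is an assumption)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def niche_penalty_for(model, others, niche_radius=2.0, sharing_alpha=0.1,
                      niche_penalty=20.0):
    """Crowding penalty of one model relative to the other models in a generation."""
    crowding = 0.0
    for other in others:
        distance = minkowski_distance(model, other)
        # Only models closer than the niche radius contribute to crowding.
        if distance < niche_radius:
            crowding += (distance / niche_radius) ** sharing_alpha
    # Note: math.exp overflows for very large crowding values; this sketch
    # does not guard against that.
    return math.exp((crowding - 1) * niche_penalty) - 1

lonely = niche_penalty_for([1, 0, 1, 1], [[0, 1, 0, 0]])
crowded = niche_penalty_for([1, 0, 1, 1], [[1, 0, 1, 0], [1, 0, 0, 1]])
print(lonely, crowded)
```

With these inputs the model with two close neighbors receives a much larger penalty than the model with none, which is the diversity-preserving effect the option is meant to have.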
  • PSO (JSON): Options specific to PSO. Ignored for all other algorithms.

  • inertia (real): Particle coordination movement as it relates to the previous velocity. Commonly denoted as \(w\).
    Default: 0.4
  • cognitive (real): Particle coordination movement as it relates to its own best known position. Commonly denoted as \(c_1\).
    Default: 0.5
  • social (real): Particle coordination movement as it relates to the current best known position across all particles. Commonly denoted as \(c_2\).
    Default: 0.5
  • neighbor_num (positive int): Number of neighbors that any particle interacts with to determine the social component of the velocity of the next step.
    A smaller number of neighbors results in a more thorough search (as the neighborhoods tend to move more independently, allowing the swarm to cover a larger
    section of the total search space) but will converge more slowly.
    Default: 20
  • p_norm (positive int): Minkowski p-norm to use. A value of 1 is the sum of absolute values (or L1 distance), while 2 is the Euclidean (or L2) distance.
    Default: 2
  • break_on_no_change (positive int): Number of iterations used to determine whether the optimization has converged.
    Default: 5
    Default: 5
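To make the roles of inertia, cognitive, and social concrete, here is the textbook PSO velocity update written out in Python. This is the generic continuous form, shown only as an assumption-laden sketch: pyDarwin searches over model bit strings, and its actual update rule may differ. The function name and arguments are hypothetical.

```python
import random

def pso_velocity(velocity, position, personal_best, neighborhood_best,
                 inertia=0.4, cognitive=0.5, social=0.5, rng=random):
    """Textbook update: v' = w*v + c1*r1*(pbest - x) + c2*r2*(nbest - x)."""
    new_velocity = []
    for v, x, p, g in zip(velocity, position, personal_best, neighborhood_best):
        r1, r2 = rng.random(), rng.random()  # fresh random weights per coordinate
        new_velocity.append(
            inertia * v                    # keep part of the previous velocity
            + cognitive * r1 * (p - x)     # pull toward the particle's own best
            + social * r2 * (g - x)        # pull toward the neighborhood best
        )
    return new_velocity

# A particle already sitting on both best positions only keeps its damped velocity.
print(pso_velocity([1.0, -2.0], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]))
```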
  • search_omega_blocks (boolean): Set to true to search omega blocks (similar to search_omega_bands, but for NLME).
    Default: false
  • search_omega_bands (boolean): Set to true to search omega bands.
    Default: false
  • max_omega_band_width (positive int): Maximum size of omega band to use in search.
    Default: 0
  • search_omega_sub_matrix (boolean): Set to true to search omega sub matrix.
    Default: false
  • max_omega_sub_matrix (positive int): Maximum size of sub matrix to use in search.
    Default: 4
  • individual_omega_search (boolean): If true, every omega search block will be handled individually: each block will have a separate gene and max omega search length (either calculated automatically or set explicitly with max_omega_search_len).
    When individual_omega_search is set to false, the omega search will be performed uniformly, that is, all search blocks will have the same pattern of block omegas.
    Only search blocks placed directly in the template file can be calculated individually. If any search block is found in tokens, individual_omega_search is reset to false.
    Default: true
  • max_omega_search_len (int [2, 16]): Maximum number of omegas in a single omega search block. If not set, it will be calculated automatically.
  • random_seed (positive int): A seed value for the random number generator. Used by all machine learning algorithms.
    The random_seed is also used to generate off-diagonal estimates when Searching Omega Structure, regardless of whether one of the machine learning algorithms or an Exhaustive Search is used.
  • num_parallel (positive int): Number of models to execute in parallel, i.e., how many threads to create to handle model runs.
    If the models are run locally, it’s the maximum number of models running at the same time and should not exceed the number of cores (logical/virtual processors).
    For grid runs, it’s the number of models to send to the queue and read results from at any given time. Execution itself is performed by grid nodes, so actual throughput is managed by the grid engine. In this case, 4 threads are enough.
    Default: 4
  • num_generations (required, positive int): Number of iterations or generations of the search algorithm to run.
    Not used/required for EX.
  • population_size (required, positive int): Number of models to create in every generation.
    Not used/required for EX.
  • num_opt_chains (required, positive int): Number of parallel processes to perform the “ask” step (to increase performance).
    Required only for GP, RF and GBRT.
  • exhaustive_batch_size (positive int): Since there are no iterations in EX, and the number of models in the search space may be enormous (possibly millions), the models are run in batches of a more manageable size, so EX is essentially split into pseudo-iterations. This setting is the size of those batches. Several things to consider when choosing the size:

    • typical value is 50 to 1000

    • in general, the size should be at least 10 to 20 times bigger than the number of models you can run in parallel

    • anything less than 50 is considered ineffective from a CPU/grid utilization perspective, as all models in a batch must complete before the next batch starts

    • if you submit model runs to a grid, the size shouldn’t be too big, to avoid overwhelming or monopolizing your grid queue

    • for local runs, you may batch as many models as you want if you don’t mind losing some cached models in case of an accident (the model cache is dumped to a file at the end of every batch); unlike grid runs, parallel searches won’t be affected by this setting, since for local runs the load is determined mainly by num_parallel

    Default: 100
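The batching idea behind exhaustive_batch_size can be illustrated with a small generator. This is only a sketch of splitting a search space into fixed-size batches; the helper name and the toy search space are hypothetical and not taken from pyDarwin.

```python
from itertools import islice, product

def batches(iterable, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical search space: three tokens with 4, 5 and 12 choices (240 models).
search_space = product(range(4), range(5), range(12))
sizes = [len(batch) for batch in batches(search_space, 100)]
print(sizes)  # [100, 100, 40]
```

Each batch acts as one pseudo-iteration: all of its models must finish before the next batch starts, which is why very small batches leave processors or grid slots idle.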
  • crash_value (positive real): Value of fitness or reward assigned when model output is not generated. Should be set larger than any anticipated completed model fitness.
    Default: 99999999
  • penalty (JSON):

  • theta (real): Penalty added to fitness/reward for each estimated THETA. A value of 3.84 corresponds to a hypothesis test with 1 df and p < 0.05 (for nested models); a value of 2 for 1 df corresponds to the Akaike information criterion.
    Default: 10
  • omega (real): Penalty added to fitness/reward for each estimated OMEGA element.
    Default: 10
  • sigma (real): Penalty added to fitness/reward for each estimated SIGMA element.
    Default: 10
  • convergence (real): Penalty added to fitness/reward for failing to converge.
    Default: 100
  • covariance (real): Penalty added to fitness/reward for failing the covariance step. If a successful covariance step is important, this can be set to a large value (e.g., 100); if a successful covariance step is not at all important, set it to 0. Note that if the covariance step is not requested (and therefore cannot be successful), the penalty is added. If a covariance step is not requested, it is suggested that the covariance penalty be set to 0.
    Default: 100
  • correlation (real): Penalty added to fitness/reward if any off-diagonal element of the correlation matrix of the estimates has absolute value > 0.95. This penalty will also be added if the covariance step is requested but fails or if the covariance step is not requested (as in these cases, the off-diagonal elements are not available). If a covariance step is not requested, it is suggested that the correlation penalty be set to 0.
    Default: 100
  • condition_number (real): Penalty added to fitness/reward if the covariance step fails or is not requested, or if the covariance step is successful and the condition number is greater than 1000. If a covariance step is not requested, it is suggested that the condition_number penalty be set to 0.
    Default: 100
  • non_influential_tokens (real): Penalty added to fitness/reward if any tokens do not influence the control file (relevant for nested tokens). Should be very small (e.g., 0.0001), as the purpose is only to break a tie by making the model with non-influential tokens slightly worse than the same model without them.
    Default: 0.00001
    Default: 0.00001
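The way the penalties above add up can be sketched as follows. This is a simplified reading of the descriptions, not pyDarwin's implementation: the function, its arguments, and the use of an objective function value (ofv) as the base fitness are all assumptions, and the non_influential_tokens penalty is applied once here, following the "if any tokens" wording.

```python
# Penalty values copied from the defaults listed above.
PENALTY = {
    "theta": 10, "omega": 10, "sigma": 10,
    "convergence": 100, "covariance": 100,
    "correlation": 100, "condition_number": 100,
    "non_influential_tokens": 0.00001,
}

def penalized_fitness(ofv, n_theta, n_omega, n_sigma, *, converged=True,
                      covariance_ok=True, max_abs_correlation=0.0,
                      condition_number=1.0, non_influential_tokens=False,
                      penalty=PENALTY):
    """Base value plus parsimony and failure penalties (hypothetical helper)."""
    total = ofv
    total += penalty["theta"] * n_theta
    total += penalty["omega"] * n_omega
    total += penalty["sigma"] * n_sigma
    if not converged:
        total += penalty["convergence"]
    if not covariance_ok:
        # Failed or unrequested covariance step: correlation and condition
        # number are unavailable, so all three penalties are added.
        total += (penalty["covariance"] + penalty["correlation"]
                  + penalty["condition_number"])
    else:
        if max_abs_correlation > 0.95:
            total += penalty["correlation"]
        if condition_number > 1000:
            total += penalty["condition_number"]
    if non_influential_tokens:
        total += penalty["non_influential_tokens"]
    return total

print(penalized_fitness(1000.0, n_theta=5, n_omega=2, n_sigma=1))  # 1080.0
print(penalized_fitness(1000.0, 5, 2, 1, covariance_ok=False))     # 1380.0
```

The second call shows why setting the covariance, correlation, and condition_number penalties to 0 is suggested when no covariance step is requested: otherwise every model carries the same 300-point surcharge.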
  • downhill_period (int): How often to run the downhill step. If < 1, no periodic downhill search will be performed.
    Default: -1
  • num_niches (int): Used for GA and downhill. A penalty is assigned for each model based on the number of similar models within a niche radius. This penalty is applied only to the selection process (not to the fitness of the model). The purpose is to ensure that a degree of diversity is maintained in the population. num_niches is also used to select the number of models that are entered into the downhill step for all algorithms, except EX.
    Default: 2
  • niche_radius (positive real): The radius of the niches. See “Niche Radius”.
    Default: 2
  • local_2_bit_search (boolean): Whether to perform the two bit local search. The two bit local search substantially increases the robustness of the search. All downhill local searches are done starting from num_niches models.
    Default: false
  • final_downhill_search (boolean): Whether to perform a local search (1 and 2 bit) at the end of the global search.
    Default: false
  • nmfe_path (required, string): The command line for executing NONMEM. Usually, it’s a full path to the nmfe script.
    Required if there are actual NONMEM model runs performed. It’s completely ignored until the first model run starts.
  • model_run_timeout (positive real): Time (seconds) after which the NONMEM execution will be terminated, and the crash value assigned.
    Default: 1200
  • model_run_priority_class (Windows only, "normal" | "below_normal"): Priority class for child processes that build and run models as well as run the R postprocess script. below_normal is recommended to maintain user interface responsiveness.
    Default: "below_normal"
  • postprocess (JSON):

  • use_r (boolean): Whether user-supplied R code is to be run after NONMEM execution.
    Default: false
  • rscript_path (deprecated, string): Absolute path to Rscript.exe.
    Use the top-level rscript_path option instead.
  • post_run_r_code (required, string): Path to the R file (.r extension) to be run after each NONMEM execution.
    Required if use_r is set to true.
    Available aliases are: all common aliases.
  • r_timeout (positive real): Timeout (seconds) for R code execution.
    Default: 90
  • use_python (boolean): Whether user-supplied Python code is to be run after NONMEM execution.
    Default: false
  • post_run_python_code (required, string): Path to the Python code file (.py extension) to be run after each NONMEM execution.
    Required if use_python is set to true.
    Available aliases are: all common aliases.
  • use_saved_models (boolean): Whether to restore saved Model Cache from file. The file is specified with saved_models_file.
    Default: false
  • saved_models_file (string): The file from which to restore Model Cache.
    Will only have an effect if use_saved_models is set to true.
    By default, the cache is saved in {working_dir}/models.json and cleared every time the search is started. To use saved runs, rename models.json or copy it to a different location.
    Available aliases are: all common aliases.

Warning

Don’t set saved_models_file to {working_dir}/models.json.

  • saved_models_readonly (boolean): Do not overwrite the saved_models_file content.
    Default: false
  • keep_key_models (boolean): Whether to save the best model from every generation. Models are copied to key_models_dir.
    Default: false
  • keep_best_models (boolean): Save only key models that improve the fitness value, i.e., the models better than the previous overall best model. Unlike keep_key_models, this option may skip some generations.
    When set to true, it also overrides keep_key_models to true.
    Default: true

Note

Since keep_best_models is on by default, you have to set it to false explicitly if you want key models from every generation to be saved.

Note

keep_key_models/keep_best_models are not applicable to Exhaustive search.

  • rerun_key_models (boolean): Sometimes saved key models don’t have any output:
    • when a model is restored from the cache file, it has only a fitness value

    • when a model is not better than the overall best model so far, its run folder is cleaned up after the run

    In order to obtain the desired output (e.g., tables), such models need to be re-run. To do so, set rerun_key_models to true.
    All the models that don’t have their output stored will be re-run after the entire search.
    Default: false

Note

rerun_key_models has no effect unless keep_key_models or keep_best_models is true.
  • remove_run_dir (boolean): If true, the entire model run directory will be deleted; otherwise, only unnecessary files inside it.
    Default: false
  • remove_temp_dir (boolean): Whether to delete the entire temp_dir after the search is finished or stopped. Doesn’t have any effect when the search is run on a grid.
    Default: false
  • use_system_options (boolean): Whether to override options with environment-specific values.
    Default: true
  • model_cache (string): ModelCache subclass to be used.
    You can create your own and use it (e.g., a cache that stores model runs in a database). The name is quite arbitrary and doesn’t have any convention/constraints.
    Default: darwin.MemoryModelCache
  • model_run_man (string): ModelRunManager subclass to be used.
    Currently there are only darwin.LocalRunManager and darwin.GridRunManager.
    Default: darwin.LocalRunManager
  • grid_adapter (string): GridAdapter subclass to be used.
    Currently only darwin.GenericGridAdapter is available.
    Default: darwin.GenericGridAdapter
  • engine_adapter (string): ModelEngineAdapter subclass to be used.
    Currently nonmem and nlme are available.
    Default: nonmem
  • rscript_path (required, string): Absolute path to Rscript.exe.
    Required if either use_r is set to true or engine_adapter is set to nlme.
  • nlme_dir (required, string): Absolute path to the NLME Engine installation.
    Required if engine_adapter is set to nlme.
  • gcc_dir (required, string): Absolute path to the GCC root directory.
    Required if engine_adapter is set to nlme.
  • nlme_license (string): Absolute path to the NLME license file.
  • working_dir (string): The project’s working directory, where all the necessary files and folders are created. It is also the default location of the output and temp folders.
    By default, it is set to ‘<pyDarwin home>/{project_stem}’.
    Aliased as {working_dir}.
    Available aliases are: {project_dir}.
  • data_dir (string): Directory where datasets are located. Must be available for individual model runs.
    Default: {project_dir}
    Aliased as {data_dir}.
    Available aliases are: {project_dir}, {working_dir}.
  • temp_dir (string): Parent directory for all model runs’ run directories, i.e., where all folders for every iteration are located.
    Default: {working_dir}/temp
    Aliased as {temp_dir}.
    Available aliases are: {project_dir}, {working_dir}.
  • key_models_dir (string): Directory where key/best models will be saved.
    Default: {working_dir}/key_models
    Available aliases are: {project_dir}, {working_dir}.
  • generic_grid_adapter (JSON): These settings are necessary only when you use darwin.GridRunManager as model_run_man.
    For local runs this entire section is ignored.
  • python_path (required, string): Path to your Python interpreter, preferably the interpreter located in the virtual environment where pyDarwin is deployed. The path must be available to all grid nodes that run jobs.

  • submit_command (required, string): A command that submits individual runs to the grid queue. The actual command submitted to the queue is <python_path> -m darwin.run_model <input file> <output file> <options file>, but you don’t put it in the submit_command.
    Example: qsub -b y -o {results_dir}/{run_name}.out -e {results_dir}/{run_name}.err -N {job_name}
    Available aliases are: all common aliases, job submit aliases.
  • submit_search_command (required, string): A command that submits the search job to the grid queue. Similar to submit_command, but for the entire search.
    Example: qsub -b y -cwd -o {project_stem}_out.txt -e {project_stem}_err.txt -N '{project_name}'
    Required only for grid search.
    Available aliases are: all common aliases, {darwin_cmd}.

Note

No directories are created at the point of submitting the search job. So although it’s possible to use {working_dir}, {out_dir}, and {temp_dir} in submit_search_command, it’s not recommended. These aliases are not prohibited because the directories may already exist, e.g., if you set those settings to existing folders or ran the search locally before submitting it to the grid.

  • submit_job_id_re (required, string): A regular expression to find a job id in submit_command output. The job id must be captured by the first capturing group.
    May look like this: Your job (\\w+) \\(\".+?\"\\) has been submitted
  • poll_command (required, string): A command that retrieves finished jobs from the grid controller. If your controller/setup allows you to specify ids/patterns in polling commands, do it (see delete_command). If it doesn’t, you must poll ALL finished jobs: qstat -s z
    Available aliases are: all common aliases, {job_ids}.
  • poll_job_id_re (required, string): A regular expression to find a job id in every line of poll_command output. Similar to submit_job_id_re.

  • poll_interval (int): How often to poll jobs (seconds).
    Default: 10
  • delete_command (string): A command that deletes all unfinished jobs related to the search when you stop it. It may delete all of them by id (qdel {job_ids}) or by mask (qdel {project_stem}-*).
    Available aliases are: all common aliases, {job_ids}.

Warning

Be careful when using a mask: if your mask matches the search job name, it may kill your search prematurely, e.g., during saving the cache.
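Before putting submit_job_id_re and poll_job_id_re into the options file, it can help to try them against sample output. The sketch below does this with Python's re module, using the two example expressions from the options above. The qsub/qstat output strings are made up for illustration and may not match what your grid engine prints. Note that the doubled backslashes in the JSON file become single backslashes in the actual regular expression.

```python
import re

# Same expressions as in the JSON example, without the JSON escaping.
submit_job_id_re = re.compile(r'Your job (\w+) \(".+?"\) has been submitted')
poll_job_id_re = re.compile(r'^\s+(\w+)')

# Hypothetical command output, made up for illustration.
submit_output = 'Your job 4321 ("Example_1-0001") has been submitted'
poll_output = "\n".join([
    "job-ID  prior   name",
    "--------------------------------",
    "   4321 0.55500 Example_1-0001",
    "   4322 0.55500 Example_1-0002",
])

# The job id is taken from the first capturing group.
job_id = submit_job_id_re.search(submit_output).group(1)

# The poll expression is applied to every line of the output.
finished = []
for line in poll_output.splitlines():
    match = poll_job_id_re.search(line)
    if match:
        finished.append(match.group(1))

print(job_id, finished)
```

In this sample the header and separator lines are skipped because they do not start with whitespace, which is what the poll expression requires.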

Aliases

An alias is essentially substitute text for a keyword. The main purpose of aliases is to unify and simplify the configuration of projects across different environments.
We encourage you to become familiar with them and to use them instead of explicit values, e.g., paths to your projects and their internals.
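Conceptually, alias expansion is plain text substitution. The following sketch illustrates the idea only; it is not pyDarwin's implementation, and the helper name and directory values are hypothetical.

```python
def expand_aliases(text, aliases):
    """Replace every {name} placeholder that has a known value."""
    for name, value in aliases.items():
        text = text.replace("{" + name + "}", value)
    return text

# Hypothetical values for two aliases.
aliases = {
    "project_dir": "/home/user/projects/Ex1",
    "working_dir": "/home/user/pydarwin/Ex1",
}

print(expand_aliases("{working_dir}/temp", aliases))
print(expand_aliases("{project_dir}/simplefunc.r", aliases))
```

Because the same option text resolves to different paths depending on the alias values, one options file can be reused across machines without editing explicit paths.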

Common aliases

These aliases are applicable to several different options, so it’s easier to refer to them as a group.
They also can be used in templates.
  • {project_stem} – A file system friendly representation of the project name, suitable for use as a folder name: all non-letters and non-digits are replaced with underscores, e.g., Some reasonable(ish) name becomes Some_reasonable_ish__name. This cannot be set directly.
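The replacement rule for {project_stem} can be reproduced with a one-line regular expression. This is a sketch of the rule as described, assuming "letters" means ASCII letters; the function name is hypothetical and pyDarwin's own handling of non-ASCII characters may differ.

```python
import re

def project_stem(project_name):
    """Replace every character that is not an ASCII letter or digit with '_'."""
    return re.sub(r"[^0-9A-Za-z]", "_", project_name)

print(project_stem("Some reasonable(ish) name"))  # Some_reasonable_ish__name
print(project_stem("Delicious armadillos"))       # Delicious_armadillos
```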

Grid job aliases

Job submit aliases

These aliases are only applicable to submit_command.

  • {results_dir} – Alias for the {working_dir}/run_results, where the results of individual runs are stored as ModelRun objects serialized to JSON files.

  • {job_name} – Alias for the default job name, which is {project_name}-{run_name}. Default here doesn’t mean it will be assigned to a job automatically, it’s up to the user to decide whether to use it or generate their own using other available aliases, e.g., {project_name}-{generation}-{run_number}.

  • {darwin_cmd} – Alias for the command sequence that runs a pyDarwin command. Depending on the context, it can execute run_search, run_search_in_folder, or run_model. By default, the sequence is added to the end of submit_command/submit_search_command. Using this alias, you can put the sequence wherever you want in the command. For example, Slurm requires the command to be wrapped, so the settings may look like this:

    "generic_grid_adapter" : {
      "submit_search_command" : "sbatch -D {project_dir} --job-name '{project_name}' --output {project_stem}.out --error {project_stem}.err --wrap '{darwin_cmd}'",
      "submit_command" : "sbatch --job-name '{job_name}' --output {results_dir}/{run_name}.out --error {results_dir}/{run_name}.err --wrap {darwin_cmd}",
      "submit_job_id_re" : "Submitted batch job (\\d+)",
      "poll_command" : "squeue -t CD",
      "poll_job_id_re" : "^(\\d+)",
      "poll_interval" : 10,
      "delete_command" : "scancel {job_ids}"
    }
    

Note

Due to different mechanisms of calling the command, {darwin_cmd} must be enclosed in single quotes for submit_search_command and must not be enclosed in quotes for submit_command.

Job delete/poll aliases

  • {job_ids} – Alias for a whitespace-delimited list of ids of all unfinished jobs that were submitted from the current population.
    Can be used in: poll_command, delete_command.

Environment variables

There are a few environment variables that you may want to set to make pyDarwin easier to use.

PYDARWIN_HOME

This environment variable allows you to change pyDarwin home to an arbitrary existing directory.

set PYDARWIN_HOME=C:\workspace\pydarwin

Note

It’s not advised to put pyDarwin home inside a temp folder for a variety of reasons.

PYDARWIN_OPTIONS

This environment variable specifies an options file whose settings override those of individual projects (see Settings override).

set PYDARWIN_OPTIONS=C:\workspace\darwin\system_options.json

Settings override

At some point you may start running your projects in different environments. It may become quite annoying to edit nmfe_path and rscript_path every time you copy the project back and forth between Windows and Linux.

To avoid this, you can create a separate options file for every environment (even every user if you wish) and place all the environment-specific settings inside this file. Then, you can just set PYDARWIN_OPTIONS to the path of that file, and every setting from that file will override the corresponding setting in any options.json of any project you run in that environment. Overriding can be switched off by setting use_system_options to false.

Note

Set use_system_options in the project’s options.json, not in the common one.

Good candidates to put into common options file are:

  • nmfe_path

  • rscript_path

  • num_parallel

  • author

  • random_seed

Basically, any setting can be overridden. However, be careful not to override the algorithm or penalties (unless this is intended).

When you override nested settings, you don’t have to specify every single value in the section, only those you want to be changed.

For example:

{
    "author": "Mark Sale",
    "project_name": "Example 11",

    "algorithm": "GA",

    "random_seed": 11,
    "num_parallel": 40,
    "num_generations": 14,
    "population_size": 140,

    "remove_run_dir": true,

    "nmfe_path": "C:/nm744/util/nmfe74.bat",

    "postprocess": {
        "use_r": true,
        "post_run_r_code": "{project_dir}/Cmaxppc.r",
        "rscript_path": "C:\\Program Files\\R\\R-4.1.3\\bin\\Rscript.exe"
    }
}

In terms of options priority, pyDarwin loads options.json, then system_options.json, then merges those two together so values from system_options overwrite the original ones. After that, all default values are applied, and resulting options values are used.
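The priority order described above can be sketched as a recursive dictionary merge. This is an illustration under stated assumptions, not pyDarwin's code: the function name is hypothetical, the option values are made up, and defaults are modeled as the lowest-priority layer that fills in whatever neither file sets.

```python
import copy

def merge_options(base, override):
    """Return base updated with override; nested sections are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_options(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# Hypothetical contents of the three layers.
defaults = {"num_parallel": 4, "postprocess": {"use_r": False, "r_timeout": 90}}
project = {"num_parallel": 40, "postprocess": {"use_r": True}}   # options.json
system = {"num_parallel": 8, "postprocess": {"r_timeout": 30}}   # system_options.json

# System options win over project options; defaults fill the remaining gaps.
effective = merge_options(defaults, merge_options(project, system))
print(effective)
```

Here the system file changes only r_timeout inside postprocess, and use_r from the project file survives, which matches the rule that nested overrides need to list only the values being changed.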

Note

When running models on a grid, individual models are run on different nodes (in different environments). You must ensure that you either override the settings on every node or don’t override them at all.