MCMC Sampling

MCMC Sampling#

Implementation of flexible MCMC sampling for lymphatic progression models.

This module provides both helpful functions for programmatically building and running sampling pipelines, as well a CLI interface for th most common sampling use cases.

The core is the run_sampling() function. It has a flexible interface and built-in convergence detection, as well as bookkeeping for monitoring and resuming interrupted sampling runs. It can be used both during the burn-in phase and the actual sampling phase.

Warning

We strongly recommend to set the CLI’s --cores argument to None (or null in the YAML config file) if you are on MacOS or Windows. This is because we haven’t yet figured out how we can safely and efficiently use the multiprocess(ing) library on these two platforms.

class lyscripts.sample.CompletedItersColumn(table_column=None, it: int = 0)[source]#

Bases: ProgressColumn

A column that displays the completed number of iterations.

render(task: Task) → Text[source]#: Render total iterations.

class lyscripts.sample.ItersPerSecondColumn(table_column: Column | None = None)[source]#

Bases: ProgressColumn

A column that displays the number of iterations per second.

render(task: Task) → Text[source]#: Render iterations per second.

pydantic model lyscripts.sample.AcorTime[source]#

Bases: BaseModel

Storage for old and new autocorrelation times.

Show JSON schema

{
   "title": "AcorTime",
   "description": "Storage for old and new autocorrelation times.",
   "type": "object",
   "properties": {
      "old": {
         "title": "Old",
         "type": "number"
      },
      "new": {
         "title": "New",
         "type": "number"
      }
   },
   "required": [
      "old",
      "new"
   ]
}

field old: float [Required]#

field new: float [Required]#

update(new: float) → None[source]#: Update the autocorrelation time.

property relative_diff: float#: Get the relative difference between new and old autocorrelation time.

pydantic model lyscripts.sample.NumAccepted[source]#

Bases: BaseModel

Storage for old and new number of accepted proposals.

Show JSON schema

{
   "title": "NumAccepted",
   "description": "Storage for old and new number of accepted proposals.",
   "type": "object",
   "properties": {
      "old": {
         "title": "Old",
         "type": "integer"
      },
      "new": {
         "title": "New",
         "type": "integer"
      }
   },
   "required": [
      "old",
      "new"
   ]
}

field old: int [Required]#

field new: int [Required]#

update(new: int) → None[source]#: Update the number of accepted proposals.

property newly_accepted: int#: Get the number of newly accepted proposals.

lyscripts.sample.log_prob_fn(theta: Iterable[float] | dict[str, float], inverse_temp: float = 1.0) → tuple[float, float][source]#

Compute log-prob using global variables because of pickling.

An inverse temperature inverse_temp can be provided for thermodynamic integration.

lyscripts.sample.ensure_initial_state(sampler: EnsembleSampler) → ndarray[source]#

Try to extract a starting state from a sampler.

Create a random starting state if no one was found.

lyscripts.sample.ensure_history_table(file: Path | None) → DataFrame[source]#

Return the history table from a file or an empty DataFrame.

It will try to load a history at the given file location, but with a .tmp extension. This is the expected name and location of a history file that was stored during an interrupted sampling run.

If no file is found, an empty DataFrame is returned.

lyscripts.sample.update_history_table(history: DataFrame, history_file: Path | None, iteration: int, acor_time: float, accepted_frac: float, max_log_prob: float) → DataFrame[source]#: Update the history table with the current iteration’s information.

lyscripts.sample.is_converged(iteration: int, acor_time: AcorTime, trust_factor: float, relative_thresh: float) → bool[source]#

Check if the chain has converged based on the autocorrelation time.

The criterion is based on the relative change of the autocorrelation time and whether the autocorrelation extimate can be trusted. Essentially, we only trust the estimate if it is smaller than trust_factor times the current iteration.

More details can be found in the emcee documentation.

lyscripts.sample.run_sampling(sampler: EnsembleSampler, initial_state: ndarray | None = None, num_steps: int | None = None, thin_by: int = 1, check_interval: int = 100, trust_factor: float = 50.0, relative_thresh: float = 0.05, history_file: Path | None = None, reset_backend: bool = False, description: str = 'Burn-in phase') → None[source]#

Run MCMC sampling.

This will run the sampler either for num_steps steps or - if it set to None - until convergence. Convergence is determined once within a check_interval of steps by the is_converged() function. The convergence criterion is based on a trustworthy estimate of the autocorrelation time. This is elaborated in the emcee documentation.

Some bookkeeping parameters may be stored in a history_file. During sampling, the history is stored in a temporary file with the suffix .tmp. If the sampling is interrupted, the history and the last state of the sampler can be recovered and the sampling can be continued.

One may choose to reset_backend, e.g. in case the previous sampling was run until convergence and now one wants to store a length of the converged chain. This may also be thinned by a factor of thin_by (directly passed to the emcee.EnsembleSampler class).

class lyscripts.sample.DummyPool[source]#

Bases: object

Dummy class to allow for no multiprocessing.

lyscripts.sample.get_pool(num_cores: int | None) → Any | DummyPool[source]#

Get a multiprocess(ing) pool or DummyPool.

Returns a multiprocess(ing) pool with num_cores cores if num_cores is not None. Otherwise, a DummyPool is returned.

lyscripts.sample.init_sampler(settings: SampleCLI, ndim: int, pool: Any) → EnsembleSampler[source]#: Initialize the emcee.EnsembleSampler with the given settings.

pydantic settings lyscripts.sample.SampleCLI[source]#

Bases: BaseCLI

Use MCMC to infer distributions over model parameters from data.

Show JSON schema

{
   "title": "SampleCLI",
   "description": "Use MCMC to infer distributions over model parameters from data.",
   "type": "object",
   "properties": {
      "configs": {
         "default": [
            "config.yaml"
         ],
         "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
         "items": {
            "format": "path",
            "type": "string"
         },
         "title": "Configs",
         "type": "array"
      },
      "graph": {
         "$ref": "#/$defs/GraphConfig"
      },
      "model": {
         "$ref": "#/$defs/ModelConfig",
         "default": {
            "external_file": null,
            "class_name": "Unilateral",
            "constructor": "binary",
            "max_time": 10,
            "named_params": null,
            "kwargs": {}
         }
      },
      "distributions": {
         "additionalProperties": {
            "$ref": "#/$defs/DistributionConfig"
         },
         "default": {},
         "description": "Mapping of model T-categories to predefined distributions over diagnose times.",
         "title": "Distributions",
         "type": "object"
      },
      "modalities": {
         "additionalProperties": {
            "$ref": "#/$defs/ModalityConfig"
         },
         "default": {},
         "description": "Maps names of diagnostic modalities to their specificity/sensitivity.",
         "title": "Modalities",
         "type": "object"
      },
      "data": {
         "$ref": "#/$defs/DataConfig"
      },
      "sampling": {
         "$ref": "#/$defs/SamplingConfig"
      }
   },
   "$defs": {
      "DataConfig": {
         "description": "Where to load lymphatic progression data from and how to feed it into a model.",
         "properties": {
            "source": {
               "anyOf": [
                  {
                     "format": "file-path",
                     "type": "string"
                  },
                  {
                     "$ref": "#/$defs/LyDataset"
                  }
               ],
               "description": "Either a path to a CSV file or a config that specifies how and where to fetch the data from.",
               "title": "Source"
            },
            "side": {
               "anyOf": [
                  {
                     "enum": [
                        "ipsi",
                        "contra"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Side of the neck to load data for. Only for Unilateral models.",
               "title": "Side"
            },
            "mapping": {
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "integer"
                     },
                     {
                        "type": "string"
                     }
                  ]
               },
               "description": "Optional mapping of numeric T-stages to model T-stages.",
               "title": "Mapping",
               "type": "object"
            }
         },
         "required": [
            "source"
         ],
         "title": "DataConfig",
         "type": "object"
      },
      "DistributionConfig": {
         "description": "Configuration defining a distribution over diagnose times.",
         "properties": {
            "kind": {
               "default": "frozen",
               "description": "Parametric distributions may be updated.",
               "enum": [
                  "frozen",
                  "parametric"
               ],
               "title": "Kind",
               "type": "string"
            },
            "func": {
               "const": "binomial",
               "default": "binomial",
               "description": "Name of predefined function to use as distribution.",
               "title": "Func",
               "type": "string"
            },
            "params": {
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "integer"
                     },
                     {
                        "type": "number"
                     }
                  ]
               },
               "default": {},
               "description": "Parameters to pass to the predefined function.",
               "title": "Params",
               "type": "object"
            }
         },
         "title": "DistributionConfig",
         "type": "object"
      },
      "GraphConfig": {
         "description": "Specifies how the tumor(s) and LNLs are connected in a DAG.",
         "properties": {
            "tumor": {
               "additionalProperties": {
                  "items": {
                     "type": "string"
                  },
                  "type": "array"
               },
               "description": "Define the name of the tumor(s) and which LNLs it/they drain to.",
               "title": "Tumor",
               "type": "object"
            },
            "lnl": {
               "additionalProperties": {
                  "items": {
                     "type": "string"
                  },
                  "type": "array"
               },
               "description": "Define the name of the LNL(s) and which LNLs it/they drain to.",
               "title": "Lnl",
               "type": "object"
            }
         },
         "required": [
            "tumor",
            "lnl"
         ],
         "title": "GraphConfig",
         "type": "object"
      },
      "LyDataset": {
         "description": "Specification of a dataset.",
         "properties": {
            "year": {
               "description": "Release year of dataset.",
               "exclusiveMinimum": 0,
               "maximum": 2026,
               "title": "Year",
               "type": "integer"
            },
            "institution": {
               "description": "Institution's short code. E.g., University Hospital Zurich: `usz`.",
               "minLength": 1,
               "title": "Institution",
               "type": "string"
            },
            "subsite": {
               "description": "Tumor subsite(s) patients in this dataset were diagnosed with.",
               "minLength": 1,
               "title": "Subsite",
               "type": "string"
            },
            "repo_name": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "lycosystem/lydata",
               "description": "GitHub `repository/owner`.",
               "title": "Repo Name"
            },
            "ref": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "main",
               "description": "Branch/tag/commit of the repo.",
               "title": "Ref"
            },
            "local_dataset_dir": {
               "anyOf": [
                  {
                     "format": "directory-path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to directory containing all the dataset subdirectories. So, e.g. if `path_on_disk` is `~/datasets` and the dataset is `2023-clb-multisite`, then the CSV file is expected to be at `~/datasets/2023-clb-multisite/data.csv`.",
               "title": "Local Dataset Dir"
            }
         },
         "required": [
            "year",
            "institution",
            "subsite"
         ],
         "title": "LyDataset",
         "type": "object"
      },
      "ModalityConfig": {
         "description": "Define a diagnostic or pathological modality.",
         "properties": {
            "spec": {
               "description": "Specificity of the modality.",
               "maximum": 1.0,
               "minimum": 0.5,
               "title": "Spec",
               "type": "number"
            },
            "sens": {
               "description": "Sensitivity of the modality.",
               "maximum": 1.0,
               "minimum": 0.5,
               "title": "Sens",
               "type": "number"
            },
            "kind": {
               "default": "clinical",
               "description": "Clinical modalities cannot detect microscopic disease.",
               "enum": [
                  "clinical",
                  "pathological"
               ],
               "title": "Kind",
               "type": "string"
            }
         },
         "required": [
            "spec",
            "sens"
         ],
         "title": "ModalityConfig",
         "type": "object"
      },
      "ModelConfig": {
         "description": "Define which of the ``lymph`` models to use and how to set them up.",
         "properties": {
            "external_file": {
               "anyOf": [
                  {
                     "format": "file-path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to a Python file that defines a model.",
               "title": "External File"
            },
            "class_name": {
               "default": "Unilateral",
               "description": "Name of the model class to use.",
               "enum": [
                  "Unilateral",
                  "Bilateral",
                  "Midline"
               ],
               "title": "Class Name",
               "type": "string"
            },
            "constructor": {
               "default": "binary",
               "description": "Trinary models differentiate btw. micro- and macroscopic disease.",
               "enum": [
                  "binary",
                  "trinary"
               ],
               "title": "Constructor",
               "type": "string"
            },
            "max_time": {
               "default": 10,
               "description": "Max. number of time-steps to evolve the model over.",
               "title": "Max Time",
               "type": "integer"
            },
            "named_params": {
               "default": null,
               "description": "Subset of valid model parameters a sampler may provide in the form of a dictionary to the model instead of as an array. Or, after sampling, with this list, one may safely recover which parameter corresponds to which index in the sample.",
               "items": {
                  "type": "string"
               },
               "title": "Named Params",
               "type": "array"
            },
            "kwargs": {
               "additionalProperties": true,
               "default": {},
               "description": "Additional keyword arguments to pass to the model constructor.",
               "title": "Kwargs",
               "type": "object"
            }
         },
         "title": "ModelConfig",
         "type": "object"
      },
      "SamplingConfig": {
         "description": "Settings to configure the MCMC sampling.",
         "properties": {
            "storage_file": {
               "description": "Path to HDF5 file store results or load last state.",
               "format": "path",
               "title": "Storage File",
               "type": "string"
            },
            "history_file": {
               "anyOf": [
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to store the burn-in metrics (as CSV file).",
               "title": "History File"
            },
            "dataset": {
               "default": "mcmc",
               "description": "Name of the dataset in the HDF5 file.",
               "title": "Dataset",
               "type": "string"
            },
            "cores": {
               "anyOf": [
                  {
                     "exclusiveMinimum": 0,
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": 2,
               "description": "Number of cores to use for parallel sampling. If `None`, no parallel processing is used.",
               "title": "Cores"
            },
            "seed": {
               "default": 42,
               "description": "Seed for the random number generator.",
               "title": "Seed",
               "type": "integer"
            },
            "walkers_per_dim": {
               "default": 20,
               "description": "Number of walkers per parameter space dimension.",
               "title": "Walkers Per Dim",
               "type": "integer"
            },
            "check_interval": {
               "default": 50,
               "description": "Check for convergence each time after this many steps.",
               "title": "Check Interval",
               "type": "integer"
            },
            "trust_factor": {
               "default": 50.0,
               "description": "Trust the autocorrelation time only when it's smaller than this factor times the length of the chain.",
               "title": "Trust Factor",
               "type": "number"
            },
            "relative_thresh": {
               "default": 0.05,
               "description": "Relative threshold for convergence.",
               "title": "Relative Thresh",
               "type": "number"
            },
            "burnin_steps": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Number of burn-in steps to take. If None, burn-in runs until convergence.",
               "title": "Burnin Steps"
            },
            "num_steps": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": 100,
               "description": "Number of steps to take in the MCMC sampling.",
               "title": "Num Steps"
            },
            "thin_by": {
               "default": 10,
               "description": "How many samples to draw before for saving one.",
               "title": "Thin By",
               "type": "integer"
            },
            "inverse_temp": {
               "default": 1.0,
               "description": "Inverse temperature for thermodynamic integration. Note that this is not yet fully implemented.",
               "title": "Inverse Temp",
               "type": "number"
            }
         },
         "required": [
            "storage_file"
         ],
         "title": "SamplingConfig",
         "type": "object"
      }
   },
   "required": [
      "graph",
      "data",
      "sampling"
   ]
}

field graph: GraphConfig [Required]#

field model: ModelConfig = ModelConfig(external_file=None, class_name='Unilateral', constructor='binary', max_time=10, named_params=None, kwargs={})#

field distributions: dict[str, DistributionConfig] = {}#: Mapping of model T-categories to predefined distributions over diagnose times.

field modalities: dict[str, ModalityConfig] = {}#: Maps names of diagnostic modalities to their specificity/sensitivity.

field data: DataConfig [Required]#

field sampling: SamplingConfig [Required]#

cli_cmd() → None[source]#

Start the sample subcommand.

First, it will construct the model from the graph and model arguments. Then, it will add distributions over diagnose times via the dictionary from the distributions argument. It will also set sensitivity and specificity of diagnostic modalities via the dictionary provided through the modalities argument. Finally, it will load the patient data as specified via the data argument.

When the model is constructed, an emcee.EnsembleSampler is initialized (see init_sampler()) and run_sampling() is executed twice: once for the burn-in phase and once for the actual sampling phase. The sampling argument provides all necessary settings for the sampling.

Command Help#

Usage: lyscripts sample [-h] [--configs list[Path]] [--graph [JSON]]
                        [--graph.tumor dict[str,list[str]]]
                        [--graph.lnl dict[str,list[str]]] [--model [JSON]]
                        [--model.external-file {Path,null}]
                        [--model.class-name {Unilateral,Bilateral,Midline}]
                        [--model.constructor {binary,trinary}]
                        [--model.max-time int]
                        [--model.named-params Sequence[str]]
                        [--model.kwargs dict[str,Any]]
                        [--distributions dict[str,JSON]]
                        [--modalities dict[str,JSON]] [--data [JSON]]
                        [--data.source [{Path,JSON}]] [--data.source.year int]
                        [--data.source.institution str]
                        [--data.source.subsite str]
                        [--data.source.repo-name {str,null}]
                        [--data.source.ref {str,null}]
                        [--data.source.local-dataset-dir {Path,null}]
                        [--data.side {{ipsi,contra},null}]
                        [--data.mapping dict[{{0,1,2,3,4},str},{int,str}]]
                        [--sampling [JSON]] [--sampling.storage-file Path]
                        [--sampling.history-file {Path,null}]
                        [--sampling.dataset str] [--sampling.cores {int,null}]
                        [--sampling.seed int] [--sampling.walkers-per-dim int]
                        [--sampling.check-interval int]
                        [--sampling.trust-factor float]
                        [--sampling.relative-thresh float]
                        [--sampling.burnin-steps {int,null}]
                        [--sampling.num-steps {int,null}]
                        [--sampling.thin-by int]
                        [--sampling.inverse-temp float]

Use MCMC to infer distributions over model parameters from data.

Options:
  -h, --help            show this help message and exit
  --configs list[Path]  Path to the YAML file(s) that contain the
                        configuration(s). Configs from YAML files may be
                        overwritten by command line arguments. When multiple
                        files are specified, the configs are merged in the
                        order they are given. Note that every config file must
                        have a `version: 1` key in it. (default:
                        ['config.yaml'])
  --distributions dict[str,JSON]
                        Mapping of model T-categories to predefined
                        distributions over diagnose times. (default: {})
  --modalities dict[str,JSON]
                        Maps names of diagnostic modalities to their
                        specificity/sensitivity. (default: {})

Graph Options:
  Specifies how the tumor(s) and LNLs are connected in a DAG.

  --graph [JSON]        set graph from JSON string (default: {})
  --graph.tumor dict[str,list[str]]
                        Define the name of the tumor(s) and which LNLs it/they
                        drain to. (required)
  --graph.lnl dict[str,list[str]]
                        Define the name of the LNL(s) and which LNLs it/they
                        drain to. (required)

Model Options:
  Define which of the ``lymph`` models to use and how to set them up.

  --model [JSON]        set model from JSON string (default: {})
  --model.external-file {Path,null}
                        Path to a Python file that defines a model. (default:
                        None)
  --model.class-name {Unilateral,Bilateral,Midline}
                        Name of the model class to use. (default: Unilateral)
  --model.constructor {binary,trinary}
                        Trinary models differentiate btw. micro- and
                        macroscopic disease. (default: binary)
  --model.max-time int  Max. number of time-steps to evolve the model over.
                        (default: 10)
  --model.named-params Sequence[str]
                        Subset of valid model parameters a sampler may provide
                        in the form of a dictionary to the model instead of as
                        an array. Or, after sampling, with this list, one may
                        safely recover which parameter corresponds to which
                        index in the sample. (default: None)
  --model.kwargs dict[str,Any]
                        Additional keyword arguments to pass to the model
                        constructor. (default: {})

Data Options:
  Where to load lymphatic progression data from and how to feed it into a
  model.

  --data [JSON]         set data from JSON string (default: {})
  --data.side {{ipsi,contra},null}
                        Side of the neck to load data for. Only for Unilateral
                        models. (default: null)
  --data.mapping dict[{{0,1,2,3,4},str},{int,str}]
                        Optional mapping of numeric T-stages to model
                        T-stages. (default factory: DataConfig.<lambda>)

Data.Source Options:
  Specification of a dataset.

  --data.source [{Path,JSON}]
                        set data.source from JSON string (default: {})
  --data.source.year int
                        Release year of dataset. (required)
  --data.source.institution str
                        Institution's short code. E.g., University Hospital
                        Zurich: `usz`. (required)
  --data.source.subsite str
                        Tumor subsite(s) patients in this dataset were
                        diagnosed with. (required)
  --data.source.repo-name {str,null}
                        GitHub `repository/owner`. (default:
                        lycosystem/lydata)
  --data.source.ref {str,null}
                        Branch/tag/commit of the repo. (default: main)
  --data.source.local-dataset-dir {Path,null}
                        Path to directory containing all the dataset
                        subdirectories. So, e.g. if `path_on_disk` is
                        `~/datasets` and the dataset is `2023-clb-multisite`,
                        then the CSV file is expected to be at
                        `~/datasets/2023-clb-multisite/data.csv`. (default:
                        null)

Sampling Options:
  Settings to configure the MCMC sampling.

  --sampling [JSON]     set sampling from JSON string (default: {})
  --sampling.storage-file Path
                        Path to HDF5 file store results or load last state.
                        (required)
  --sampling.history-file {Path,null}
                        Path to store the burn-in metrics (as CSV file).
                        (default: null)
  --sampling.dataset str
                        Name of the dataset in the HDF5 file. (default: mcmc)
  --sampling.cores {int,null}
                        Number of cores to use for parallel sampling. If
                        `None`, no parallel processing is used. (default: 2)
  --sampling.seed int   Seed for the random number generator. (default: 42)
  --sampling.walkers-per-dim int
                        Number of walkers per parameter space dimension.
                        (default: 20)
  --sampling.check-interval int
                        Check for convergence each time after this many steps.
                        (default: 50)
  --sampling.trust-factor float
                        Trust the autocorrelation time only when it's smaller
                        than this factor times the length of the chain.
                        (default: 50.0)
  --sampling.relative-thresh float
                        Relative threshold for convergence. (default: 0.05)
  --sampling.burnin-steps {int,null}
                        Number of burn-in steps to take. If None, burn-in runs
                        until convergence. (default: null)
  --sampling.num-steps {int,null}
                        Number of steps to take in the MCMC sampling.
                        (default: 100)
  --sampling.thin-by int
                        How many samples to draw before for saving one.
                        (default: 10)
  --sampling.inverse-temp float
                        Inverse temperature for thermodynamic integration.
                        Note that this is not yet fully implemented. (default:
                        1.0)

MCMC Sampling

Contents

MCMC Sampling#

Command Help#