Data Commands/Helpers

Commands and functions for managing CSV data on patterns of lymphatic progression.

This module provides CLI commands for building quick and reproducible workflows, even when these are driven by language-agnostic tools like Make or DVC.

Most of these commands can load LyProX-style data from CSV files, from the installed datasets provided by the lydata package, or directly from the associated GitHub repository.
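As a rough illustration of the on-disk case: the `local_dataset_dir` description in the schema further down says that a dataset named `2023-clb-multisite` is expected at `<dir>/2023-clb-multisite/data.csv`. The sketch below assumes the dataset name is simply `{year}-{institution}-{subsite}`. The helper `local_csv_path` is hypothetical and not part of lyscripts or lydata.

```python
from pathlib import Path


def local_csv_path(
    local_dataset_dir: str, year: int, institution: str, subsite: str
) -> Path:
    """Build the expected CSV location of a dataset stored on disk.

    Assumption: the dataset folder is named `{year}-{institution}-{subsite}`,
    as the `2023-clb-multisite` example in the schema suggests.
    """
    dataset_name = f"{year}-{institution}-{subsite}"
    return Path(local_dataset_dir).expanduser() / dataset_name / "data.csv"


# Mirrors the example from the `local_dataset_dir` description.
path = local_csv_path("~/datasets", 2023, "clb", "multisite")
print(path)
```

The same three identifiers (`year`, `institution`, `subsite`) are the required fields of both `LyDataset` and `FetchCLI`, so they presumably select the dataset regardless of where it is loaded from.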

Another cool feature is the built-in mini web application for collecting nodal involvement data interactively, in the same standardized format we have published in the past, both on LyProX and in our GitHub repository. It can be launched by running lyscripts data collect in the terminal. See the docs for the lyscripts.data.collect submodule for more information.
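For scripted setups, the launch can also be wrapped in Python. This is a minimal sketch, assuming the `lyscripts` executable is on the PATH and that the defaults from the schema below (`localhost`, port `8000`) apply. The helpers `collector_url` and `launch_collector` are hypothetical names, and whether a `config.yaml` has to exist in the working directory is not covered here.

```python
import shutil
import subprocess
from typing import Optional


def collector_url(hostname: str = "localhost", port: int = 8000) -> str:
    """Return the address the web app would be served on (schema defaults)."""
    return f"http://{hostname}:{port}"


def launch_collector() -> Optional[subprocess.Popen]:
    """Start `lyscripts data collect` in the background, if it is installed."""
    if shutil.which("lyscripts") is None:
        return None  # lyscripts is not available in this environment
    # Same command as typed in the terminal, without extra options.
    return subprocess.Popen(["lyscripts", "data", "collect"])


print(f"With default settings, open {collector_url()} in a browser.")
```

Calling `launch_collector()` returns a process handle that can later be stopped with `terminate()`, or `None` when the executable cannot be found.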

pydantic settings lyscripts.data.DataCLI

Bases: BaseSettings

Work with lymphatic progression data through this CLI.

JSON schema:
{
   "title": "DataCLI",
   "description": "Work with lymphatic progression data through this CLI.",
   "type": "object",
   "properties": {
      "collect": {
         "anyOf": [
            {
               "$ref": "#/$defs/CollectorCLI"
            },
            {
               "type": "null"
            }
         ]
      },
      "lyproxify": {
         "anyOf": [
            {
               "$ref": "#/$defs/LyproxifyCLI"
            },
            {
               "type": "null"
            }
         ]
      },
      "join": {
         "anyOf": [
            {
               "$ref": "#/$defs/JoinCLI"
            },
            {
               "type": "null"
            }
         ]
      },
      "split": {
         "anyOf": [
            {
               "$ref": "#/$defs/SplitCLI"
            },
            {
               "type": "null"
            }
         ]
      },
      "fetch": {
         "anyOf": [
            {
               "$ref": "#/$defs/FetchCLI"
            },
            {
               "type": "null"
            }
         ]
      },
      "filter": {
         "anyOf": [
            {
               "$ref": "#/$defs/FilterCLI"
            },
            {
               "type": "null"
            }
         ]
      },
      "enhance": {
         "anyOf": [
            {
               "$ref": "#/$defs/EnhanceCLI"
            },
            {
               "type": "null"
            }
         ]
      },
      "generate": {
         "anyOf": [
            {
               "$ref": "#/$defs/GenerateCLI"
            },
            {
               "type": "null"
            }
         ]
      }
   },
   "$defs": {
      "CollectorCLI": {
         "description": "Serve a FastAPI web app for collecting involvement patterns as CSV files.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "hostname": {
               "default": "localhost",
               "description": "Hostname to run the FastAPI app on.",
               "title": "Hostname",
               "type": "string"
            },
            "port": {
               "default": 8000,
               "description": "Port to run the FastAPI app on.",
               "title": "Port",
               "type": "integer"
            }
         },
         "title": "CollectorCLI",
         "type": "object"
      },
      "CrossValidationConfig": {
         "description": "Configs for splitting a dataset into cross-validation folds.",
         "properties": {
            "seed": {
               "default": 42,
               "description": "Seed for the random number generator.",
               "title": "Seed",
               "type": "integer"
            },
            "folds": {
               "default": 5,
               "description": "Number of folds to split the dataset into.",
               "title": "Folds",
               "type": "integer"
            }
         },
         "title": "CrossValidationConfig",
         "type": "object"
      },
      "DataConfig": {
         "description": "Where to load lymphatic progression data from and how to feed it into a model.",
         "properties": {
            "source": {
               "anyOf": [
                  {
                     "format": "file-path",
                     "type": "string"
                  },
                  {
                     "$ref": "#/$defs/LyDataset"
                  }
               ],
               "description": "Either a path to a CSV file or a config that specifies how and where to fetch the data from.",
               "title": "Source"
            },
            "side": {
               "anyOf": [
                  {
                     "enum": [
                        "ipsi",
                        "contra"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Side of the neck to load data for. Only for Unilateral models.",
               "title": "Side"
            },
            "mapping": {
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "integer"
                     },
                     {
                        "type": "string"
                     }
                  ]
               },
               "description": "Optional mapping of numeric T-stages to model T-stages.",
               "title": "Mapping",
               "type": "object"
            }
         },
         "required": [
            "source"
         ],
         "title": "DataConfig",
         "type": "object"
      },
      "DistributionConfig": {
         "description": "Configuration defining a distribution over diagnose times.",
         "properties": {
            "kind": {
               "default": "frozen",
               "description": "Parametric distributions may be updated.",
               "enum": [
                  "frozen",
                  "parametric"
               ],
               "title": "Kind",
               "type": "string"
            },
            "func": {
               "const": "binomial",
               "default": "binomial",
               "description": "Name of predefined function to use as distribution.",
               "title": "Func",
               "type": "string"
            },
            "params": {
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "integer"
                     },
                     {
                        "type": "number"
                     }
                  ]
               },
               "default": {},
               "description": "Parameters to pass to the predefined function.",
               "title": "Params",
               "type": "object"
            }
         },
         "title": "DistributionConfig",
         "type": "object"
      },
      "EnhanceCLI": {
         "description": "Enhance the dataset by inferring additional columns from the data.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "input": {
               "$ref": "#/$defs/DataConfig"
            },
            "modalities": {
               "anyOf": [
                  {
                     "additionalProperties": {
                        "$ref": "#/$defs/ModalityConfig"
                     },
                     "type": "object"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "title": "Modalities"
            },
            "method": {
               "default": "max_llh",
               "enum": [
                  "max_llh",
                  "rank"
               ],
               "title": "Method",
               "type": "string"
            },
            "lnl_subdivisions": {
               "additionalProperties": {
                  "items": {
                     "type": "string"
                  },
                  "type": "array"
               },
               "default": {
                  "I": [
                     "a",
                     "b"
                  ],
                  "II": [
                     "a",
                     "b"
                  ],
                  "V": [
                     "a",
                     "b"
                  ]
               },
               "title": "Lnl Subdivisions",
               "type": "object"
            },
            "output_file": {
               "title": "Output File",
               "type": "string"
            }
         },
         "required": [
            "input",
            "output_file"
         ],
         "title": "EnhanceCLI",
         "type": "object"
      },
      "FetchCLI": {
         "description": "Fetch a specific dataset from the lyDATA repository.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "year": {
               "description": "Release year of dataset.",
               "exclusiveMinimum": 0,
               "maximum": 2026,
               "title": "Year",
               "type": "integer"
            },
            "institution": {
               "description": "Institution's short code. E.g., University Hospital Zurich: `usz`.",
               "minLength": 1,
               "title": "Institution",
               "type": "string"
            },
            "subsite": {
               "description": "Tumor subsite(s) patients in this dataset were diagnosed with.",
               "minLength": 1,
               "title": "Subsite",
               "type": "string"
            },
            "repo_name": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "lycosystem/lydata",
               "description": "GitHub `repository/owner`.",
               "title": "Repo Name"
            },
            "ref": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "main",
               "description": "Branch/tag/commit of the repo.",
               "title": "Ref"
            },
            "local_dataset_dir": {
               "anyOf": [
                  {
                     "format": "directory-path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to directory containing all the dataset subdirectories. So, e.g. if `path_on_disk` is `~/datasets` and the dataset is `2023-clb-multisite`, then the CSV file is expected to be at `~/datasets/2023-clb-multisite/data.csv`.",
               "title": "Local Dataset Dir"
            },
            "github_token": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "GitHub token to access private datasets. Can also be provided as `GITHUB_TOKEN` environment variable.",
               "title": "Github Token"
            },
            "github_user": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "GitHub user for non-token login. Can also be provided as `GITHUB_USER` environment variable.",
               "title": "Github User"
            },
            "github_password": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "GitHub password for non-token login. Can also be provided as `GITHUB_PASSWORD` environment variable.",
               "title": "Github Password"
            },
            "output_file": {
               "description": "The path to save the dataset to.",
               "format": "path",
               "title": "Output File",
               "type": "string"
            }
         },
         "required": [
            "year",
            "institution",
            "subsite",
            "output_file"
         ],
         "title": "FetchCLI",
         "type": "object"
      },
      "FilterCLI": {
         "description": "In- or exclude patients where a certain column fulfills a certain condition.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "input": {
               "$ref": "#/$defs/DataConfig"
            },
            "include": {
               "default": false,
               "description": "Include patients where the condition is met (default: exclude).",
               "title": "Include",
               "type": "boolean"
            },
            "column": {
               "anyOf": [
                  {
                     "items": {
                        "type": "string"
                     },
                     "type": "array"
                  },
                  {
                     "type": "string"
                  }
               ],
               "description": "The column to filter by. May be a tuple of three strings, since data has a three-level header. If it is only one string, the lydata package tries to map that to a three-level header.",
               "title": "Column"
            },
            "operator": {
               "description": "The operator to use for comparison.",
               "enum": [
                  "==",
                  "!=",
                  ">",
                  "<",
                  ">=",
                  "<=",
                  "in",
                  "contains"
               ],
               "title": "Operator",
               "type": "string"
            },
            "value": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "integer"
                  },
                  {
                     "type": "string"
                  }
               ],
               "description": "The value to compare against.",
               "title": "Value"
            },
            "output_file": {
               "description": "The path to save the filtered dataset to.",
               "format": "path",
               "title": "Output File",
               "type": "string"
            }
         },
         "required": [
            "input",
            "column",
            "operator",
            "value",
            "output_file"
         ],
         "title": "FilterCLI",
         "type": "object"
      },
      "GenerateCLI": {
         "description": "Settings for the command-line interface.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "graph": {
               "$ref": "#/$defs/GraphConfig"
            },
            "model": {
               "$ref": "#/$defs/ModelConfig",
               "default": {
                  "external_file": null,
                  "class_name": "Unilateral",
                  "constructor": "binary",
                  "max_time": 10,
                  "named_params": null,
                  "kwargs": {}
               }
            },
            "distributions": {
               "additionalProperties": {
                  "$ref": "#/$defs/DistributionConfig"
               },
               "default": {},
               "description": "Mapping of model T-categories to predefined distributions over diagnose times.",
               "title": "Distributions",
               "type": "object"
            },
            "t_stages_dist": {
               "additionalProperties": {
                  "type": "number"
               },
               "description": "Specify what fraction of generated patients should come from the respective T-Stage.",
               "title": "T Stages Dist",
               "type": "object"
            },
            "modalities": {
               "additionalProperties": {
                  "$ref": "#/$defs/ModalityConfig"
               },
               "title": "Modalities",
               "type": "object"
            },
            "params": {
               "additionalProperties": {
                  "type": "number"
               },
               "title": "Params",
               "type": "object"
            },
            "num_patients": {
               "default": 200,
               "title": "Num Patients",
               "type": "integer"
            },
            "output_file": {
               "title": "Output File",
               "type": "string"
            },
            "seed": {
               "default": 42,
               "title": "Seed",
               "type": "integer"
            }
         },
         "required": [
            "graph",
            "t_stages_dist",
            "modalities",
            "params",
            "output_file"
         ],
         "title": "GenerateCLI",
         "type": "object"
      },
      "GraphConfig": {
         "description": "Specifies how the tumor(s) and LNLs are connected in a DAG.",
         "properties": {
            "tumor": {
               "additionalProperties": {
                  "items": {
                     "type": "string"
                  },
                  "type": "array"
               },
               "description": "Define the name of the tumor(s) and which LNLs it/they drain to.",
               "title": "Tumor",
               "type": "object"
            },
            "lnl": {
               "additionalProperties": {
                  "items": {
                     "type": "string"
                  },
                  "type": "array"
               },
               "description": "Define the name of the LNL(s) and which LNLs it/they drain to.",
               "title": "Lnl",
               "type": "object"
            }
         },
         "required": [
            "tumor",
            "lnl"
         ],
         "title": "GraphConfig",
         "type": "object"
      },
      "JoinCLI": {
         "description": "Join multiple lymphatic progression datasets into a single dataset.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "inputs": {
               "description": "The datasets to join.",
               "items": {
                  "$ref": "#/$defs/DataConfig"
               },
               "title": "Inputs",
               "type": "array"
            },
            "output_file": {
               "description": "The path to the output dataset.",
               "format": "path",
               "title": "Output File",
               "type": "string"
            }
         },
         "required": [
            "inputs",
            "output_file"
         ],
         "title": "JoinCLI",
         "type": "object"
      },
      "LyDataset": {
         "description": "Specification of a dataset.",
         "properties": {
            "year": {
               "description": "Release year of dataset.",
               "exclusiveMinimum": 0,
               "maximum": 2026,
               "title": "Year",
               "type": "integer"
            },
            "institution": {
               "description": "Institution's short code. E.g., University Hospital Zurich: `usz`.",
               "minLength": 1,
               "title": "Institution",
               "type": "string"
            },
            "subsite": {
               "description": "Tumor subsite(s) patients in this dataset were diagnosed with.",
               "minLength": 1,
               "title": "Subsite",
               "type": "string"
            },
            "repo_name": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "lycosystem/lydata",
               "description": "GitHub `repository/owner`.",
               "title": "Repo Name"
            },
            "ref": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "main",
               "description": "Branch/tag/commit of the repo.",
               "title": "Ref"
            },
            "local_dataset_dir": {
               "anyOf": [
                  {
                     "format": "directory-path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to directory containing all the dataset subdirectories. So, e.g. if `path_on_disk` is `~/datasets` and the dataset is `2023-clb-multisite`, then the CSV file is expected to be at `~/datasets/2023-clb-multisite/data.csv`.",
               "title": "Local Dataset Dir"
            }
         },
         "required": [
            "year",
            "institution",
            "subsite"
         ],
         "title": "LyDataset",
         "type": "object"
      },
      "LyproxifyCLI": {
         "description": "Map any CSV file to the LyProX format with the help of a Python mapping dict.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "input_file": {
               "description": "Location of raw CSV data.",
               "format": "file-path",
               "title": "Input File",
               "type": "string"
            },
            "num_header_rows": {
               "default": 1,
               "description": "Number of rows comprising the header of the raw CSV file.",
               "title": "Num Header Rows",
               "type": "integer"
            },
            "mapping_file": {
               "description": "Location of Python file containing a `COLUMN_MAP` dictionary. It may also contain an `EXCLUDE` list of tuples `(column, check)` to exclude patients.",
               "format": "file-path",
               "title": "Mapping File",
               "type": "string"
            },
            "drop_rows": {
               "default": [],
               "description": "Delete rows of specified indices. Counting of rows start at 0 _after_ the `header-rows`.",
               "items": {
                  "type": "integer"
               },
               "title": "Drop Rows",
               "type": "array"
            },
            "drop_cols": {
               "default": [],
               "description": "Delete columns of specified indices.",
               "items": {
                  "type": "integer"
               },
               "title": "Drop Cols",
               "type": "array"
            },
            "output_file": {
               "description": "Location to store the lyproxified CSV file.",
               "format": "path",
               "title": "Output File",
               "type": "string"
            }
         },
         "required": [
            "input_file",
            "mapping_file",
            "output_file"
         ],
         "title": "LyproxifyCLI",
         "type": "object"
      },
      "ModalityConfig": {
         "description": "Define a diagnostic or pathological modality.",
         "properties": {
            "spec": {
               "description": "Specificity of the modality.",
               "maximum": 1.0,
               "minimum": 0.5,
               "title": "Spec",
               "type": "number"
            },
            "sens": {
               "description": "Sensitivity of the modality.",
               "maximum": 1.0,
               "minimum": 0.5,
               "title": "Sens",
               "type": "number"
            },
            "kind": {
               "default": "clinical",
               "description": "Clinical modalities cannot detect microscopic disease.",
               "enum": [
                  "clinical",
                  "pathological"
               ],
               "title": "Kind",
               "type": "string"
            }
         },
         "required": [
            "spec",
            "sens"
         ],
         "title": "ModalityConfig",
         "type": "object"
      },
      "ModelConfig": {
         "description": "Define which of the ``lymph`` models to use and how to set them up.",
         "properties": {
            "external_file": {
               "anyOf": [
                  {
                     "format": "file-path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to a Python file that defines a model.",
               "title": "External File"
            },
            "class_name": {
               "default": "Unilateral",
               "description": "Name of the model class to use.",
               "enum": [
                  "Unilateral",
                  "Bilateral",
                  "Midline"
               ],
               "title": "Class Name",
               "type": "string"
            },
            "constructor": {
               "default": "binary",
               "description": "Trinary models differentiate btw. micro- and macroscopic disease.",
               "enum": [
                  "binary",
                  "trinary"
               ],
               "title": "Constructor",
               "type": "string"
            },
            "max_time": {
               "default": 10,
               "description": "Max. number of time-steps to evolve the model over.",
               "title": "Max Time",
               "type": "integer"
            },
            "named_params": {
               "default": null,
               "description": "Subset of valid model parameters a sampler may provide in the form of a dictionary to the model instead of as an array. Or, after sampling, with this list, one may safely recover which parameter corresponds to which index in the sample.",
               "items": {
                  "type": "string"
               },
               "title": "Named Params",
               "type": "array"
            },
            "kwargs": {
               "additionalProperties": true,
               "default": {},
               "description": "Additional keyword arguments to pass to the model constructor.",
               "title": "Kwargs",
               "type": "object"
            }
         },
         "title": "ModelConfig",
         "type": "object"
      },
      "SplitCLI": {
         "description": "Split a dataset into cross-validation folds.",
         "properties": {
            "configs": {
               "default": [
                  "config.yaml"
               ],
               "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
               "items": {
                  "format": "path",
                  "type": "string"
               },
               "title": "Configs",
               "type": "array"
            },
            "input": {
               "$ref": "#/$defs/DataConfig"
            },
            "cross_validation": {
               "$ref": "#/$defs/CrossValidationConfig",
               "default": {
                  "seed": 42,
                  "folds": 5
               }
            },
            "output_dir": {
               "description": "The folder to store the split CSV files in.",
               "format": "path",
               "title": "Output Dir",
               "type": "string"
            }
         },
         "required": [
            "input",
            "output_dir"
         ],
         "title": "SplitCLI",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "collect",
      "lyproxify",
      "join",
      "split",
      "fetch",
      "filter",
      "enhance",
      "generate"
   ]
}
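
Every subcommand in the schema above shares the `configs` option: YAML files are merged in the order they are given, and each file must contain a `version: 1` key. The sketch below works on already-parsed dictionaries to stay free of dependencies. It assumes that later files override earlier ones and that nested mappings are merged recursively; neither detail is stated in the schema, and `merge_configs` is a hypothetical helper, not the lyscripts implementation.

```python
import copy
from typing import Any


def _deep_update(target: dict[str, Any], source: dict[str, Any]) -> None:
    """Copy `source` into `target`, descending into nested dictionaries."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _deep_update(target[key], value)
        else:
            # Deep copy so that the input configs are never mutated later on.
            target[key] = copy.deepcopy(value)


def merge_configs(configs: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge parsed configs in the given order; later values win (assumed)."""
    merged: dict[str, Any] = {}
    for index, config in enumerate(configs):
        if config.get("version") != 1:
            raise ValueError(f"config #{index} lacks the required `version: 1` key")
        _deep_update(merged, config)
    return merged


base = {"version": 1, "cross_validation": {"seed": 42, "folds": 5}}
override = {"version": 1, "cross_validation": {"folds": 10}}
print(merge_configs([base, override]))
```

According to the `configs` description, values from the YAML files may in turn be overwritten by command line arguments, which would correspond to one more update applied last.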

field collect: Annotated[CollectorCLI | None, _CliSubCommand] [Required]
field lyproxify: Annotated[LyproxifyCLI | None, _CliSubCommand] [Required]
field join: Annotated[JoinCLI | None, _CliSubCommand] [Required]
field split: Annotated[SplitCLI | None, _CliSubCommand] [Required]
field fetch: Annotated[FetchCLI | None, _CliSubCommand] [Required]
field filter: Annotated[FilterCLI | None, _CliSubCommand] [Required]
field enhance: Annotated[EnhanceCLI | None, _CliSubCommand] [Required]
field generate: Annotated[GenerateCLI | None, _CliSubCommand] [Required]
cli_cmd() → None

Run one of the data subcommands.
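
Each subcommand field is typed as `... | None`, which suggests that only the subcommand chosen on the command line is populated and then executed. The following sketch mimics that dispatch with plain dataclasses rather than pydantic. The classes `Collect`, `Fetch`, and `DataCommands` are hypothetical stand-ins, and the actual `cli_cmd()` of `DataCLI` may work differently.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Collect:
    port: int = 8000

    def cli_cmd(self) -> str:
        return f"collect on port {self.port}"


@dataclass
class Fetch:
    year: int = 2023

    def cli_cmd(self) -> str:
        return f"fetch dataset from {self.year}"


@dataclass
class DataCommands:
    # Stand-in for DataCLI: every subcommand slot is optional.
    collect: Optional[Collect] = None
    fetch: Optional[Fetch] = None

    def cli_cmd(self) -> str:
        """Run whichever subcommand was populated by the parser."""
        candidates = (getattr(self, field.name) for field in fields(self))
        chosen = [command for command in candidates if command is not None]
        if len(chosen) != 1:
            raise SystemExit("exactly one subcommand must be given")
        return chosen[0].cli_cmd()


print(DataCommands(fetch=Fetch(year=2021)).cli_cmd())
```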

Command Help

Usage: lyscripts data [-h]
                      {collect,lyproxify,join,split,fetch,filter,enhance,generate}
                      ...

Work with lymphatic progression data through this CLI.

Options:
  -h, --help            show this help message and exit

Subcommands:
  {collect,lyproxify,join,split,fetch,filter,enhance,generate}
    collect             Serve a FastAPI web app for collecting involvement
                        patterns as CSV files.
    lyproxify           Map any CSV file to the LyProX format with the help of
                        a Python mapping dict.
    join                Join multiple lymphatic progression datasets into a
                        single dataset.
    split               Split a dataset into cross-validation folds.
    fetch               Fetch a specific dataset from the lyDATA repository.
    filter              In- or exclude patients where a certain column
                        fulfills a certain condition.
    enhance             Enhance the dataset by inferring additional columns
                        from the data.
    generate            Settings for the command-line interface.
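
To make the `filter` subcommand more concrete: its schema lists the operators `==`, `!=`, `>`, `<`, `>=`, `<=`, `in`, and `contains`, plus an `include` flag that defaults to excluding matches. The sketch below applies that logic to a flat list of dictionaries. The meaning given to `in` and `contains` is an assumption, real data has a three-level header rather than flat keys, and `filter_patients` is a hypothetical helper, not the lyscripts API.

```python
import operator
from typing import Any, Callable

OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "in": lambda cell, value: cell in value,  # assumed: cell is a member of value
    "contains": lambda cell, value: value in cell,  # assumed: cell holds value
}


def filter_patients(
    patients: list[dict[str, Any]],
    column: str,
    op: str,
    value: Any,
    include: bool = False,
) -> list[dict[str, Any]]:
    """Keep (include=True) or drop (include=False) rows meeting the condition."""
    compare = OPERATORS[op]
    return [row for row in patients if bool(compare(row[column], value)) == include]


patients = [
    {"id": 1, "t_stage": 1},
    {"id": 2, "t_stage": 3},
    {"id": 3, "t_stage": 4},
]
# Default behaviour: exclude everyone with T-stage 3 or higher.
print(filter_patients(patients, "t_stage", ">=", 3))
```

Passing `include=True` inverts the selection, so the same condition would keep patients 2 and 3 instead.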

Submodules