Infer Additional Data Columns
Enhance the dataset by inferring additional columns from the data.
This is a command-line interface to the methods combine() and augment() of the LyDataAccessor class.
- pydantic settings lyscripts.data.enhance.EnhanceCLI
Bases: BaseCLI
Enhance the dataset by inferring additional columns from the data.
JSON schema:
{
  "title": "EnhanceCLI",
  "description": "Enhance the dataset by inferring additional columns from the data.",
  "type": "object",
  "properties": {
    "configs": {
      "default": ["config.yaml"],
      "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
      "items": {"format": "path", "type": "string"},
      "title": "Configs",
      "type": "array"
    },
    "input": {"$ref": "#/$defs/DataConfig"},
    "modalities": {
      "anyOf": [
        {"additionalProperties": {"$ref": "#/$defs/ModalityConfig"}, "type": "object"},
        {"type": "null"}
      ],
      "default": null,
      "title": "Modalities"
    },
    "method": {
      "default": "max_llh",
      "enum": ["max_llh", "rank"],
      "title": "Method",
      "type": "string"
    },
    "lnl_subdivisions": {
      "additionalProperties": {"items": {"type": "string"}, "type": "array"},
      "default": {"I": ["a", "b"], "II": ["a", "b"], "V": ["a", "b"]},
      "title": "Lnl Subdivisions",
      "type": "object"
    },
    "output_file": {"title": "Output File", "type": "string"}
  },
  "$defs": {
    "DataConfig": {
      "description": "Where to load lymphatic progression data from and how to feed it into a model.",
      "properties": {
        "source": {
          "anyOf": [
            {"format": "file-path", "type": "string"},
            {"$ref": "#/$defs/LyDataset"}
          ],
          "description": "Either a path to a CSV file or a config that specifies how and where to fetch the data from.",
          "title": "Source"
        },
        "side": {
          "anyOf": [
            {"enum": ["ipsi", "contra"], "type": "string"},
            {"type": "null"}
          ],
          "default": null,
          "description": "Side of the neck to load data for. Only for Unilateral models.",
          "title": "Side"
        },
        "mapping": {
          "additionalProperties": {"anyOf": [{"type": "integer"}, {"type": "string"}]},
          "description": "Optional mapping of numeric T-stages to model T-stages.",
          "title": "Mapping",
          "type": "object"
        }
      },
      "required": ["source"],
      "title": "DataConfig",
      "type": "object"
    },
    "LyDataset": {
      "description": "Specification of a dataset.",
      "properties": {
        "year": {
          "description": "Release year of dataset.",
          "exclusiveMinimum": 0,
          "maximum": 2026,
          "title": "Year",
          "type": "integer"
        },
        "institution": {
          "description": "Institution's short code. E.g., University Hospital Zurich: `usz`.",
          "minLength": 1,
          "title": "Institution",
          "type": "string"
        },
        "subsite": {
          "description": "Tumor subsite(s) patients in this dataset were diagnosed with.",
          "minLength": 1,
          "title": "Subsite",
          "type": "string"
        },
        "repo_name": {
          "anyOf": [
            {"minLength": 1, "type": "string"},
            {"type": "null"}
          ],
          "default": "lycosystem/lydata",
          "description": "GitHub `repository/owner`.",
          "title": "Repo Name"
        },
        "ref": {
          "anyOf": [
            {"minLength": 1, "type": "string"},
            {"type": "null"}
          ],
          "default": "main",
          "description": "Branch/tag/commit of the repo.",
          "title": "Ref"
        },
        "local_dataset_dir": {
          "anyOf": [
            {"format": "directory-path", "type": "string"},
            {"type": "null"}
          ],
          "default": null,
          "description": "Path to directory containing all the dataset subdirectories. So, e.g. if `path_on_disk` is `~/datasets` and the dataset is `2023-clb-multisite`, then the CSV file is expected to be at `~/datasets/2023-clb-multisite/data.csv`.",
          "title": "Local Dataset Dir"
        }
      },
      "required": ["year", "institution", "subsite"],
      "title": "LyDataset",
      "type": "object"
    },
    "ModalityConfig": {
      "description": "Define a diagnostic or pathological modality.",
      "properties": {
        "spec": {
          "description": "Specificity of the modality.",
          "maximum": 1.0,
          "minimum": 0.5,
          "title": "Spec",
          "type": "number"
        },
        "sens": {
          "description": "Sensitivity of the modality.",
          "maximum": 1.0,
          "minimum": 0.5,
          "title": "Sens",
          "type": "number"
        },
        "kind": {
          "default": "clinical",
          "description": "Clinical modalities cannot detect microscopic disease.",
          "enum": ["clinical", "pathological"],
          "title": "Kind",
          "type": "string"
        }
      },
      "required": ["spec", "sens"],
      "title": "ModalityConfig",
      "type": "object"
    }
  },
  "required": ["input", "output_file"]
}
- field input: DataConfig [Required]
- field modalities: dict[str, ModalityConfig] | None = None
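
The schema bounds each modality's `spec` and `sens` to the range 0.5 to 1.0 and restricts `kind` to `clinical` or `pathological`. As a minimal sketch, assuming the top-level keys of a config file mirror the field names of `EnhanceCLI`, the following builds such a configuration as a plain dictionary and checks the modality bounds by hand. The modality names and their numeric values, the output name `enhanced.csv`, and the helper `check_modalities` are illustrative assumptions and not part of lyscripts; the dataset specification reuses the `2023-clb-multisite` example from the schema.

```python
import json

# Illustrative configuration. Only the key names, defaults, and bounds
# are taken from the schema; all concrete values are assumptions.
config = {
    "version": 1,  # every config file must contain this key
    "input": {
        "source": {"year": 2023, "institution": "clb", "subsite": "multisite"},
    },
    "modalities": {
        "CT": {"spec": 0.76, "sens": 0.81, "kind": "clinical"},
        "pathology": {"spec": 1.0, "sens": 1.0, "kind": "pathological"},
    },
    "method": "max_llh",  # schema default; the alternative is "rank"
    "lnl_subdivisions": {"I": ["a", "b"], "II": ["a", "b"], "V": ["a", "b"]},
    "output_file": "enhanced.csv",
}


def check_modalities(modalities):
    """Return a list of schema violations (hypothetical helper, empty if none)."""
    problems = []
    for name, mod in modalities.items():
        for key in ("spec", "sens"):
            value = mod.get(key)
            if value is None:
                problems.append(f"{name}: missing required '{key}'")
            elif not 0.5 <= value <= 1.0:
                problems.append(f"{name}: '{key}'={value} is outside [0.5, 1.0]")
        # "kind" is optional and defaults to "clinical" in the schema.
        if mod.get("kind", "clinical") not in ("clinical", "pathological"):
            problems.append(f"{name}: unknown kind {mod['kind']!r}")
    return problems


problems = check_modalities(config["modalities"])
print(problems or "modalities satisfy the schema bounds")
print(json.dumps(config, indent=2))  # inspect the structure before writing it as YAML
```

The printed structure can be transcribed into a YAML file and passed via `--configs`; the real validation is done by the pydantic models when the command runs.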
Command Help
Usage: lyscripts data enhance [-h] [--configs list[Path]] [--input [JSON]]
                             [--input.source [{Path,JSON}]]
                             [--input.source.year int]
                             [--input.source.institution str]
                             [--input.source.subsite str]
                             [--input.source.repo-name {str,null}]
                             [--input.source.ref {str,null}]
                             [--input.source.local-dataset-dir {Path,null}]
                             [--input.side {{ipsi,contra},null}]
                             [--input.mapping dict[{{0,1,2,3,4},str},{int,str}]]
                             [--modalities {dict[str,JSON],null}]
                             [--method {max_llh,rank}]
                             [--lnl-subdivisions dict[str,list[str]]]
                             [--output-file str]
Enhance the dataset by inferring additional columns from the data.
Options:
  -h, --help            show this help message and exit
  --configs list[Path]  Path to the YAML file(s) that contain the
                        configuration(s). Configs from YAML files may be
                        overwritten by command line arguments. When multiple
                        files are specified, the configs are merged in the
                        order they are given. Note that every config file must
                        have a `version: 1` key in it. (default:
                        ['config.yaml'])
  --modalities {dict[str,JSON],null}
                        (default: null)
  --method {max_llh,rank}
                        (default: max_llh)
  --lnl-subdivisions dict[str,list[str]]
                        (default: {'I': ['a', 'b'], 'II': ['a', 'b'], 'V':
                        ['a', 'b']})
  --output-file str     (required)
Input Options:
  Where to load lymphatic progression data from and how to feed it into a
  model.
  --input [JSON]        set input from JSON string (default: {})
  --input.side {{ipsi,contra},null}
                        Side of the neck to load data for. Only for Unilateral
                        models. (default: null)
  --input.mapping dict[{{0,1,2,3,4},str},{int,str}]
                        Optional mapping of numeric T-stages to model
                        T-stages. (default factory: DataConfig.<lambda>)
Input.Source Options:
  Specification of a dataset.
  --input.source [{Path,JSON}]
                        set input.source from JSON string (default: {})
  --input.source.year int
                        Release year of dataset. (required)
  --input.source.institution str
                        Institution's short code. E.g., University Hospital
                        Zurich: `usz`. (required)
  --input.source.subsite str
                        Tumor subsite(s) patients in this dataset were
                        diagnosed with. (required)
  --input.source.repo-name {str,null}
                        GitHub `repository/owner`. (default:
                        lycosystem/lydata)
  --input.source.ref {str,null}
                        Branch/tag/commit of the repo. (default: main)
  --input.source.local-dataset-dir {Path,null}
                        Path to directory containing all the dataset
                        subdirectories. So, e.g. if `path_on_disk` is
                        `~/datasets` and the dataset is `2023-clb-multisite`,
                        then the CSV file is expected to be at
                        `~/datasets/2023-clb-multisite/data.csv`. (default:
                        null)
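
To illustrate how the options above combine, here is a small sketch that assembles an invocation from Python. It uses only flags listed in the help text and assumes that the required `input` settings are supplied by the config file rather than on the command line. The helper name `build_enhance_command` and the file names are hypothetical.

```python
import shlex

ALLOWED_METHODS = ("max_llh", "rank")  # choices listed for --method


def build_enhance_command(config_path, output_file, method="max_llh"):
    """Assemble the argument list for `lyscripts data enhance` (hypothetical helper)."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {ALLOWED_METHODS}, got {method!r}")
    return [
        "lyscripts", "data", "enhance",
        "--configs", str(config_path),
        "--method", method,
        "--output-file", str(output_file),
    ]


cmd = build_enhance_command("config.yaml", "enhanced.csv", method="rank")
print(shlex.join(cmd))
# prints: lyscripts data enhance --configs config.yaml --method rank --output-file enhanced.csv

# To execute the command (requires lyscripts to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.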