Join Data Files
Join multiple lymphatic progression datasets into a single dataset.
- pydantic settings lyscripts.data.join.JoinCLI
Bases: BaseCLI
Join multiple lymphatic progression datasets into a single dataset.
JSON schema:
{
  "title": "JoinCLI",
  "description": "Join multiple lymphatic progression datasets into a single dataset.",
  "type": "object",
  "properties": {
    "configs": {
      "default": ["config.yaml"],
      "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
      "items": {"format": "path", "type": "string"},
      "title": "Configs",
      "type": "array"
    },
    "inputs": {
      "description": "The datasets to join.",
      "items": {"$ref": "#/$defs/DataConfig"},
      "title": "Inputs",
      "type": "array"
    },
    "output_file": {
      "description": "The path to the output dataset.",
      "format": "path",
      "title": "Output File",
      "type": "string"
    }
  },
  "$defs": {
    "DataConfig": {
      "description": "Where to load lymphatic progression data from and how to feed it into a model.",
      "properties": {
        "source": {
          "anyOf": [
            {"format": "file-path", "type": "string"},
            {"$ref": "#/$defs/LyDataset"}
          ],
          "description": "Either a path to a CSV file or a config that specifies how and where to fetch the data from.",
          "title": "Source"
        },
        "side": {
          "anyOf": [
            {"enum": ["ipsi", "contra"], "type": "string"},
            {"type": "null"}
          ],
          "default": null,
          "description": "Side of the neck to load data for. Only for Unilateral models.",
          "title": "Side"
        },
        "mapping": {
          "additionalProperties": {
            "anyOf": [{"type": "integer"}, {"type": "string"}]
          },
          "description": "Optional mapping of numeric T-stages to model T-stages.",
          "title": "Mapping",
          "type": "object"
        }
      },
      "required": ["source"],
      "title": "DataConfig",
      "type": "object"
    },
    "LyDataset": {
      "description": "Specification of a dataset.",
      "properties": {
        "year": {
          "description": "Release year of dataset.",
          "exclusiveMinimum": 0,
          "maximum": 2026,
          "title": "Year",
          "type": "integer"
        },
        "institution": {
          "description": "Institution's short code. E.g., University Hospital Zurich: `usz`.",
          "minLength": 1,
          "title": "Institution",
          "type": "string"
        },
        "subsite": {
          "description": "Tumor subsite(s) patients in this dataset were diagnosed with.",
          "minLength": 1,
          "title": "Subsite",
          "type": "string"
        },
        "repo_name": {
          "anyOf": [
            {"minLength": 1, "type": "string"},
            {"type": "null"}
          ],
          "default": "lycosystem/lydata",
          "description": "GitHub `repository/owner`.",
          "title": "Repo Name"
        },
        "ref": {
          "anyOf": [
            {"minLength": 1, "type": "string"},
            {"type": "null"}
          ],
          "default": "main",
          "description": "Branch/tag/commit of the repo.",
          "title": "Ref"
        },
        "local_dataset_dir": {
          "anyOf": [
            {"format": "directory-path", "type": "string"},
            {"type": "null"}
          ],
          "default": null,
          "description": "Path to directory containing all the dataset subdirectories. So, e.g. if `path_on_disk` is `~/datasets` and the dataset is `2023-clb-multisite`, then the CSV file is expected to be at `~/datasets/2023-clb-multisite/data.csv`.",
          "title": "Local Dataset Dir"
        }
      },
      "required": ["year", "institution", "subsite"],
      "title": "LyDataset",
      "type": "object"
    }
  },
  "required": ["inputs", "output_file"]
}
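The `local_dataset_dir` description in the schema implies a simple directory layout. The following is a minimal sketch of that lookup, not lyscripts code: the helper name `local_csv_path` is hypothetical, and the rule that the dataset directory is named `<year>-<institution>-<subsite>` is an assumption inferred from the `2023-clb-multisite` example.

```python
from pathlib import Path


def local_csv_path(local_dataset_dir: str, year: int, institution: str, subsite: str) -> Path:
    """Return the CSV path expected under ``local_dataset_dir``.

    Assumption: the dataset directory is named ``<year>-<institution>-<subsite>``,
    as suggested by the ``2023-clb-multisite`` example in the schema.
    """
    dataset_name = f"{year}-{institution}-{subsite}"
    # The schema description places the file at <dir>/<dataset>/data.csv.
    return Path(local_dataset_dir) / dataset_name / "data.csv"


path = local_csv_path("~/datasets", 2023, "clb", "multisite")
print(path)
```

The sketch leaves `~` unexpanded; call `Path.expanduser()` on the result if an absolute path is needed.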
- field inputs: list[DataConfig] [Required]
The datasets to join.
- cli_cmd() -> None
Start the join subcommand.
This loads all datasets specified in the inputs attribute and concatenates them into a single dataset.
Because the command is built on pydantic, its signature is somewhat more complicated (but also more powerful) than a plain list of file names. To simply concatenate multiple datasets on disk, provide each one as a JSON object via --inputs:
lyscripts data join \
  --inputs '{"source": "file1.csv"}' \
  --inputs '{"source": "file2.csv"}' \
  --output-file "joined.csv"
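Writing the JSON objects by hand invites quoting mistakes. As a hedged sketch, the argument list could be assembled in Python instead; `build_join_command` is a hypothetical helper, not part of lyscripts, and it only reproduces the `--inputs` and `--output-file` flags documented here.

```python
import json
import shlex


def build_join_command(csv_files: list[str], output_file: str) -> list[str]:
    """Assemble the argument list for ``lyscripts data join``."""
    cmd = ["lyscripts", "data", "join"]
    for csv_file in csv_files:
        # Each input is passed as one JSON object with a `source` key.
        cmd += ["--inputs", json.dumps({"source": csv_file})]
    cmd += ["--output-file", output_file]
    return cmd


cmd = build_join_command(["file1.csv", "file2.csv"], "joined.csv")
# shlex.join produces a shell-quoted string that can be pasted into a terminal.
print(shlex.join(cmd))
```

The list form can be handed to `subprocess.run(cmd, check=True)` directly, which avoids shell quoting altogether.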
The command can also concatenate datasets fetched directly from the lydata GitHub repo. Because the command signature is rather complex, we recommend defining what to concatenate in a YAML file:
inputs:
  - data.year: 2021
    data.institution: "usz"
    data.subsite: "oropharynx"
  - data.year: 2021
    data.institution: "clb"
    data.subsite: "oropharynx"
Then, the command will look like this:
lyscripts data join --configs datasets.ly.yaml --output-file joined.csv
Command Help
Usage: lyscripts data join [-h] [--configs list[Path]] [--inputs list[JSON]]
                           [--output-file Path]

Join multiple lymphatic progression datasets into a single dataset.

Options:
  -h, --help            show this help message and exit
  --configs list[Path]  Path to the YAML file(s) that contain the
                        configuration(s). Configs from YAML files may be
                        overwritten by command line arguments. When multiple
                        files are specified, the configs are merged in the
                        order they are given. Note that every config file must
                        have a `version: 1` key in it. (default:
                        ['config.yaml'])
  --inputs list[JSON]   The datasets to join. (required)
  --output-file Path    The path to the output dataset. (required)