Join Data Files

Join multiple lymphatic progression datasets into a single dataset.

pydantic settings lyscripts.data.join.JoinCLI

Bases: BaseCLI

Join multiple lymphatic progression datasets into a single dataset.

JSON schema:
{
   "title": "JoinCLI",
   "description": "Join multiple lymphatic progression datasets into a single dataset.",
   "type": "object",
   "properties": {
      "configs": {
         "default": [
            "config.yaml"
         ],
         "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
         "items": {
            "format": "path",
            "type": "string"
         },
         "title": "Configs",
         "type": "array"
      },
      "inputs": {
         "description": "The datasets to join.",
         "items": {
            "$ref": "#/$defs/DataConfig"
         },
         "title": "Inputs",
         "type": "array"
      },
      "output_file": {
         "description": "The path to the output dataset.",
         "format": "path",
         "title": "Output File",
         "type": "string"
      }
   },
   "$defs": {
      "DataConfig": {
         "description": "Where to load lymphatic progression data from and how to feed it into a model.",
         "properties": {
            "source": {
               "anyOf": [
                  {
                     "format": "file-path",
                     "type": "string"
                  },
                  {
                     "$ref": "#/$defs/LyDataset"
                  }
               ],
               "description": "Either a path to a CSV file or a config that specifies how and where to fetch the data from.",
               "title": "Source"
            },
            "side": {
               "anyOf": [
                  {
                     "enum": [
                        "ipsi",
                        "contra"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Side of the neck to load data for. Only for Unilateral models.",
               "title": "Side"
            },
            "mapping": {
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "integer"
                     },
                     {
                        "type": "string"
                     }
                  ]
               },
               "description": "Optional mapping of numeric T-stages to model T-stages.",
               "title": "Mapping",
               "type": "object"
            }
         },
         "required": [
            "source"
         ],
         "title": "DataConfig",
         "type": "object"
      },
      "LyDataset": {
         "description": "Specification of a dataset.",
         "properties": {
            "year": {
               "description": "Release year of dataset.",
               "exclusiveMinimum": 0,
               "maximum": 2026,
               "title": "Year",
               "type": "integer"
            },
            "institution": {
               "description": "Institution's short code. E.g., University Hospital Zurich: `usz`.",
               "minLength": 1,
               "title": "Institution",
               "type": "string"
            },
            "subsite": {
               "description": "Tumor subsite(s) patients in this dataset were diagnosed with.",
               "minLength": 1,
               "title": "Subsite",
               "type": "string"
            },
            "repo_name": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "lycosystem/lydata",
               "description": "GitHub `owner/repository`.",
               "title": "Repo Name"
            },
            "ref": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "main",
               "description": "Branch/tag/commit of the repo.",
               "title": "Ref"
            },
            "local_dataset_dir": {
               "anyOf": [
                  {
                     "format": "directory-path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to directory containing all the dataset subdirectories. So, e.g. if `local_dataset_dir` is `~/datasets` and the dataset is `2023-clb-multisite`, then the CSV file is expected to be at `~/datasets/2023-clb-multisite/data.csv`.",
               "title": "Local Dataset Dir"
            }
         },
         "required": [
            "year",
            "institution",
            "subsite"
         ],
         "title": "LyDataset",
         "type": "object"
      }
   },
   "required": [
      "inputs",
      "output_file"
   ]
}
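
The requirements the schema places on a single `inputs` entry can be mirrored in a few lines of plain Python. This is a minimal sketch for illustration, not the validation that lyscripts performs (that is done by pydantic). The function name `check_input_entry` is hypothetical, and only the `required` lists and the `side` enum from the schema above are checked.

```python
from typing import Any

# Required keys and allowed values copied from the JSON schema above.
REQUIRED_DATASET_KEYS = {"year", "institution", "subsite"}
ALLOWED_SIDES = ("ipsi", "contra", None)


def check_input_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one `inputs` entry (empty if none)."""
    if "source" not in entry:
        return ["missing required key 'source'"]

    problems = []
    source = entry["source"]
    if isinstance(source, dict):
        # A dataset specification: all three required keys must be present.
        missing = REQUIRED_DATASET_KEYS - source.keys()
        problems += [f"source is missing '{key}'" for key in sorted(missing)]
    elif not isinstance(source, str):
        problems.append("source must be a path string or a dataset specification")

    # `side` is optional; when given, it must be one of the enum values.
    if entry.get("side") not in ALLOWED_SIDES:
        problems.append("side must be 'ipsi', 'contra', or null")
    return problems
```

An entry such as `{"source": "file1.csv"}` yields an empty list, while an entry without `source` is reported immediately.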

field inputs: list[DataConfig] [Required]

The datasets to join.

field output_file: Path [Required]

The path to the output dataset.

cli_cmd() → None

Start the join subcommand.

This will load all datasets specified in the inputs attribute and concatenate them into a single dataset.
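
Conceptually, concatenating datasets means stacking their rows under one shared header. The following standard-library sketch illustrates that idea only. It is not how lyscripts implements the command, the helper `concatenate_csv` and the toy columns are hypothetical, and it assumes every file starts with the same header rows.

```python
import csv
import io


def concatenate_csv(streams, n_header_rows=1):
    """Concatenate CSV streams row-wise, keeping the header of the first one.

    Assumes every stream starts with the same `n_header_rows` header rows.
    """
    header, body = None, []
    for stream in streams:
        rows = list(csv.reader(stream))
        this_header, this_body = rows[:n_header_rows], rows[n_header_rows:]
        if header is None:
            header = this_header
        elif this_header != header:
            raise ValueError("datasets have different headers")
        body.extend(this_body)
    return (header or []) + body


# Toy data standing in for two dataset files (made-up columns).
first = io.StringIO("id,t_stage\n1,early\n2,late\n")
second = io.StringIO("id,t_stage\n3,early\n")
joined = concatenate_csv([first, second])
```

Here `joined` holds one header row followed by the three data rows in input order.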

Because the command's arguments are parsed with pydantic, each input must be passed as a JSON object. This makes the command a little more verbose, but also more powerful. To simply concatenate multiple datasets on disk, provide the inputs like this:

lyscripts data join \
--inputs '{"source": "file1.csv"}' \
--inputs '{"source": "file2.csv"}' \
--output-file "joined.csv"
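
When the list of files comes from a script, quoting the JSON by hand is error-prone. The sketch below assembles the same command as an argument list. `build_join_args` is a hypothetical helper, not part of lyscripts, and it relies only on the `--inputs` and `--output-file` options shown above.

```python
import json
import shlex


def build_join_args(sources, output_file):
    """Assemble the argument list for `lyscripts data join`.

    Each element of `sources` is either a CSV path (str) or a dict
    describing a dataset; it is wrapped as {"source": ...}.
    """
    args = ["lyscripts", "data", "join"]
    for source in sources:
        # One --inputs flag per dataset, each carrying a JSON object.
        args += ["--inputs", json.dumps({"source": source})]
    args += ["--output-file", str(output_file)]
    return args


args = build_join_args(["file1.csv", "file2.csv"], "joined.csv")
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(args))
```

The list can also be handed to `subprocess.run(args, check=True)` directly, which avoids shell quoting altogether.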

The command can also concatenate datasets fetched directly from the lydata GitHub repo. Because the resulting command signature is rather complex, we recommend defining what to concatenate in a YAML file:

version: 1
inputs:
  - source:
      year: 2021
      institution: "usz"
      subsite: "oropharynx"
  - source:
      year: 2021
      institution: "clb"
      subsite: "oropharynx"

Then, the command will look like this:

lyscripts data join --configs datasets.ly.yaml --output-file joined.csv

Command Help

Usage: lyscripts data join [-h] [--configs list[Path]] [--inputs list[JSON]]
                           [--output-file Path]

Join multiple lymphatic progression datasets into a single dataset.

Options:
  -h, --help            show this help message and exit
  --configs list[Path]  Path to the YAML file(s) that contain the
                        configuration(s). Configs from YAML files may be
                        overwritten by command line arguments. When multiple
                        files are specified, the configs are merged in the
                        order they are given. Note that every config file must
                        have a `version: 1` key in it. (default:
                        ['config.yaml'])
  --inputs list[JSON]   The datasets to join. (required)
  --output-file Path    The path to the output dataset. (required)