Filtering Datasets

Filter a dataset according to some common criteria.

This is essentially a command-line interface for building a query object and applying it to the dataset.

pydantic settings lyscripts.data.filter.FilterCLI

Bases: BaseCLI

In- or exclude patients where a certain column fulfills a certain condition.

JSON schema:

{
   "title": "FilterCLI",
   "description": "In- or exclude patients where a certain column fulfills a certain condition.",
   "type": "object",
   "properties": {
      "configs": {
         "default": [
            "config.yaml"
         ],
         "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
         "items": {
            "format": "path",
            "type": "string"
         },
         "title": "Configs",
         "type": "array"
      },
      "input": {
         "$ref": "#/$defs/DataConfig"
      },
      "include": {
         "default": false,
         "description": "Include patients where the condition is met (default: exclude).",
         "title": "Include",
         "type": "boolean"
      },
      "column": {
         "anyOf": [
            {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            {
               "type": "string"
            }
         ],
         "description": "The column to filter by. May be a tuple of three strings, since data has a three-level header. If it is only one string, the lydata package tries to map that to a three-level header.",
         "title": "Column"
      },
      "operator": {
         "description": "The operator to use for comparison.",
         "enum": [
            "==",
            "!=",
            ">",
            "<",
            ">=",
            "<=",
            "in",
            "contains"
         ],
         "title": "Operator",
         "type": "string"
      },
      "value": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "integer"
            },
            {
               "type": "string"
            }
         ],
         "description": "The value to compare against.",
         "title": "Value"
      },
      "output_file": {
         "description": "The path to save the filtered dataset to.",
         "format": "path",
         "title": "Output File",
         "type": "string"
      }
   },
   "$defs": {
      "DataConfig": {
         "description": "Where to load lymphatic progression data from and how to feed it into a model.",
         "properties": {
            "source": {
               "anyOf": [
                  {
                     "format": "file-path",
                     "type": "string"
                  },
                  {
                     "$ref": "#/$defs/LyDataset"
                  }
               ],
               "description": "Either a path to a CSV file or a config that specifies how and where to fetch the data from.",
               "title": "Source"
            },
            "side": {
               "anyOf": [
                  {
                     "enum": [
                        "ipsi",
                        "contra"
                     ],
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Side of the neck to load data for. Only for Unilateral models.",
               "title": "Side"
            },
            "mapping": {
               "additionalProperties": {
                  "anyOf": [
                     {
                        "type": "integer"
                     },
                     {
                        "type": "string"
                     }
                  ]
               },
               "description": "Optional mapping of numeric T-stages to model T-stages.",
               "title": "Mapping",
               "type": "object"
            }
         },
         "required": [
            "source"
         ],
         "title": "DataConfig",
         "type": "object"
      },
      "LyDataset": {
         "description": "Specification of a dataset.",
         "properties": {
            "year": {
               "description": "Release year of dataset.",
               "exclusiveMinimum": 0,
               "maximum": 2026,
               "title": "Year",
               "type": "integer"
            },
            "institution": {
               "description": "Institution's short code. E.g., University Hospital Zurich: `usz`.",
               "minLength": 1,
               "title": "Institution",
               "type": "string"
            },
            "subsite": {
               "description": "Tumor subsite(s) patients in this dataset were diagnosed with.",
               "minLength": 1,
               "title": "Subsite",
               "type": "string"
            },
            "repo_name": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "lycosystem/lydata",
               "description": "GitHub `repository/owner`.",
               "title": "Repo Name"
            },
            "ref": {
               "anyOf": [
                  {
                     "minLength": 1,
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "main",
               "description": "Branch/tag/commit of the repo.",
               "title": "Ref"
            },
            "local_dataset_dir": {
               "anyOf": [
                  {
                     "format": "directory-path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Path to directory containing all the dataset subdirectories. So, e.g. if `path_on_disk` is `~/datasets` and the dataset is `2023-clb-multisite`, then the CSV file is expected to be at `~/datasets/2023-clb-multisite/data.csv`.",
               "title": "Local Dataset Dir"
            }
         },
         "required": [
            "year",
            "institution",
            "subsite"
         ],
         "title": "LyDataset",
         "type": "object"
      }
   },
   "required": [
      "input",
      "column",
      "operator",
      "value",
      "output_file"
   ]
}
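
The `configs` description in the schema says that multiple YAML files are merged in the order given and that each must carry a `version: 1` key. The sketch below shows one way such a merge could work on already-parsed dictionaries. The function names `merge_configs` and `_merge` are hypothetical, and the rule that later files override earlier ones is an assumption; the source only states that merging follows the given order.

```python
import copy


def _merge(base, update):
    """Return a copy of base with update merged in, descending into nested dicts."""
    result = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = _merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def merge_configs(configs):
    """Merge parsed config dicts in order (assumption: later values win)."""
    merged = {}
    for config in configs:
        # The schema requires a `version: 1` key in every config file.
        if config.get("version") != 1:
            raise ValueError("every config needs a `version: 1` key")
        merged = _merge(merged, config)
    return merged


first = {"version": 1, "input": {"source": "data.csv"}, "operator": "=="}
second = {"version": 1, "input": {"side": "ipsi"}, "operator": ">="}
merged = merge_configs([first, second])
print(merged)
```

Under these assumptions, nested sections such as `input` are combined key by key, while scalar settings such as `operator` take the value from the last file that sets them.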

field input: DataConfig [Required]

field include: Annotated[bool, _CliImplicitFlag] = False

Include patients where the condition is met (default: exclude).

field column: list[str] | str [Required]

The column to filter by. May be a tuple of three strings, since data has a three-level header. If it is only one string, the lydata package tries to map that to a three-level header.

field operator: Literal['==', '!=', '>', '<', '>=', '<=', 'in', 'contains'] [Required]

The operator to use for comparison.

field value: float | int | str [Required]

The value to compare against.

field output_file: Path [Required]

The path to save the filtered dataset to.
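
As a rough illustration of the two forms the `column` field accepts, the sketch below resolves a single string to a three-level header through a lookup table and passes a three-item sequence through unchanged. The table `SHORT_NAMES`, its entries, and the function `resolve_column` are hypothetical stand-ins; the actual mapping is done by the lydata package and may differ.

```python
# Hypothetical short-name table; the real mapping is defined by lydata.
SHORT_NAMES = {
    "age": ("patient", "#", "age"),
    "t_stage": ("tumor", "1", "t_stage"),
}


def resolve_column(column):
    """Return a three-level header tuple for a short name or a 3-item sequence."""
    if isinstance(column, str):
        if column not in SHORT_NAMES:
            raise KeyError(f"no three-level header known for {column!r}")
        return SHORT_NAMES[column]
    levels = tuple(column)
    if len(levels) != 3:
        raise ValueError(f"expected three header levels, got {len(levels)}")
    return levels


print(resolve_column("t_stage"))
print(resolve_column(["patient", "#", "age"]))
```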

model_post_init(_FilterCLI__context)

Cast the value to float; if that is not possible, to int; if that is not possible either, to str.
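
The cast order described for `model_post_init` can be sketched as a small fallback chain. This is not the library's implementation, only a minimal stand-alone illustration of the stated order; the helper name `cast_value` is hypothetical.

```python
def cast_value(raw):
    """Try float, then int, then fall back to str, in the documented order."""
    for caster in (float, int):
        try:
            return caster(raw)
        except (TypeError, ValueError):
            continue
    return str(raw)


print(cast_value("3"))      # 3.0
print(cast_value("early"))  # early
```

Because float is tried first in this sketch, a string such as `"3"` comes back as `3.0` rather than `3`, which matters when the column being compared holds integers or strings.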

cli_cmd()

Execute the filter command.

This command uses the Q objects of the lydata library to filter the dataset according to the given criteria.
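
To make the interplay of `operator`, `value`, and `include` concrete, here is a minimal sketch that filters a list of plain dictionaries instead of a real dataset. It does not use lydata's Q objects; the names `OPERATORS` and `filter_rows` are hypothetical, and the meanings given to `in` and `contains` are assumptions for this illustration.

```python
import operator

# Comparison functions keyed by the operator strings the CLI accepts.
# The semantics of "in" and "contains" are assumptions for this sketch.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "in": lambda cell, value: cell in value,
    "contains": lambda cell, value: value in cell,
}


def filter_rows(rows, column, op, value, include=False):
    """Keep rows that meet the condition if include is True, else drop them."""
    compare = OPERATORS[op]
    return [row for row in rows if bool(compare(row[column], value)) == include]


patients = [
    {"id": "p1", "age": 54},
    {"id": "p2", "age": 71},
    {"id": "p3", "age": 66},
]

excluded = filter_rows(patients, "age", ">=", 65)
included = filter_rows(patients, "age", ">=", 65, include=True)
print([p["id"] for p in excluded])  # ['p1']
print([p["id"] for p in included])  # ['p2', 'p3']
```

With the default `include=False`, the rows matching the condition are the ones removed, which mirrors the "default: exclude" behavior described for the `include` field.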

Command Help

Usage: lyscripts data filter [-h] [--configs list[Path]] [--input [JSON]]
                             [--input.source [{Path,JSON}]]
                             [--input.source.year int]
                             [--input.source.institution str]
                             [--input.source.subsite str]
                             [--input.source.repo-name {str,null}]
                             [--input.source.ref {str,null}]
                             [--input.source.local-dataset-dir {Path,null}]
                             [--input.side {{ipsi,contra},null}]
                             [--input.mapping dict[{{0,1,2,3,4},str},{int,str}]]
                             [--include | --no-include]
                             [--column {list[str],str}]
                             [--operator {==,!=,>,<,>=,<=,in,contains}]
                             [--value {float,int,str}] [--output-file Path]

In- or exclude patients where a certain column fulfills a certain condition.

Options:
  -h, --help            show this help message and exit
  --configs list[Path]  Path to the YAML file(s) that contain the
                        configuration(s). Configs from YAML files may be
                        overwritten by command line arguments. When multiple
                        files are specified, the configs are merged in the
                        order they are given. Note that every config file must
                        have a `version: 1` key in it. (default:
                        ['config.yaml'])
  --include, --no-include
                        Include patients where the condition is met (default:
                        exclude). (default: False)
  --column {list[str],str}
                        The column to filter by. May be a tuple of three
                        strings, since data has a three-level header. If it is
                        only one string, the lydata package tries to map that
                        to a three-level header. (required)
  --operator {==,!=,>,<,>=,<=,in,contains}
                        The operator to use for comparison. (required)
  --value {float,int,str}
                        The value to compare against. (required)
  --output-file Path    The path to save the filtered dataset to. (required)

Input Options:
  Where to load lymphatic progression data from and how to feed it into a
  model.

  --input [JSON]        set input from JSON string (default: {})
  --input.side {{ipsi,contra},null}
                        Side of the neck to load data for. Only for Unilateral
                        models. (default: null)
  --input.mapping dict[{{0,1,2,3,4},str},{int,str}]
                        Optional mapping of numeric T-stages to model
                        T-stages. (default factory: DataConfig.<lambda>)

Input.Source Options:
  Specification of a dataset.

  --input.source [{Path,JSON}]
                        set input.source from JSON string (default: {})
  --input.source.year int
                        Release year of dataset. (required)
  --input.source.institution str
                        Institution's short code. E.g., University Hospital
                        Zurich: `usz`. (required)
  --input.source.subsite str
                        Tumor subsite(s) patients in this dataset were
                        diagnosed with. (required)
  --input.source.repo-name {str,null}
                        GitHub `repository/owner`. (default:
                        lycosystem/lydata)
  --input.source.ref {str,null}
                        Branch/tag/commit of the repo. (default: main)
  --input.source.local-dataset-dir {Path,null}
                        Path to directory containing all the dataset
                        subdirectories. So, e.g. if `path_on_disk` is
                        `~/datasets` and the dataset is `2023-clb-multisite`,
                        then the CSV file is expected to be at
                        `~/datasets/2023-clb-multisite/data.csv`. (default:
                        null)
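
The flags listed in the help text can be assembled into a full invocation programmatically. The sketch below only builds and prints the argument list; it does not run the command. The helper `build_filter_command` is hypothetical, and the dataset values (`2021`, `usz`, `oropharynx`), the column name `t_stage`, and the output path are illustrative assumptions rather than values taken from the documentation.

```python
import shlex


def build_filter_command(options, include=False):
    """Compose an argument list for `lyscripts data filter` from flag/value pairs."""
    argv = ["lyscripts", "data", "filter"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    # The help text documents --include / --no-include as a boolean flag pair.
    argv.append("--include" if include else "--no-include")
    return argv


argv = build_filter_command(
    {
        "input.source.year": 2021,
        "input.source.institution": "usz",
        "input.source.subsite": "oropharynx",
        "column": "t_stage",
        "operator": ">=",
        "value": 3,
        "output-file": "filtered.csv",
    }
)
# shlex.join quotes arguments such as ">=" so the line is safe to paste into a shell.
print(shlex.join(argv))
```

Keeping the arguments as a list avoids shell quoting problems with operators like `>` and `<`, which a shell would otherwise interpret as redirections.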
