Map to LyProX Format

Consumes raw data and transforms it into a CSV that LyProX understands.

To do so, it needs a dictionary that defines a mapping from raw columns to the LyProX style data format. See the documentation of the transform_to_lyprox() function for more information.

lyscripts.data.lyproxify.ensure_python_file(file: Path) -> Path

Check if the file is a Python file.

lyscripts.data.lyproxify.ensure_column_map(file: Path) -> Path

Ensure the Python file contains a COLUMN_MAP dictionary.

pydantic settings lyscripts.data.lyproxify.LyproxifyCLI

Bases: BaseCLI

Map any CSV file to the LyProX format with the help of a Python mapping dict.

JSON schema:
{
   "title": "LyproxifyCLI",
   "description": "Map any CSV file to the LyProX format with the help of a Python mapping dict.",
   "type": "object",
   "properties": {
      "configs": {
         "default": [
            "config.yaml"
         ],
         "description": "Path to the YAML file(s) that contain the configuration(s). Configs from YAML files may be overwritten by command line arguments. When multiple files are specified, the configs are merged in the order they are given. Note that every config file must have a `version: 1` key in it.",
         "items": {
            "format": "path",
            "type": "string"
         },
         "title": "Configs",
         "type": "array"
      },
      "input_file": {
         "description": "Location of raw CSV data.",
         "format": "file-path",
         "title": "Input File",
         "type": "string"
      },
      "num_header_rows": {
         "default": 1,
         "description": "Number of rows comprising the header of the raw CSV file.",
         "title": "Num Header Rows",
         "type": "integer"
      },
      "mapping_file": {
         "description": "Location of Python file containing a `COLUMN_MAP` dictionary. It may also contain an `EXCLUDE` list of tuples `(column, check)` to exclude patients.",
         "format": "file-path",
         "title": "Mapping File",
         "type": "string"
      },
      "drop_rows": {
         "default": [],
         "description": "Delete rows of specified indices. Counting of rows start at 0 _after_ the `header-rows`.",
         "items": {
            "type": "integer"
         },
         "title": "Drop Rows",
         "type": "array"
      },
      "drop_cols": {
         "default": [],
         "description": "Delete columns of specified indices.",
         "items": {
            "type": "integer"
         },
         "title": "Drop Cols",
         "type": "array"
      },
      "output_file": {
         "description": "Location to store the lyproxified CSV file.",
         "format": "path",
         "title": "Output File",
         "type": "string"
      }
   },
   "required": [
      "input_file",
      "mapping_file",
      "output_file"
   ]
}

field input_file: Annotated[Path, PathType(path_type=file)] [Required]

Location of raw CSV data.

field num_header_rows: int = 1

Number of rows comprising the header of the raw CSV file.
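To picture what num_header_rows describes, the following sketch reads a small in-memory CSV with a two-row header using plain pandas. The sample data is made up, and reading the header via `pd.read_csv(header=[...])` is an assumption for illustration; nothing here states that lyproxify loads files this way internally.

```python
import io

import pandas as pd

# Made-up raw data: the first TWO rows together form the header.
raw_csv = io.StringIO(
    "patient,patient,tumor\n"
    "age,sex,side\n"
    "43,female,left\n"
    "82,male,right\n"
)

num_header_rows = 2

# Passing a list of row positions makes pandas build a multi-level header.
table = pd.read_csv(raw_csv, header=list(range(num_header_rows)))

print(table.columns.tolist())
print(len(table))
```

With a multi-row header, each column is addressed by a tuple such as `("patient", "age")` rather than by a single string.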

field mapping_file: Annotated[Path, PathType(path_type=file), AfterValidator(func=ensure_python_file), AfterValidator(func=ensure_column_map)] [Required]

Location of Python file containing a COLUMN_MAP dictionary. It may also contain an EXCLUDE list of tuples (column, check) to exclude patients.
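As a rough sketch, a mapping file could look like the following. Only the names COLUMN_MAP and EXCLUDE and the instruction keys "func", "columns" and "default" come from this documentation; the file name, the helper `parse_age`, the raw column name "age" and the target column tuples are hypothetical.

```python
"""Hypothetical mapping file, e.g. saved as mapping.py."""


def parse_age(raw_age):
    """Hypothetical helper: convert the raw age cell to an integer."""
    return int(raw_age)


# One entry per column of the lyproxified table.
COLUMN_MAP = {
    ("patient", "core", "age"): {
        "func": parse_age,
        "columns": ["age"],  # raw column(s) passed to `func` positionally
    },
    ("patient", "core", "alcohol_abuse"): {
        "default": None,  # no raw source: fill the column with a constant
    },
}

# Optional: tuples of (raw column, check). Patients for which the check
# evaluates to True are excluded.
EXCLUDE = [
    ("age", lambda column: column < 18),
]
```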

field drop_rows: list[int] = []

Delete the rows at the specified indices. Row counting starts at 0 _after_ the header rows.

field drop_cols: list[int] = []

Delete columns of specified indices.
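The effect of drop_rows and drop_cols can be pictured with plain pandas. This is a minimal sketch with made-up data, not the code lyproxify runs; in particular, treating the column indices as zero-based positions is an assumption.

```python
import pandas as pd

# Made-up table as it might look after the header rows have been parsed.
table = pd.DataFrame({
    "id": ["a", "b", "c"],
    "age": [43, 82, 18],
    "comment": ["ok", "duplicate entry", "ok"],
})

drop_rows = [1]  # positions counted from 0, after the header
drop_cols = [2]  # assumed to be zero-based column positions

# Translate positions into labels, then drop them.
trimmed = (
    table
    .drop(index=table.index[drop_rows])
    .drop(columns=table.columns[drop_cols])
)

print(trimmed)
```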

field output_file: Path [Required]

Location to store the lyproxified CSV file.

cli_cmd() -> None

Start the lyproxify subcommand.

After reading in the specified file, it first drops the rows and columns listed in drop_rows and drop_cols, as specified in the command line arguments. Then, it calls exclude_patients(), which further removes patients based on the EXCLUDE object in the mapping_file. Finally, it calls transform_to_lyprox() to transform the data into the LyProX format according to the COLUMN_MAP object in the mapping_file.

exception lyscripts.data.lyproxify.ParsingError

Bases: Exception

Error while parsing the CSV file.

lyscripts.data.lyproxify.clean_header(table: DataFrame, num_cols: int, num_header_rows: int) -> DataFrame

Rename the header cells in the table.

lyscripts.data.lyproxify.get_instruction_depth(nested_column_map: dict[tuple, dict[str, Any]]) -> int

Get the depth at which the column mapping instructions are nested.

An instruction is a dictionary that contains either a "func" or a "default" key.

>>> nested_column_map = {"patient": {"age": {"func": int}}}
>>> get_instruction_depth(nested_column_map)
2
>>> flat_column_map = flatten(nested_column_map, max_depth=2)
>>> get_instruction_depth(flat_column_map)
1
>>> nested_column_map = {"patient": {"__doc__": "some patient info", "age": 61}}
>>> get_instruction_depth(nested_column_map)
Traceback (most recent call last):
    ...
ValueError: Leaf of column map must be a dictionary with 'func' or 'default' key.

lyscripts.data.lyproxify.generate_markdown_docs(nested_column_map: dict[tuple, dict[str, Any]], depth: int = 0, indent_len: int = 4) -> str

Generate a nested, ordered Markdown list as documentation for the column map.

A key in the dictionary is documented when its value is a dictionary containing a "__doc__" key.

>>> nested_column_map = {
...     "patient": {
...         "__doc__": "some patient info",
...         "age": {
...             "__doc__": "age of the patient",
...             "func": int,
...             "columns": ["age"],
...         },
...     },
... }
>>> generate_markdown_docs(nested_column_map)
'1. **`patient:`** some patient info\n    1. **`age:`** age of the patient\n'

lyscripts.data.lyproxify.transform_to_lyprox(raw: DataFrame, column_map: dict[tuple, dict[str, Any]]) -> DataFrame

Transform raw data into a table that can be uploaded directly to LyProX.

To do so, it uses the instructions in the column_map dictionary, which must have a particular structure:

For each column in the final ‘lyproxified’ pd.DataFrame, one entry must exist in the column_map dictionary. E.g., for the column corresponding to a patient’s age, the dictionary should contain a key-value pair of this shape:

column_map = {
    ("patient", "core", "age"): {
        "func": compute_age_from_raw,
        "kwargs": {"randomize": False},
        "columns": ["birthday", "date of diagnosis"]
    },
}

In this example, the function compute_age_from_raw is called with the values of the columns "birthday" and "date of diagnosis" as positional arguments, and the keyword argument "randomize" is set to False. The function then returns the patient’s age, which is subsequently stored in the column ("patient", "core", "age").

Note that each entry of the column_map dictionary must have either a "default" key or a "func" key. A "func" entry is accompanied by "columns" and "kwargs" as the function definition requires: if the function does not take any positional arguments, "columns" can be omitted, and if it does not take any keyword arguments, "kwargs" can be omitted, too.
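The rules above can be illustrated with a small stand-in that applies a column map to a raw table row by row. This is a sketch written for this page, not the implementation of transform_to_lyprox(): the function `apply_column_map`, the body of `compute_age_from_raw`, the "institution" column and the sample data are all hypothetical, and calling "func" once per row is an assumption.

```python
from datetime import date
from typing import Any

import pandas as pd


def compute_age_from_raw(birthday: str, diagnosis: str, randomize: bool = False) -> int:
    """Hypothetical helper: age in full years (`randomize` is ignored here)."""
    born, diagnosed = date.fromisoformat(birthday), date.fromisoformat(diagnosis)
    had_birthday = (diagnosed.month, diagnosed.day) >= (born.month, born.day)
    return diagnosed.year - born.year - (0 if had_birthday else 1)


def apply_column_map(raw: pd.DataFrame, column_map: dict[tuple, dict[str, Any]]) -> pd.DataFrame:
    """Illustrative stand-in, NOT the lyscripts implementation."""
    result = {}
    for target, instruction in column_map.items():
        if "func" in instruction:
            func = instruction["func"]
            columns = instruction.get("columns", [])  # optional
            kwargs = instruction.get("kwargs", {})    # optional
            # Raw column values become positional arguments, row by row.
            result[target] = [
                func(*(row[col] for col in columns), **kwargs)
                for _, row in raw.iterrows()
            ]
        elif "default" in instruction:
            result[target] = [instruction["default"]] * len(raw)
        else:
            raise ValueError("Instruction needs a 'func' or 'default' key.")
    return pd.DataFrame(result, index=raw.index)


raw = pd.DataFrame({
    "birthday": ["1960-05-17", "1985-11-02"],
    "date of diagnosis": ["2021-03-01", "2021-12-24"],
})
column_map = {
    ("patient", "core", "age"): {
        "func": compute_age_from_raw,
        "kwargs": {"randomize": False},
        "columns": ["birthday", "date of diagnosis"],
    },
    ("patient", "core", "institution"): {"default": "Hypothetical Clinic"},
}

lyproxified = apply_column_map(raw, column_map)
print(lyproxified)
```

Because the keys of the column map are tuples, the resulting table has multi-level column labels, one level per tuple element.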

lyscripts.data.lyproxify.leftright_to_ipsicontra(data: DataFrame)

Change absolute side reporting to tumor-relative.

Transform reporting of LNL involvement by absolute side (right & left) to a reporting relative to the tumor (ipsi- & contralateral). The table data should already be in the format LyProX requires, except for the side-reporting of LNL involvement.
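The idea behind the conversion can be sketched on a deliberately simplified table. The flat column names (`tumor_side`, `left_II`, ...) and the function `to_tumor_relative` are hypothetical and do not reflect the actual LyProX column layout or the implementation of leftright_to_ipsicontra(); the sketch also only handles tumors reported as "left" or "right".

```python
import pandas as pd


def to_tumor_relative(table: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical, simplified stand-in for the side conversion."""
    is_left = table["tumor_side"] == "left"
    result = table[["tumor_side"]].copy()
    for lnl in ("II", "III"):
        left, right = table[f"left_{lnl}"], table[f"right_{lnl}"]
        # Ipsilateral = same side as the tumor, contralateral = opposite side.
        result[f"ipsi_{lnl}"] = left.where(is_left, right)
        result[f"contra_{lnl}"] = right.where(is_left, left)
    return result


absolute = pd.DataFrame({
    "tumor_side": ["left", "right"],
    "left_II":   [True,  True],
    "right_II":  [False, False],
    "left_III":  [False, True],
    "right_III": [False, False],
})

relative = to_tumor_relative(absolute)
print(relative)
```

For the second patient the tumor is on the right, so the involvement reported for the left side ends up in the contralateral columns.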

lyscripts.data.lyproxify.exclude_patients(raw: DataFrame, exclude: list[tuple[str, Any]])

Exclude patients in the raw data based on a list of what to exclude.

The exclude list contains tuples (column, check). The check function will then exclude any patients from the cohort where check(raw[column]) evaluates to True.

>>> exclude = [("age", lambda s: s > 50)]
>>> table = pd.DataFrame({
...     "age":        [43, 82, 18, 67],
...     "T-category": [ 3,  4,  2,  1],
... })
>>> exclude_patients(table, exclude)
   age  T-category
0   43           3
2   18           2

Command Help

Usage: lyscripts data lyproxify [-h] [--configs list[Path]]
                                [--input-file Path] [--num-header-rows int]
                                [--mapping-file Path] [--drop-rows list[int]]
                                [--drop-cols list[int]] [--output-file Path]

Map any CSV file to the LyProX format with the help of a Python mapping dict.

Options:
  -h, --help            show this help message and exit
  --configs list[Path]  Path to the YAML file(s) that contain the
                        configuration(s). Configs from YAML files may be
                        overwritten by command line arguments. When multiple
                        files are specified, the configs are merged in the
                        order they are given. Note that every config file must
                        have a `version: 1` key in it. (default:
                        ['config.yaml'])
  --input-file Path     Location of raw CSV data. (required)
  --num-header-rows int
                        Number of rows comprising the header of the raw CSV
                        file. (default: 1)
  --mapping-file Path   Location of Python file containing a `COLUMN_MAP`
                        dictionary. It may also contain an `EXCLUDE` list of
                        tuples `(column, check)` to exclude patients.
                        (required)
  --drop-rows list[int]
                        Delete rows of specified indices. Counting of rows
                        start at 0 _after_ the `header-rows`. (default: [])
  --drop-cols list[int]
                        Delete columns of specified indices. (default: [])
  --output-file Path    Location to store the lyproxified CSV file. (required)
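
As a usage sketch, the command line can be assembled and launched from Python. The flags are the ones listed above; the helper names `build_lyproxify_command` and `run_lyproxify` and the file names are hypothetical, and actually running the command assumes that lyscripts is installed and that the files exist.

```python
import subprocess


def build_lyproxify_command(input_file, mapping_file, output_file, num_header_rows=1):
    """Hypothetical helper: assemble the argument list for the subcommand."""
    return [
        "lyscripts", "data", "lyproxify",
        "--input-file", str(input_file),
        "--num-header-rows", str(num_header_rows),
        "--mapping-file", str(mapping_file),
        "--output-file", str(output_file),
    ]


def run_lyproxify(*args, **kwargs):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_lyproxify_command(*args, **kwargs), check=True)


# Placeholder file names; nothing is executed here, the command is only printed.
command = build_lyproxify_command(
    "raw.csv", "mapping.py", "lyproxified.csv", num_header_rows=2
)
print(" ".join(command))
```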