# Challenge Data

### Predicting lung cancer survival time by OWKIN

#### Description

##### Dates

Started on Jan. 6, 2020

# Context

## Owkin mission

Owkin works to improve patient treatment, drug discovery and augment medical research by developing machine-learning models on real-world clinical data. On this clinical data, our company build machine learning algorithms to predict toxicity, resistance, and sensitivity to treatments outcomes and diseases evolution. Owkin provides cutting edge expertise in machine learning technologies, specifically deep learning on histopathology and radiology images, as well as multimodal data integration with omics and clinical data. We believe that by combining large amounts of medical data with advanced machine learning algorithms, we can better understand diseases and heterogeneity of treatment effects. At the heart of the medicine of the future, OWKIN collaborates with pharmaceutical companies (Roche, Novartis, Amgen, Ipsen) and developed partnerships with hospitals and first-rank academic research laboratories, such as Institut Curie, Inserm, or Centre Léon Berard. Thanks to those outstanding partnerships, Owkin was able to recently publish breakthrough discovery in ​Nature Medicine​.

## Clinical context

The goal of this challenge is to develop new methods to predict survival time of patients diagnosed with lung cancer, based 3-dimensional radiology images (Computed Tomography scanner). Radiology data refers to various kinds of imaging modalities such as MRI, Echography or CT scans. In a clinical context, such modalities are very valuable because, unlike biopsies, those images are non invasive​: the pathology and the patient’s body remain untouched. Furthermore, those are much cheaper to operate and can, hence, be acquired at multiple timepoints and be used to track the disease through time. CT scan is a widely spread and popular exam in oncology: it reflects the density of the tissues of the human body. It is, then, adapted to the study of lung cancer because lungs are mostly filled with air (low density) while tumors are made of dense tissues.

Lung Cancer is one of the most represented cancer type (around 230.000 cases per year in the US) among which 80% are identified as Non Small Cell Lung Cancer (NSCLC). This subtype of lung cancer is characterized by peculiarly large tumors, hence called masses​, which can occupy significant portions of the total lung volumes, involving serious threats to the patient’s breathing abilities. Non Small Cell Lung Cancer can itself be split into four major subtypes based on histology observations: squamous cell carcinoma, large cell carcinoma, adenocarcinoma and a mixture of all.

When assessing the disease, radiologists will, among other aspects, look at the evolution through time of morphological features such as the gross tumour volume. However, being able to predict the survival of a patient from a single time point would be highly valuable as it would help in identifying prognostic features from radiology images, in addition to bringing valuable information at a low price in clinical trials. Features can be here understood as a very broad term which can refer to various characteristics of the tumor environment, such as its size, the density of the tumor, or even textural information extracted from the tumor zone. Designing those ​features​ requires advanced medical knowledge, and a lot of experience in prognosis / medical models. Hence, a large number of informative features have been identified by groups of medical experts in radio-oncology, related to colour or density heterogeneity, compactness, and many others.

# Goal

The challenge proposed by Owkin is a supervised survival prediction problem: predict the survival time of a patient (remaining days to live) from one three-dimensional CT scan (grayscale image) and a set of pre-extracted quantitative imaging features, as well as clinical data. To each patient corresponds one CT scan, and one binary segmentation mask. The segmentation mask is a binary volume of the same size as the CT scan, except that it is composed of zeroes everywhere there is no tumour, and 1 otherwise. This segmentation mask was built based on annotations from radiologists, who delineated by hand the tumor(s). Refer to the data section for further details of the data. The CT scans and the associated segmentation masks are subsets of two public datasets:
1. NSCLC Radiomics (subset of 285 patients)
2. NSCLC RadioGenomics(subset of 141 patients)

Both datasets can be found on The Cancer Imaging Archive. The endpoint of this challenge is the survival time in days. To this end, both training and validation contain for each patient, the time to event (days), as well as the censorship. Censorship indicates whether the event (death) was observed or whether the patient escaped the study: this can happen when the patient’s track was lost, or if the patient died of causes not related to the disease.

# Data

## Labels

Our problem is a supervised one as we have survival time (endpoint) over the whole training set. Labels for the train and valid datasets are in the file train/output.csv : survival time in days and censorship (1=death observed, 0=escaped from study).

## Inputs

For each patient, we provided four inputs:

1. Images (one scan and one mask per patient): we furnish crops of both the scan and the mask, centered around the tumor region. Those crops are of fixed 92^3 mm^3 size.
2. Radiomics features (an ensemble of 53 quantitative features per patient, extracted from the scan).
3. Clinical data

#### Images

Images are stored in the image folder, following the format: [patient_id].npz. Each .npz file contains both scan and mask (3-d arrays), accessible by :

import numpy as np
scan = archive['scan']


Radiomics features are stored at the root of the dataset folder in the file Radiomic_features.csv , where each row contains the 53 feature values, for each patient. Those features were extracted in order to ease the task for those whose time is limited. However, several studies point out that such hand-defined features can be **highly biased** and **suboptimal** for various tasks. The complete list of extracted features can be found in the zip file under the name *appendix.txt*. Those features were extracted using the pyradiomics python library which follows international standards on those features’ definition.

#### Clinical data

Clinical data contains basic meta-information for each patient. Namely, each patient is described by its age, the source data it originates from, and the TNM staging of the cancer. TNM staging is a known prognostic biomarker for survival (see the *Baseline* document for further description).

## Outputs

Outputs must be positive floats which represent the predicted survival time in days. The file train/output.csv simply contains two columns, indexed by patient id: one column represents the observed time to event, while the other represents the censorship (1=event observed is death, 0=event observed is the last time patient was seen alive, the patient then escaped the study).

import pandas as pd
p0 = train_output.loc[202]
print(p0.Event) # prints 1 or 0
print(p0.SurvivalTime)
# prints time to event (time to death or time to last known alive) in days


Solutions file on test input data submitted by participants shall follow the same format as the train output file with a nan-filled event column . This empty column is mandatory since input target and output predictions should follow the exact same format. Both datasets were mixed and shuffled before being split into three subsets:

1. Training set (70%)
2. Public test set (15%)
3. Private test set (15%)

In total the training set contains

 Train set Test set 1(public) Test set 2 (private) Patients from dataset 1 199 42 43 Patients from dataset 2 101 20 20

We would like to highlight the fact that 63 patients is a really small test set and therefore variability between the two test sets can be expected, especially if you are overfitting the first test set by using a lot of submissions. However, the small size of our datasets is a real life problem that we have to deal with and being able to overcome this barrier is a crucial point. A second very important source of poor transfer of performance between test sets is the disparity between centers. It would be wise to take into account that two centers, potentially really heterogenous, are present in the challenge.

# Baseline model for survival regression on NSCLC clinical data

## Model

### Baseline hazard

The model is a Cox proportional hazard (Cox-PH) model (see this description of the Cox proportional hazards model).
The model regresses the proportional risk (individual risk factor) of each patient in the dataset, based on the TNM staging (see wikipedia) of this patient at t=0. The proportional risk at time $t$ is formulated as follows:

$h(t|x)=h_0(t) e^{\beta^T x}$

where $h_0(t)$ is the baseline proportional risk factor, $x$ is a vector of covariates, and $\beta$ is the vector of fitted coefficients whose entries measure the impact of each covariate on the survival function.
A positive coefficient $\beta_k$ indicates a worse prognosis and a negative coefficient indicates a protective effect of the variable $x_k$ with which it is associated. They are the parameters one wants to find.

### Hazard ratio

The effects of covariates does not change over time, only the baseline hazard function does. It is the non parametric part of the model, it can take any form. However it vanishes when we take the ratio of hazards between two patients. The hazard ratio measures which patient is the most at risk and is constant over time (hence proportional hazards). Indeed, by taking the ratio of hazard functions of two patients, the baseline hazard $h_0(t)$ vanishes :

$\frac{h(t|x_i)}{h(t|x_j)}=e^{\beta^T (x_i - x_j)}$

This notion of proportional hazards imply hazard curves cannot cross and are proportional. In other words, if an individual has a risk of death at some initial time point that is twice as high as that of another individual, then at all later times the risk of death remains twice as high.

## Data

This baseline is trained on a selection of features from both clinical data file and radiomics file. A Cox-PH model was fitted on

1. Tumor sphericity, a measure of the roundness of the shape of the tumor region relative to a sphere, regardless its dimensions (size).
2. The tumor's surface to volume ratio is a measure of the compactness of the tumor, related to its size.
3. The tumor's maximum 3d diameter The biggest diameter measurable from the tumor volume
4. The dataset of origin
5. The N-tumoral stage grading of the tumor describing nearby (regional) lymph nodes involved
6. The tumor's joint entropy, specifying the randomness in the image pixel values
7. The tumor's inverse different, a measure of the local homogeneity of the tumor
8. The tumor's inverse difference moment is another measurement of the local homogeneity of the tumor

A finer definition of those features can be found here

# Performances

The baseline proposed reaches a Concordance Index (CI) of 0.691 on the public test set.

## Metric

The concordance index (C-index) is a generalization of the Area Under the Curve (or AUC) which can take into account censored data. This metric is commonly used to evaluate and compare predictive models with censored data. The C-index evaluates whether predicted survival times are ranked in the same order as their true survival time. Indeed because of censored data, the actual ordering of patient survival is not always available. A pair of patient is said “admissible” if patients can be ordered. The pair $(i, j)$ is admissible if patients $i$ and $j$ are not censored, or if patient $i$ dies at $t = k$ and patient $j$ is censored at $t > k$ . On the contrary, if both patients are censored or if patient $i$ died at $t = k′$ and patient $j$ is censored at $t < k′$ , the pair is not admissible. Indeed in that last case, patient $j$ may or may not have died before $k’$ .

The C-index is then defined as follows:

$\mathrm{C}_{\mathrm{index}}=\frac{\text{Number of concordant pairs}}{\text{Number of admissible pairs}}$

where concordant pairs are the admissible pairs of patients that are correctly classified. A score of 0.5 represents a random prediction and a score of 1.0 represents perfect predictions.

Note that the C-index does not compare absolute predicted survival times like usual regression metrics, but only relative survival times, by comparing ordering of the predictions and ordering of the true survival times.

#### Files

Files are accessible when logged in and registered to the challenge

#### The challenge provider

Machine learning for medical research