Challenge Data

Detecting PIK3CA mutation in breast cancer

Started on Jan. 6, 2023

Challenge context

🔬 Histopathology

Histopathology is the study of the microscopic structure of diseased human tissue. Analysis of histopathology slides is a critical step for many diagnoses, particularly in oncology, where it defines the gold standard. Tissue samples are usually collected during surgery or biopsy. After the samples have been preprocessed by expert technicians, pathologists review them under a microscope to assess several biomarkers, such as the nature of the tumor and the cancer stage.

🧬 PIK3CA mutation in breast cancer

Recent studies have shown that histopathology slides contain information that underlies the tumor genotype, and can therefore be used to predict genomic alterations such as point mutations. One genomic alteration of particular interest is the PIK3CA mutation in breast cancer. PIK3CA mutations occur in around 30%-40% of breast cancers and are most commonly found in estrogen receptor-positive tumors. They have been associated with good outcomes. More importantly, patients who carry these mutations and are resistant to endocrine therapy may respond to a class of targeted therapy, the PI3Kα inhibitors.

🔍 Challenge's purpose

The current method for identifying PIK3CA mutations is DNA sequencing, which requires technical and bioinformatic expertise that is not available in all laboratories. An automated solution to detect PIK3CA mutations has high clinical relevance: it could provide a fast, reliable screening tool, making more patients, especially in tertiary centers, eligible for the personalized therapies associated with better outcomes.

Challenge goals

The challenge proposed by Owkin is a weakly-supervised binary classification problem. Weak supervision is crucial in digital pathology because of the extremely large dimensions of whole-slide images (WSIs), which cannot be processed as is. To use standard machine learning algorithms, one extracts, for each slide, smaller images (called tiles) of size 224x224 pixels (approximately 112 µm x 112 µm). Since a slide is given a single binary annotation (presence or absence of mutation) and is mapped to a bag of tiles, one must learn a function that maps multiple items to a single global label. This framework is known as multiple-instance learning (MIL). More precisely, if at least one of the pooled tiles exhibits a mutation pattern, presence of mutation is predicted; if none of the tiles exhibits the pattern, absence of mutation is predicted. This approach alleviates the burden of obtaining locally annotated tiles, which can be costly or impractical for pathologists.
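The MIL decision rule described above (predict "mutated" if any tile in the bag exhibits the pattern) corresponds to max-pooling over per-tile scores. A minimal sketch, assuming some upstream model has already produced a mutation score per tile:

```python
import numpy as np

def predict_slide(tile_scores: np.ndarray) -> float:
    """Max-pooling MIL: the slide-level mutation probability is the
    highest per-tile mutation score in the bag of tiles."""
    return float(np.max(tile_scores))

# Toy bag of 5 tile scores: one tile strongly exhibits the pattern,
# so the whole slide is scored as likely mutated.
bag = np.array([0.05, 0.10, 0.92, 0.20, 0.08])
print(predict_slide(bag))  # 0.92
```

Max-pooling is only one possible pooling operator; mean-pooling or attention-based pooling are common alternatives in the MIL literature.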

In this challenge, we aim to predict whether a patient has a mutation of the gene PIK3CA, directly from a slide. For computational purposes, we kept at most 1,000 tiles per WSI. Each tile was selected so that it contains tissue.

Here we display an example of whole slide image with 1,000 tiles highlighted in black.


Figure 1: Example of a whole slide image with the 1,000 tiles selected during preprocessing highlighted in black

Some of those tiles are displayed below. The coordinates are indicated in parentheses for each tile.


Figure 2: Example of 224x224 pixels tiles extracted at a 20x magnification with their (x, y)-coordinates

Data description


At the tissue-sample scale, our problem is a supervised one, as we have mutation data over the whole training set. Labels for the training set are in train_output.csv (0 = wild-type and 1 = mutated). At the tile scale, the problem is a weakly supervised one, as we have one label per bag of tiles.
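The label file can be loaded in the usual way; the sketch below parses a hypothetical two-row excerpt (the sample IDs shown are illustrative, not taken from the real file):

```python
import pandas as pd
from io import StringIO

# Hypothetical excerpt of train_output.csv; the real file has one row
# per training sample, with 0 = wild-type and 1 = mutated.
csv_text = "Sample ID,Target\nID_001.npy,0\nID_002.npy,1\n"

labels = pd.read_csv(StringIO(csv_text))
print(labels["Target"].tolist())  # [0, 1]
```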


For each patient, we provide three types of input:

  • the set of (at most 1,000) tile images, randomly chosen inside the tissue, as .jpg files
  • the feature vectors extracted from each tile using a pre-trained ResNet model
  • metadata related to the original slide image.


The image folder (images) contains one folder per sample, named sample_id, which holds RGB images of size 224x224x3 (i.e. 3D matrices) stored as .jpg files.


Each folder contains up to 1,000 tiles. The whole-slide images used in this challenge originally come from the TCGA-BRCA dataset.

Note that the use of additional external data is prohibited, to ensure fairness among participants in the challenge.

MoCo v2 features

The feature folder (moco_features) contains one matrix per sample, named [sample_id].npy. This matrix is of size Nₜ x 2,051, with Nₜ the number of tiles for the given sample. The first column is the zoom level, the second and third are the (x, y) coordinates of the tile in the slide, and the last 2,048 columns are the actual MoCo features.

Each matrix has up to 1,000 rows (one for each tile in the corresponding image folder).

The MoCo v2 features have been extracted using a Wide ResNet-50-2 pre-trained on TCGA-COAD. We provide these features to help participants who do not have the computing resources or time to train directly from images.
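The column layout described above can be split as follows. This sketch uses a synthetic matrix in place of an actual `np.load("moco_features/[sample_id].npy")` call, since the file paths depend on the downloaded data:

```python
import numpy as np

# In practice: mat = np.load("moco_features/[sample_id].npy")
# Here we simulate a sample with N_t = 1000 tiles to show the layout.
n_tiles = 1000
mat = np.random.rand(n_tiles, 2051)

zoom_levels = mat[:, 0]    # column 0: zoom level
coordinates = mat[:, 1:3]  # columns 1-2: (x, y) position of the tile in the slide
features = mat[:, 3:]      # columns 3..2050: 2048-d MoCo v2 feature vectors

print(features.shape)  # (1000, 2048)
```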


Additionally, metadata are provided as csv files (train_metadata.csv and test_metadata.csv) containing the following columns: "Sample ID"; "Center ID", which indicates the hospital of origin; and "Patient ID", the unique identifier of the patient (some patients may have several slides).


Outputs must be float numbers between 0.0 and 1.0 representing the probability of PIK3CA mutation. The train output file train_output.csv consists of two comma-separated columns, named "Sample ID" and "Target" in the first line (header), where "Sample ID" is the unique identifier of the sample and "Target" indicates the presence (1) or absence (0) of the PIK3CA mutation. Rows are sorted by increasing sample ID.

The solution file submitted by participants on the test input data shall follow the same format as the train output file, but with the "Sample ID" column containing the test sample IDs, similarly sorted by increasing value using three digits with zero padding (e.g. ID_003.npy), and with float numbers (mutation probabilities) in the "Target" column.

       Sample ID
0      ID_003.npy
1      ID_004.npy
2      ID_008.npy
...    ...
147    ID_493.npy
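Building a submission in this format is straightforward with pandas. The predictions and output filename below are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical predicted mutation probabilities, keyed by test sample ID
# (zero-padded IDs, as in the expected submission format).
predictions = {"ID_008.npy": 0.12, "ID_003.npy": 0.41, "ID_004.npy": 0.73}

# Sort by Sample ID so rows appear in increasing ID order, then write the csv.
submission = pd.DataFrame(sorted(predictions.items()),
                          columns=["Sample ID", "Target"])
submission.to_csv("test_output.csv", index=False)
print(submission)
```

Because the IDs are zero-padded to three digits, plain lexicographic sorting of the strings yields the required increasing order.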


The test data are split for evaluation purposes between two test sets of (almost) equal size: test set 1, used to establish the public ranking, and test set 2, used to establish the final and intermediary rankings.

                                                     Training set   Public test set   Private test set
Number of patients (samples)                         305 (344)      76 (76)           73 (73)
Number of patients (samples) with PIK3CA mutations   112 (128)      ?                 ?

Note on heterogeneity

We would like to highlight two things:

  • 73 patients is a very small test set, so variability between the two test sets is to be expected, especially if you overfit the first test set by using a lot of submissions. However, the small size of our datasets is a real-life problem that we have to deal with, and being able to overcome this barrier is a crucial point.

  • A second very important source of poor transfer is the disparity between centers. Indeed, the data originate from 5 different centers. The training set comprises data from 3 centers, while the test sets contain data from the 2 remaining ones. It would be wise to take this source of heterogeneity into account.


The metric for the challenge is the Area Under the ROC Curve (AUC), computed as follows:

AUC = (1 / (N₁N₀)) Σ_{i∈I₁} Σ_{j∈I₀} 𝟙(ỹᵢ > ỹⱼ)

where I₁ is the set of indices of the N₁ patients with label 1 (presence of mutation), I₀ the set of indices of the N₀ patients with label 0, and ỹᵢ the predicted mutation probability for sample i. A score of 0.5 corresponds to random predictions and a score of 1.0 to perfect predictions.
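The pairwise definition above can be computed directly: count the fraction of (positive, negative) pairs in which the positive sample receives the higher score. A minimal sketch (note that, like the strict inequality in the formula, ties contribute 0 here, whereas some implementations count them as 0.5):

```python
import numpy as np

def auc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets a strictly higher predicted probability."""
    pos = y_pred[y_true == 1]  # predictions for the N1 mutated samples
    neg = y_pred[y_true == 0]  # predictions for the N0 wild-type samples
    # Broadcast to compare every positive against every negative.
    return float(np.mean(pos[:, None] > neg[None, :]))

y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.4, 0.3, 0.6])
print(auc(y, p))  # 0.75 (3 of the 4 positive/negative pairs are ordered correctly)
```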

Benchmark description

Our baseline model is a logistic regression whose input, for each slide, is the mean over the tile axis (axis 0) of the tile feature vectors: all the feature vectors of a slide are averaged into a single vector. The feature extractor used to obtain these vectors is a Wide ResNet-50-2 trained on TCGA-COAD with the contrastive self-supervised method MoCo v2 (see the associated paper for more information on the benefits of self-supervision for feature extraction on histology images). Our baseline model reaches an AUC of 60.2% on the public test set. The implementation of the baseline is provided in the supplementary files (baseline.ipynb).
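The mean-pooling baseline can be sketched in a few lines. The data below are synthetic stand-ins (random feature bags and alternating labels) used only to show the shape of the pipeline, not the actual challenge data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 20 slides with a variable number
# of tiles, each tile described by a 2048-d feature vector (random here).
bags = [rng.normal(size=(int(rng.integers(500, 1001)), 2048)) for _ in range(20)]
labels = np.array([0, 1] * 10)  # 0 = wild-type, 1 = mutated

# Mean-pool each bag along axis 0: one 2048-d vector per slide.
X = np.stack([bag.mean(axis=0) for bag in bags])

clf = LogisticRegression(max_iter=1000).fit(X, labels)
probabilities = clf.predict_proba(X)[:, 1]  # predicted mutation probabilities
```

Mean-pooling discards tile-level localization, so MIL architectures with learned pooling are a natural way to try to beat this baseline.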

