Challenge data

Description

Teaching challenge

Chemistry

Biology

Classification

Texts

Less than 10MB

Basic level

Dates

Started on Oct. 24, 2024

Challenge context

💊 A drug is a chemical substance designed to interact with specific biological targets to treat, cure, or prevent diseases, with its main objective being efficacy on the intended target (usually a protein). However, for a drug to be successful, it must also meet crucial ADMET conditions: Absorption, Distribution, Metabolism, Excretion, and Toxicity. These properties determine how the drug is absorbed and distributed in the body, how it's metabolized and excreted, and whether it exhibits any toxic effects. A drug may have strong efficacy, but without favorable ADMET properties, it may fail to be safe or effective in patients.

🤖 Machine learning is increasingly valuable in optimizing ADMET properties during drug development. Traditional experimental approaches to assess ADMET are time-consuming, costly, and require significant resources. Machine learning models can analyze vast datasets of chemical compounds and their associated ADMET profiles, identifying patterns and predicting how new drug candidates will behave in terms of absorption, distribution, metabolism, excretion, and toxicity. By leveraging these predictive models, researchers can rapidly screen large libraries of compounds, prioritize those with the most favorable ADMET properties, and optimize lead compounds more efficiently. This reduces the likelihood of late-stage failures, accelerates the drug discovery process, and lowers development costs.

Challenge goals

The challenge is a supervised multi-label classification problem: for each molecule, predict its ADMET profile on three properties. For each property, a label of 1 indicates that the property falls within an optimal range, while a label of 0 indicates that it does not. These binary labels are derived by applying specific thresholds to the experimental data used to build the dataset. Participants are asked to develop predictive models that classify compounds accordingly, helping to identify candidates with favorable ADMET profiles and making drug discovery more efficient.

In the attached X_train.csv and X_test.csv files, molecules are represented using the SMILES (Simplified Molecular Input Line Entry System) format, which encodes molecular structures as strings. While it's possible to work directly with SMILES, they can also be converted into molecular graphs or molecular fingerprints. Fingerprints are binary vectors that capture the presence or absence of specific chemical features, such as functional groups, atom types, bonds, or other molecular properties. This is a common method for encoding the chemical information of a molecule, and in the introductory notebook, fingerprints were used to generate the baseline score.
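To make the idea of a binary fingerprint concrete, here is a minimal sketch that needs only the Python standard library. It is deliberately not a Morgan fingerprint, which encodes atom neighborhoods and is usually computed with a cheminformatics toolkit such as RDKit. Instead it hashes short substrings of the SMILES text into a fixed-length bit vector. The function name `toy_fingerprint` and its parameters `n_bits` and `max_len` are illustrative assumptions, not part of the challenge material.

```python
import hashlib


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> list:
    """Hash every substring of length 1..max_len into a fixed-size bit vector.

    Illustrative stand-in only: it works on the SMILES characters, not on the
    molecular graph, so it is NOT equivalent to a Morgan fingerprint.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1  # bit set means "some fragment hashed here is present"
    return bits


ethanol = toy_fingerprint("CCO")
acetic_acid = toy_fingerprint("CC(=O)O")
print(sum(ethanol), sum(acetic_acid))  # number of bits set in each vector
```

Each molecule becomes a vector of the same length regardless of its size, which is what lets a standard classifier consume it. A real fingerprint would replace the substring loop with features computed from the molecular structure.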


Data description

You have five files at your disposal:

X_train.csv: Contains the 1940 unique IDs of the training molecules along with their corresponding SMILES representations.
y_train.csv: Contains the 1940 unique IDs of the training molecules and their associated labels for the three properties (Y1, Y2, and Y3).
X_test.csv: Contains the 486 unique IDs of the test molecules along with their SMILES representations.
random_submission_example.csv: Provides an example of a submission in the correct format.
supplementary_files: Includes a notebook introducing the challenge and replicating the baseline score.

The tabular files contain the following columns:

ID: Unique identification numbers for the molecules.
SMILES: SMILES representation of the molecules.
Y1: Labels for the first property.
Y2: Labels for the second property.
Y3: Labels for the third property.
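As a rough illustration of how the files fit together, the sketch below joins SMILES strings with their labels through the shared ID column. It uses the column names listed above, but the two rows of data are invented for the example, and it assumes the files are comma-separated with a header row. The helper `read_rows` is hypothetical.

```python
import csv
import io

# Hypothetical two-row excerpts in the described column layout.
# With the real files, open X_train.csv and y_train.csv instead.
x_train_csv = "ID,SMILES\n0,CCO\n1,CC(=O)O\n"
y_train_csv = "ID,Y1,Y2,Y3\n0,1,0,1\n1,0,0,1\n"

LABEL_COLUMNS = ("Y1", "Y2", "Y3")


def read_rows(text):
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


# Index the labels by molecule ID, then attach them to each SMILES string.
labels_by_id = {
    row["ID"]: [int(row[column]) for column in LABEL_COLUMNS]
    for row in read_rows(y_train_csv)
}
dataset = [
    (row["SMILES"], labels_by_id[row["ID"]])
    for row in read_rows(x_train_csv)
]
print(dataset)
```

Joining on ID rather than relying on row order guards against the two files being sorted differently.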

Benchmark description

Metric

The metric is the micro-averaged F1 score, commonly used in multi-class and multi-label classification tasks. It combines precision and recall into a single score and, because it ignores true negatives, remains informative when positive and negative labels are imbalanced. Here's how it works:

Precision: The ratio of true positive predictions to the total number of positive predictions made (true positives + false positives).
Recall: The ratio of true positive predictions to the total number of actual positive instances (true positives + false negatives).
F1 Score: The harmonic mean of precision and recall, which balances these two metrics.
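Written with the usual counts of true positives (TP), false positives (FP), and false negatives (FN), the three definitions above correspond to the following formulas. The symbols are standard notation introduced here for illustration, not taken from the challenge page.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```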

Micro-Averaging:
When dealing with multiple classes or labels, micro-averaging calculates the metric by summing up the individual true positives, false positives, and false negatives across all classes, and then calculating precision and recall from these aggregated counts.
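The pooling step can be sketched in a few lines of plain Python. This assumes that labels are 0/1, that 1 is the positive class, and that counts are pooled over all three properties, which matches the description above but is not confirmed against the official scoring code. The function name `micro_f1` and the toy label matrices are illustrative.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label 0/1 data given as lists of rows."""
    tp = fp = fn = 0
    # Pool the counts over every molecule and every label before computing F1.
    for true_row, pred_row in zip(y_true, y_pred):
        for truth, prediction in zip(true_row, pred_row):
            if prediction == 1 and truth == 1:
                tp += 1
            elif prediction == 1 and truth == 0:
                fp += 1
            elif prediction == 0 and truth == 1:
                fn += 1
    denominator = 2 * tp + fp + fn
    # Convention chosen here: return 0.0 when there are no positives at all.
    return 2 * tp / denominator if denominator else 0.0


# Toy example: three molecules, three properties each (Y1, Y2, Y3).
y_true = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
score = micro_f1(y_true, y_pred)
print(score)  # 4 TP, 1 FP, 1 FN -> 8 / 10 = 0.8
```

Because the counts are pooled before the ratio is taken, a label with many positive molecules contributes more to the score than a label with few.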


Benchmark

A baseline score was obtained by converting the SMILES strings to Morgan fingerprints and then training a scikit-learn decision tree classifier with its default parameters. You can reproduce this score with the introductory notebook associated with the challenge.

Files

Files are accessible once you are logged in and registered for the challenge.

The challenge provider

ENS - PSL | Laboratoire PASTEUR

ADMET property prediction for drug discovery
by ENS - PSL
