Challenge Data

ADMET property prediction for drug discovery
by ENS - PSL




Description


Teaching challenge
Chemistry
Biology
Classification
Texts
Less than 10MB
Basic level

Dates

Started on Oct. 24, 2024


Challenge context

💊 A drug is a chemical substance designed to interact with specific biological targets to treat, cure, or prevent diseases, with its main objective being efficacy on the intended target (usually a protein). However, for a drug to be successful, it must also meet crucial ADMET conditions: Absorption, Distribution, Metabolism, Excretion, and Toxicity. These properties determine how the drug is absorbed and distributed in the body, how it's metabolized and excreted, and whether it exhibits any toxic effects. A drug may have strong efficacy, but without favorable ADMET properties, it may fail to be safe or effective in patients.

🤖 Machine learning is increasingly valuable in optimizing ADMET properties during drug development. Traditional experimental approaches to assess ADMET are time-consuming, costly, and require significant resources. Machine learning models can analyze vast datasets of chemical compounds and their associated ADMET profiles, identifying patterns and predicting how new drug candidates will behave in terms of absorption, distribution, metabolism, excretion, and toxicity. By leveraging these predictive models, researchers can rapidly screen large libraries of compounds, prioritize those with the most favorable ADMET properties, and optimize lead compounds more efficiently. This reduces the likelihood of late-stage failures, accelerates the drug discovery process, and lowers development costs.


Challenge goals

The challenge is a supervised multi-label classification problem: for each molecule, predict three binary ADMET properties. For each property, a value of 1 indicates that the property falls within an optimal range, while a value of 0 indicates that it does not. These binary labels were derived by applying specific thresholds to the experimental data used to generate the dataset. Participants are asked to develop predictive models that classify compounds accordingly, helping to identify candidates with favorable ADMET profiles and making the drug discovery process more efficient.

In the attached X_train.csv and X_test.csv files, molecules are represented using the SMILES (Simplified Molecular Input Line Entry System) format, which encodes molecular structures as strings. While it's possible to work directly with SMILES, they can also be converted into molecular graphs or molecular fingerprints. Fingerprints are binary vectors that capture the presence or absence of specific chemical features, such as functional groups, atom types, bonds, or other molecular properties. This is a common method for encoding the chemical information of a molecule, and in the introductory notebook, fingerprints were used to generate the baseline score.
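To make the idea of a binary fingerprint concrete, here is a minimal sketch in pure Python. It is a toy stand-in and not a Morgan fingerprint: it hashes short substrings of the SMILES string instead of atom neighborhoods, so it carries little real chemistry. The function name `toy_fingerprint` and the parameters `n_bits` and `max_len` are assumptions made for this illustration; chemically meaningful fingerprints are typically computed with a cheminformatics toolkit such as RDKit.

```python
import hashlib


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> list[int]:
    """Hash every substring of length 1..max_len into a fixed-size bit vector.

    Toy illustration only: real fingerprints hash atom environments,
    not raw characters of the SMILES string.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # A stable hash (unlike the built-in hash()) gives the same bits on every run.
            digest = hashlib.md5(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1  # bit set = "this fragment is present"
    return bits


ethanol = toy_fingerprint("CCO")
acetic_acid = toy_fingerprint("CC(=O)O")

# Number of set bits in each vector, and how many they share.
shared = sum(a & b for a, b in zip(ethanol, acetic_acid))
print(sum(ethanol), sum(acetic_acid), shared)
```

Each molecule becomes a vector of the same length regardless of its size, which is what lets standard classifiers consume it directly.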



Data description

You have 5 files at your disposal:

  • X_train.csv: contains the 1940 unique IDs of the training molecules along with their corresponding SMILES representations.
  • y_train.csv: contains the 1940 unique IDs of the training molecules and their associated labels for the three properties (Y1, Y2, and Y3).
  • X_test.csv: contains the 486 unique IDs of the test molecules along with their SMILES representations.
  • random_submission_example.csv: provides an example of a submission in the correct format.
  • supplementary_files: includes a notebook introducing the challenge and replicating the baseline score.

You will find the following columns in the tabular files:

  • ID: unique identification numbers for the molecules.
  • SMILES: SMILES representation of the molecules.
  • Y1: labels for the first property.
  • Y2: labels for the second property.
  • Y3: labels for the third property.
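As a rough sketch of how the inputs and labels can be paired through the shared ID column, the snippet below uses only the standard library. It assumes comma-separated files with a header row using exactly the column names listed above; the two inline rows are made up for illustration and are not taken from the dataset. With the real files, the `io.StringIO` objects would be replaced by opened X_train.csv and y_train.csv.

```python
import csv
import io

# Made-up rows for illustration only; the real files hold 1940 training molecules.
x_train_text = "ID,SMILES\n0,CCO\n1,c1ccccc1\n"
y_train_text = "ID,Y1,Y2,Y3\n0,1,0,1\n1,0,0,1\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text)))


# Index the labels by molecule ID so each SMILES can be matched to its targets.
labels_by_id = {
    row["ID"]: [int(row[name]) for name in ("Y1", "Y2", "Y3")]
    for row in read_rows(y_train_text)
}

# One (SMILES, [Y1, Y2, Y3]) pair per training molecule.
dataset = [(row["SMILES"], labels_by_id[row["ID"]]) for row in read_rows(x_train_text)]
print(dataset)
```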


Benchmark description

Metric

The metric is the micro-averaged F1 score, which is commonly used in multi-class and multi-label classification tasks. It combines precision and recall into a single number, which makes it more informative than plain accuracy when positive and negative labels are imbalanced. Here's how it works:

  • Precision: The ratio of true positive predictions to the total number of positive predictions made (true positives + false positives).
  • Recall: The ratio of true positive predictions to the total number of actual positive instances (true positives + false negatives).
  • F1 Score: The harmonic mean of precision and recall, which balances these two metrics.

Micro-Averaging:
When dealing with multiple classes or labels, micro-averaging calculates the metric by summing up the individual true positives, false positives, and false negatives across all classes, and then calculating precision and recall from these aggregated counts.
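The aggregation described above can be sketched in a few lines of plain Python. The function name `micro_f1` and the example label matrices are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not something the challenge specifies.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for binary multi-label data given as lists of 0/1 rows."""
    tp = fp = fn = 0
    # Pool the counts over every molecule and every label before computing any ratio.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention for 0/0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Four hypothetical molecules, three labels each (Y1, Y2, Y3).
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]

score = micro_f1(y_true, y_pred)
print(round(score, 4))  # pooled counts: TP=5, FP=1, FN=2, so F1 = 10/13 ≈ 0.7692
```

Because the counts are pooled before the ratios are taken, every individual label decision carries the same weight, whichever of the three properties it belongs to.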


Benchmark

A baseline score was obtained by converting the SMILES to Morgan fingerprints and then training a Decision Tree algorithm with the default parameters from scikit-learn. You can reproduce this score by using the introduction notebook associated with the challenge.
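The general shape of such a baseline can be sketched as follows. This is not the introduction notebook: random bit vectors stand in for Morgan fingerprints, the labels follow a toy rule, and the 80/20 hold-out split and `random_state=0` are assumptions added so the example runs reproducibly on its own. It relies on scikit-learn's `DecisionTreeClassifier` accepting a two-dimensional label array, so a single tree predicts all three properties at once.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in data: 200 random 128-bit vectors instead of real Morgan fingerprints.
X = rng.integers(0, 2, size=(200, 128))
# Toy rule so there is something to learn: each label copies one input bit.
Y = X[:, :3].copy()

# Hold out 20% of the rows to estimate the score (an assumed evaluation setup).
X_fit, X_val, Y_fit, Y_val = train_test_split(X, Y, test_size=0.2, random_state=0)

# Default hyperparameters apart from the seed, which is fixed for reproducibility.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_fit, Y_fit)  # Y_fit has shape (n_samples, 3): one column per property

Y_hat = model.predict(X_val)
score = f1_score(Y_val, Y_hat, average="micro")
print(f"micro-F1 on the hold-out split: {score:.3f}")
```

With the real data, `X` would hold the fingerprints computed from the SMILES column and `Y` the Y1, Y2, and Y3 columns of y_train.csv.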


Files


Files are accessible when logged in and registered to the challenge


The challenge provider



ENS - PSL | Laboratoire PASTEUR