ENS - PSL | Laboratoire PASTEUR
Started on Oct. 24, 2024
💊 A drug is a chemical substance designed to interact with specific biological targets to treat, cure, or prevent diseases, with its main objective being efficacy on the intended target (usually a protein). However, for a drug to be successful, it must also meet crucial ADMET conditions: Absorption, Distribution, Metabolism, Excretion, and Toxicity. These properties determine how the drug is absorbed and distributed in the body, how it's metabolized and excreted, and whether it exhibits any toxic effects. A drug may have strong efficacy, but without favorable ADMET properties, it may fail to be safe or effective in patients.
🤖 Machine learning is increasingly valuable in optimizing ADMET properties during drug development. Traditional experimental approaches to assess ADMET are time-consuming, costly, and require significant resources. Machine learning models can analyze vast datasets of chemical compounds and their associated ADMET profiles, identifying patterns and predicting how new drug candidates will behave in terms of absorption, distribution, metabolism, excretion, and toxicity. By leveraging these predictive models, researchers can rapidly screen large libraries of compounds, prioritize those with the most favorable ADMET properties, and optimize lead compounds more efficiently. This reduces the likelihood of late-stage failures, accelerates the drug discovery process, and lowers development costs.
The challenge proposed here is a supervised multi-label classification problem: for each molecule, the objective is to predict three binary ADMET properties. For each property, a value of 1 indicates that the property falls within an optimal range, while a value of 0 signifies that it does not. These binary labels are derived by applying specific thresholds to the experimental data used to build the dataset. Participants are tasked with developing predictive models that classify compounds accordingly, which helps identify candidates with favorable ADMET profiles and makes the drug discovery process more efficient.
In the attached X_train.csv and X_test.csv files, molecules are represented using the SMILES (Simplified Molecular Input Line Entry System) format, which encodes molecular structures as strings. While it's possible to work directly with SMILES, they can also be converted into molecular graphs or molecular fingerprints. Fingerprints are binary vectors that capture the presence or absence of specific chemical features, such as functional groups, atom types, bonds, or other molecular properties. This is a common method for encoding the chemical information of a molecule, and in the introductory notebook, fingerprints were used to generate the baseline score.
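To make the idea of a binary presence/absence vector concrete, here is a minimal sketch that uses only the Python standard library. It is a deliberately crude stand-in and not a Morgan fingerprint: it hashes short substrings of the SMILES text rather than real chemical substructures, and the function name `toy_fingerprint` and its parameters are hypothetical. In practice a cheminformatics toolkit would be used to compute proper fingerprints.

```python
import zlib


def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> list[int]:
    """Hash every substring of length 1..max_len into a fixed-size bit vector.

    Assumption: this works on SMILES *text*, not on the molecular graph,
    so it only illustrates the presence/absence encoding.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike the built-in hash().
            index = zlib.crc32(fragment.encode("utf-8")) % n_bits
            bits[index] = 1
    return bits


ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")

# Every text fragment of "CCO" also occurs in "CCCO", so each bit set for
# ethanol is also set for propanol.
shared = all(p == 1 for e, p in zip(ethanol, propanol) if e == 1)
```

Each molecule becomes a vector of the same length regardless of its size, which is what lets standard classifiers consume it directly.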
Five files are at your disposal. The tabular files contain the following columns:
- ID: Unique identification numbers for the molecules.
- SMILES: SMILES representation of the molecules.
- Y1: Labels for the first property.
- Y2: Labels for the second property.
- Y3: Labels for the third property.
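As an illustration of how these columns map onto model inputs and targets, the sketch below parses a small in-memory table with the standard `csv` module. The two rows are invented for the example and are not taken from the challenge data, and it assumes a comma-separated layout with all five columns in one table, which may not match how the real files are split.

```python
import csv
import io

# Invented rows for illustration only; the real files may separate
# inputs (ID, SMILES) from labels (Y1, Y2, Y3).
sample = io.StringIO(
    "ID,SMILES,Y1,Y2,Y3\n"
    "0,CCO,1,0,1\n"
    "1,c1ccccc1,0,1,0\n"
)

label_columns = ("Y1", "Y2", "Y3")
ids, smiles, labels = [], [], []
for row in csv.DictReader(sample):
    ids.append(row["ID"])
    smiles.append(row["SMILES"])
    # One row of three 0/1 values per molecule: the multi-label target.
    labels.append([int(row[name]) for name in label_columns])
```

The result is one SMILES string and one three-element label row per molecule, the shape a multi-label classifier expects.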
The metric used is the micro-averaged F1 score, which is commonly used in multi-class and multi-label classification tasks. It combines precision and recall into a single number. Because it pools counts over all labels, every individual label decision carries the same weight, so labels with more positive examples influence the score more. Here's how it works:
Micro-Averaging:
When dealing with multiple classes or labels, micro-averaging calculates the metric by summing up the individual true positives, false positives, and false negatives across all classes, and then calculating precision and recall from these aggregated counts.
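The aggregation described above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the challenge's official scoring code; the function name `micro_f1` and the toy labels are assumptions made for the example.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data given as rows of 0/1 values."""
    tp = fp = fn = 0
    # Pool the counts over every molecule and every label.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for three molecules and three properties (Y1, Y2, Y3).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]

# Pooled counts: tp = 3, fp = 1, fn = 2, so precision = 0.75 and recall = 0.6.
score = micro_f1(y_true, y_pred)
```

With these toy values the score is 2 · 0.75 · 0.6 / (0.75 + 0.6) = 2/3, the same as 2·TP / (2·TP + FP + FN).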
A baseline score was obtained by converting the SMILES to Morgan fingerprints and then training a scikit-learn decision tree classifier with its default parameters. You can reproduce this score with the introductory notebook associated with the challenge.
The files are accessible once you are logged in and registered for the challenge.