Challenge Data

Learning biological properties of molecules from their structure
by Simulations Plus



Description


Dates

Started on Jan. 5, 2022



Challenge context

Simulations Plus conducts scientific research in the areas of property prediction and simulation relevant to pharmaceutical R&D. We specialize in what is called predictive ADMET (Absorption, Distribution, Metabolism, Elimination and Toxicity of chemical compounds in biological organisms), PBPK (Physiologically-Based Pharmacokinetics), pharmacometrics, and quantitative systems pharmacology/toxicology. The results of our research are turned into useful software tools, used by pharmaceutical scientists in industry, academia, and government agencies, as well as into consulting services. More details can be found at our website.


Challenge goals

The goal is to discover the biological properties of new chemical compounds using already existing experimental data.

The current cost of bringing a new drug to market is huge, reaching 2.0 billion US dollars and 10-15 years of continuous research. The desire to eliminate many of these costs has accelerated the emergence and acceptance of the science of cheminformatics. Based on the premise that "similar chemicals have similar properties", one takes existing experimental data Y and builds statistical correlative models that map structures of chemical compounds to the observed Y values. The property Y of a novel chemical compound then no longer has to be measured: one simply draws the structure of a completely new molecule on the computer screen and submits it to the correlative model to predict Y.

Computers cannot perceive chemical structures (atoms plus interatomic connectivity) the way human chemists do, so a translation of chemical structures into terms understandable by computers is necessary. Sophisticated algorithms exist that take molecular connectivity tables and, sometimes, 3D atomic coordinates and generate molecular descriptors – numeric variables that describe molecular structures. Our software calculates the same set of N molecular descriptors per compound, where N is on the order of several hundred. Collecting the properly aligned descriptor vectors for M chemical compounds, each with a known observed Y value, forms an MxN training matrix X. Since raw values of different molecular descriptors are calculated on different scales, normalization to a common scale (e.g., -1 to 1) is required prior to modeling. Not all descriptors provide meaningful input into a successful Y = f(X) model, so choosing the “right” descriptors for modeling (a.k.a. feature selection) is the first critical step in model building. Once a suitable subset of the N columns is chosen, the corresponding reduced training matrix, along with the M-dimensional vector Y, is submitted to a model training algorithm (e.g., machine learning). Model performance is then evaluated on an independent, external test set of T compounds encoded with the same set of N descriptors.
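The normalization and feature-selection steps above can be sketched as follows. This is an illustrative example on random stand-in data: the correlation-based selector is our own simple substitute, not the organizers' actual feature-selection method.

```python
import numpy as np

def scale_to_unit_interval(X):
    """Min-max scale each descriptor column to [-1, 1], as described in the text."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0

def select_by_correlation(X, y, k):
    """Keep the k descriptors most correlated (in absolute value) with Y.
    An illustrative stand-in for feature selection, not the organizers' method."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[::-1][:k]

# Toy demonstration: M=20 compounds, N=8 descriptors, Y driven by descriptor 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
y = X[:, 2] + 0.1 * rng.normal(size=20)

Xs = scale_to_unit_interval(X)        # all columns now in [-1, 1]
cols = select_by_correlation(Xs, y, k=3)  # descriptor 2 should rank highly
```

Because min-max scaling is a linear transform of each column, it does not change Pearson correlations, so selection can equally be done before or after normalization.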


Data description

For this challenge we have prepared an input training matrix of 1087 chemical compounds, each with 295 molecular descriptors, in CSV format (input_training.csv). While 1087 data points may seem very small in comparison to, e.g., weather data, keep in mind that each and every one of these 1087 values is the result of an expensive and laborious experiment; it is fair to say the data set was obtained with an investment of many millions of euro. The compounds are distinguished by simple consecutive _ID numbers in the first column (from 1 to 1087). Columns 2-296 contain non-consecutively labeled molecular descriptors X1 – X339. Values in all columns have been normalized to the [-1, 1] interval. Another file (output_training.csv) contains a column of 1087 observed values of a biological property Y.
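A minimal sketch of the file layout just described, using a tiny synthetic stand-in (3 compounds, 4 descriptors) since the real input_training.csv is only available to registered participants; the specific descriptor labels below are made up:

```python
import csv
import io

# Build a toy CSV in the described format: _ID column, then
# non-consecutively labeled descriptors, values in [-1, 1].
header = ["_ID", "X1", "X3", "X7", "X12"]
rows = [
    [1, -0.5,  0.1, 0.9, -1.0],
    [2,  0.3, -0.2, 0.0,  0.7],
    [3, -1.0,  1.0, 0.4,  0.2],
]
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)

# Read it back the way one would read the real file.
buf.seek(0)
reader = csv.reader(buf)
cols = next(reader)                                   # header row
data = [[float(v) for v in row] for row in reader]    # numeric rows
```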

The property Y in this challenge is related to the interaction of chemical compounds with living organisms and as such is of great importance to the pharmaceutical industry. To prevent some participants from gaining an unfair advantage, the exact nature of property Y is concealed, as are the names of the compounds and molecular descriptors: participants with access to special resources could otherwise, for example, measure additional Y values in their labs or obtain additional data from the scientific literature. All information about Y and the molecular descriptors will be revealed after the challenge concludes.

The training set will be used for model building, and participants should guard against overtraining; note that “leave-one-out” cross-validation is not recommended for this purpose; see A. Golbraikh, A. Tropsha, J. Mol. Graphics Modell. 2002, 20, 269-276 for reasons why this method fails here.
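One common alternative to leave-one-out is shuffled k-fold cross-validation, sketched below; the choice of k and the random seed are illustrative, not prescribed by the organizers:

```python
import numpy as np

def kfold_indices(m, k, seed=0):
    """Shuffled k-fold split of m sample indices into (train, test) pairs.
    Each index appears in exactly one test fold."""
    idx = np.random.default_rng(seed).permutation(m)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

# 5-fold split of 10 samples: five (train of 8, test of 2) pairs.
splits = kfold_indices(10, 5)
```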

We have prepared another set of 272 compounds intended as the external test set for model validation (input_testing.csv). It contains a matrix with 272 rows and 295 columns – the latter are the same molecular descriptors, labeled X1 – X339, as in the training set. The compounds are distinguished by continued _ID numbering (1088 - 1359). Observed values for the test compounds are not provided to participants; these will be used by the challenge organizers for model evaluation. Participants will run their best models on this set and submit the resulting 272 predictions, with the same _ID labels and in the same order. An example random submission is provided to participants so that they can adapt to the submission format.
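A sketch of how such a random submission could be generated. The layout is inferred from the text (one prediction per test compound, _ID running from 1088 to 1359, in file order); the exact column titles and value range expected by the organizers are assumptions here, so check the provided example file:

```python
import csv
import io
import random

random.seed(42)  # reproducible random predictions

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["_ID", "Y"])  # assumed header; verify against the example file
for compound_id in range(1088, 1360):   # 272 test compounds
    # Random placeholder prediction; real submissions come from a model.
    writer.writerow([compound_id, round(random.uniform(-1.0, 1.0), 6)])

submission = buf.getvalue()
```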

Remark. The officially recommended “ID” title for the first column triggers a well-known bug in Microsoft Excel, which misinterprets a CSV file beginning with “ID” as a SYLK file on import. The leading underscore in _ID has been added to prevent this.


Benchmark description

During the challenge, participants will submit test set predictions and the organizers will calculate the corresponding test MSE for progress tracking. A properly trained model’s training and test statistics should be close; significantly worse performance on the test set may indicate model overtraining.
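The tracking metric can be computed as below. The text only says training and test statistics should be "close"; the factor-of-2 gap used as a warning threshold here is our own illustrative choice, not an official criterion:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, the metric the organizers track."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative overtraining check on made-up numbers: a test MSE much
# larger than the training MSE is a warning sign.
train_mse, test_mse = 0.04, 0.12
overtrained = test_mse > 2.0 * train_mse  # threshold is an assumption
```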

We propose a simple Multiple Linear Regression (MLR) method, trained with feature selection, as a benchmark. In our tests, MLR models with 50 - 150 descriptors selected by sensitivity resulted in MSE approximately equal to 0.04 on both training and test sets. It is expected that machine learning methods should perform better than MLR.
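A minimal MLR sketch on synthetic data (the real descriptors require registration). Ordinary least squares with an intercept stands in for the benchmark; the sensitivity-based feature selection the organizers describe is omitted here:

```python
import numpy as np

# Synthetic stand-in data: M compounds, N descriptors in [-1, 1],
# Y generated as a noisy linear function of the descriptors.
rng = np.random.default_rng(1)
M, N = 200, 10
X = rng.uniform(-1.0, 1.0, size=(M, N))
true_w = rng.normal(size=N)
y = X @ true_w + 0.05 * rng.normal(size=M)

# Multiple Linear Regression via least squares, with an intercept column.
A = np.hstack([np.ones((M, 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
train_mse = float(np.mean((y - pred) ** 2))
```

On the real data a held-out test MSE should be reported alongside the training MSE, as described above.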



