Started on Jan. 5, 2022
As machine learning models go into production, practitioners often realize that the data matters more than the model. Until now, AI research has focused more on models than on data. Yet rather than spending time building bigger architectures or testing the latest fancy models, it is often a better use of time to iterate on the dataset.
Andrew Ng sparked this shift from model-centric to data-centric AI a few months ago (see this YouTube video). Kili embodies this shift by giving engineers the tools to work on their data.
Iterating on a dataset can mean: cleaning label errors, cleaning domain errors, preprocessing the data, identifying hidden sub-classes, carefully augmenting the data on which the model performs worst, generating new data, sourcing new data, and so on.
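As one concrete illustration, here is a minimal sketch of the first of those iterations: flagging likely label errors with out-of-fold predictions. The file name and the TF-IDF plus logistic regression combination are assumptions chosen for illustration; they are not part of the challenge pipeline.

```python
# Minimal sketch: flag likely label errors via out-of-fold predictions.
# "input.csv" and the TF-IDF + logistic regression combo are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

df = pd.read_csv("input.csv")

# Out-of-fold predictions: each review is scored by a model that
# never saw it during training.
X = TfidfVectorizer(min_df=2).fit_transform(df["review"])
pred = cross_val_predict(LogisticRegression(max_iter=1000), X, df["category"], cv=5)

# Reviews where the prediction disagrees with the stored label are
# good candidates for manual re-inspection.
suspects = df[pred != df["category"]]
print(f"{len(suspects)} reviews to re-check out of {len(df)}")
```

Disagreements are not proof of a wrong label, only a ranked shortlist of where to look first.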
To our knowledge, this challenge is the first data-centric competition in NLP, following the first data-centric computer vision challenge this summer.
Organizations are starting to pick up on this movement, using tools like Kili Technology to iterate on and better understand their datasets, which helps them reach the expected business performance.
This is a data-centric challenge. You will have to submit the best data, not the best model.
Data-centric challenge
What makes the challenge interesting is that the training pipeline is kept fixed. Instead of improving a machine learning model or training pipeline, you will have to select, augment, clean, generate, or source a training dataset, starting from the one we provide. In fact, you are allowed to feed anything to the model.
The underlying machine learning task is to predict the sentiment (positive or negative) of movie reviews.
You won't be able to choose the model, nor will you need to build complex ensembles that add no real value. To let you iterate quickly on your experiments, we provide the training script, which uses a rather simple model: fastText.
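To give a sense of what that pipeline looks like, here is a minimal sketch of a fastText supervised training run, assuming the standard fasttext Python package. The hyperparameters below are illustrative; the actual values are whatever the provided script uses.

```python
# Minimal sketch of fastText supervised training; the epoch/wordNgrams
# values are illustrative, the real ones are in the provided script.
import fasttext

# fastText expects one example per line: "__label__<category> <review text>"
model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

# model.test returns (sample count, precision@1, recall@1); with a single
# label per example, precision@1 is just accuracy.
n, p1, r1 = model.test("valid.txt")
print(f"accuracy on {n} reviews: {p1:.3f}")
```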
What happens when you make a submission?
Submissions are datasets of up to 20k movie reviews. When you submit a dataset, it is sent to our servers, where it is used to train a model with the same pipeline as the one provided. The model trained on your dataset is then evaluated on a separate test set of movie reviews, and performance is measured by accuracy on that test set.
The test set is kept hidden: otherwise you could simply submit the test set itself and let the model overfit to it.
We will reveal a small fraction of this test set (a few dozen texts) to give a sense of the test data distribution.
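Because the test set stays hidden, one way to iterate locally is to hold out a slice of the provided data as a stand-in test set. The split size, file names, and conversion below are assumptions; note that the holdout inherits the injected noise, so treat its accuracy as a rough proxy only.

```python
# Rough local proxy for the hidden test set: hold out part of the
# provided data. The holdout inherits the injected noise.
import fasttext
import pandas as pd

df = pd.read_csv("input.csv").sample(frac=1, random_state=0)  # shuffle
holdout, train = df.iloc[:1000], df.iloc[1000:]

def to_fasttext(frame, path):
    # One "__label__<category> <review>" line per example; collapse
    # whitespace so each review stays on a single line.
    with open(path, "w") as f:
        for _, row in frame.iterrows():
            text = " ".join(str(row["review"]).split())
            f.write(f"__label__{row['category']} {text}\n")

to_fasttext(train, "train.txt")
to_fasttext(holdout, "holdout.txt")

model = fasttext.train_supervised(input="train.txt")
print(model.test("holdout.txt"))  # (n, precision@1, recall@1)
```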
The input file is a dataset of 10k movie review texts.
Noise has been added to this dataset: label errors, texts taken from other datasets, augmentations, etc.
The dataset contains two columns:
- review: the text of the review (str), possibly a few sentences long
- category: the label (int), 1 for positive, 0 for negative
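Assuming the file ships as a CSV with these two columns (the name input.csv is hypothetical), a quick first look might be:

```python
# Quick look at the provided data; "input.csv" and the CSV layout
# are assumptions about the delivery format.
import pandas as pd

df = pd.read_csv("input.csv")
print(df.shape)                            # expect (10000, 2)
print(df["category"].value_counts())       # positive (1) vs negative (0) balance
print(df["review"].str.len().describe())   # review length distribution
```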
The output file has the same format as the input file. It should contain a curated, augmented, improved version of the input data, with at most 20k samples. If you submit a dataset with more than 20k points, you will get an accuracy of 0.
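Since exceeding the cap zeroes your score, it is worth checking the file before submitting. A minimal sketch, assuming the same CSV layout and a hypothetical submission.csv:

```python
# Pre-submission sanity check; file names are assumptions.
import pandas as pd

sub = pd.read_csv("submission.csv")
assert len(sub) <= 20_000, "more than 20k samples scores an accuracy of 0"
assert list(sub.columns) == ["review", "category"], "unexpected columns"
assert set(sub["category"]) <= {0, 1}, "labels must be 0 or 1"
print(f"OK: {len(sub)} samples")
```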
The test data contains 10k movie reviews.
We use a custom metric file to compute an accuracy from your training dataset. The training script is provided in the supplementary material.
Since the model is trained on our side during the evaluation phase, a simple benchmark solution is to submit the given dataset unchanged!
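In code, that baseline is a one-liner (file names assumed):

```python
# Baseline: submit the provided dataset unchanged (file names assumed).
import shutil
shutil.copy("input.csv", "submission.csv")
```

Any curation or augmentation you do then only has to beat this baseline.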