Challenge Data

Data Centric Movie Reviews Sentiment Classification
by Kili Technology

Warning: in this challenge, your submission will be sent to Kili's external API and a whole deep learning model will be trained on the submitted data. The total processing time is around 60s, but might be higher depending on your internet speed. Be patient!




Started on Jan. 5, 2022

Challenge context

As machine learning models go into production, practitioners often realize that the data matters more than the model. Until now, AI research has focused more on models than on data. However, rather than spending time on building bigger architectures or testing fancy new models, it is often a better use of time to iterate on the datasets. Andrew Ng sparked this shift from model-centric to data-centric AI a few months ago (see this YouTube video). Kili embodies this shift by providing engineers with the tools to work on the data. Iterating on a dataset can mean: cleaning label errors, cleaning domain errors, preprocessing the data, identifying hidden sub-classes, carefully augmenting the data on which the model performs worse, generating new data, sourcing new data, etc. To our knowledge, this challenge is the first NLP data-centric competition, after the first computer vision challenge this summer. Organizations are starting to pick up this movement by using tools like Kili Technology to iterate on and better understand their datasets, which helps achieve the expected business performance.

Challenge goals

This is a data-centric challenge. You will have to submit the best data, not the best model.

Data centric challenge

The interest of the challenge lies in the training pipeline being kept fixed. Instead of improving a machine learning model or training pipeline, you will have to select, augment, clean, generate, or source a training dataset, starting from a given dataset. In fact, you are allowed to feed the model any training data you like.

Movie reviews

The underlying machine learning task is to predict the sentiment (positive or negative) of movie reviews. You cannot choose the model, and you will not need to build complex ensembles of models that add no real value. To let you iterate quickly on your experiments, we provide the training script, which uses a rather simple model, fastText.
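Since the model is fastText, it can help to preview what a labeled review looks like in fastText's supervised text format, where each line starts with a `__label__` prefix. The sketch below is only an illustration: the helper name `to_fasttext_line` is hypothetical, and the provided training script may preprocess the text differently.

```python
def to_fasttext_line(review: str, category: int) -> str:
    """Format one labeled review as a fastText supervised-training line.

    Hypothetical helper: the challenge's own training script is not shown
    here and may apply different preprocessing.
    """
    # fastText reads one example per line, so collapse newlines and
    # repeated whitespace inside the review into single spaces.
    text = " ".join(review.split())
    # "__label__" is fastText's default label prefix.
    return f"__label__{category} {text}"


rows = [
    ("A moving, beautifully shot film.", 1),
    ("Two hours I will\nnever get back.", 0),
]
lines = [to_fasttext_line(review, category) for review, category in rows]
for line in lines:
    print(line)
```

Writing such lines to a text file gives a rough local stand-in for the input of a fastText supervised model, which is handy for quick experiments before submitting.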

What happens when you make a submission?

Submissions are datasets of up to 20k movie reviews. When you submit a dataset, it is sent to our servers, where it is used to train a model with the same pipeline as the one provided. The model trained on your dataset is then tested on a separate test set of movie reviews, and performance is measured as the accuracy on this test set. The test set is kept hidden, because otherwise you could simply submit the test set itself and let the model overfit on it. We will reveal a small fraction of this test set (a few dozen texts) to give a sense of the test data distribution.

Data description

Input data

The input file is a dataset of 10k movie review texts. Noise has been added to this dataset: label errors, texts taken from other datasets, augmentations, etc. The dataset contains two columns:
- review: a text (str) which can contain a few sentences
- category: (int) the sentiment label (positive: 1, negative: 0)
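A first cleaning pass over such a file could look like the sketch below. It assumes the data is a CSV with a header row naming the two columns, which is not stated above; the function name `load_reviews` and the inline sample are hypothetical. It only removes empty texts, invalid labels, and exact duplicates. It does not detect label errors or texts from other domains, which need closer inspection.

```python
import csv
import io

# Tiny inline sample standing in for the real input file (assumed CSV layout).
SAMPLE_CSV = """review,category
"A moving, beautifully shot film.",1
"A moving, beautifully shot film.",1
"",0
"Two hours I will never get back.",0
"Great battery life on this phone.",2
"""


def load_reviews(handle):
    """Read (review, category) pairs, skipping obviously unusable rows."""
    rows = []
    seen = set()
    for record in csv.DictReader(handle):
        text = (record.get("review") or "").strip()
        label = (record.get("category") or "").strip()
        # Drop empty texts and labels other than 0 / 1.
        if not text or label not in {"0", "1"}:
            continue
        # Drop exact duplicate texts, keeping the first occurrence.
        if text in seen:
            continue
        seen.add(text)
        rows.append((text, int(label)))
    return rows


reviews = load_reviews(io.StringIO(SAMPLE_CSV))
print(len(reviews), "usable reviews")
```

For a real file, the same function can be called on `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the downloaded dataset.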

Output data

The output file y has the same format as the input file. It should contain a curated, augmented, improved version of the data of the input file. The output file should contain at most 20k samples. If you submit a dataset with more than 20k points, you will get an accuracy of 0.
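Because an oversized submission scores 0, a quick local check before uploading can save an attempt. The following is a minimal sketch, assuming the submission is held in memory as a list of `(review, category)` pairs; the name `check_submission` is hypothetical, and the non-empty-text rule is an extra assumption rather than a stated requirement.

```python
MAX_SAMPLES = 20_000  # limit stated in the challenge description


def check_submission(rows):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(rows) > MAX_SAMPLES:
        problems.append(
            f"{len(rows)} samples exceeds the limit of {MAX_SAMPLES}"
        )
    for i, (review, category) in enumerate(rows):
        # Assumption: an empty review is treated as an error here.
        if not isinstance(review, str) or not review.strip():
            problems.append(f"row {i}: review must be a non-empty string")
        # Labels must be 1 (positive) or 0 (negative).
        if category not in (0, 1):
            problems.append(f"row {i}: category must be 0 or 1")
    return problems


print(check_submission([("Loved it.", 1), ("Dull and overlong.", 0)]))
```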

Test data

The test data contains 10k movie reviews.

Evaluation metric

We use a custom metric file to compute an accuracy from your training dataset: a model is trained on your submission and scored on the hidden test set. We provide the training script in the supplementary material.
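The custom metric file itself is not reproduced here, but the final score is an accuracy, i.e. the fraction of test reviews whose predicted label matches the true one. A minimal sketch of that computation, useful for scoring a locally trained model on your own held-out split (the function name `accuracy` is an illustrative choice):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Three of four predictions match the true labels.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)  # 0.75
```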

Benchmark description

Since the model is trained during the evaluation phase, a simple benchmark solution is to submit the given dataset as is!



The challenge provider


Kili Technology is the leading Data-Centric AI platform to iterate and ship your AI projects to production faster. Kili is an industrial-grade, collaborative, simple application that enables you to create and manage training data.