Challenge Data

Data Centric Movie Reviews Sentiment Classification
by Kili Technology

Warning: in this challenge, your submission will be sent to Kili's external API and a whole deep learning model will be trained on the submitted data. The total processing time is around 60s, but might be higher depending on your internet speed. Be patient!

Started on Jan. 5, 2022

Challenge context

As machine learning models go into production, practitioners often realize that the data matters more than the model. Until now, AI research has focused more on models than on data. However, rather than spending time building bigger architectures or testing fancy new models, it is often a better use of time to iterate on the dataset. Andrew Ng sparked this shift from model-centric to data-centric AI a few months ago (see this YouTube video). Kili embodies this shift by giving engineers the tools to work on the data. Iterating on a dataset can mean cleaning label errors, cleaning domain errors, preprocessing the data, identifying hidden sub-classes, carefully augmenting the data on which the model performs worst, generating new data, sourcing new data, and so on. To our knowledge, this challenge is the first data-centric NLP competition, following the first computer vision challenge this summer. Organizations are starting to join this movement, using tools like Kili Technology to iterate on and better understand their datasets, which helps them achieve the expected business performance.

Challenge goals

The goal of the challenge is to predict the sentiment (positive or negative) of movie reviews. The interest of the challenge lies in the training pipeline being kept fixed: you cannot choose the model, and you do not have to build complex ensembles of models that add no real value. Instead, you will select, augment, clean, generate, or source a training dataset, starting from a given dataset. In fact, you are allowed to give anything to the model. To let you iterate quickly on your experiments, we provide the training script, which uses a rather simple model, fastText. Your submission (i.e., the training dataset) will be used to train the model on our servers, and the trained model will then be tested on a hidden test set of movie reviews. We reveal a small fraction of this test set (10 texts) to give a sense of the test data distribution.
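To make the fixed-pipeline idea concrete, here is a minimal sketch of how a curated dataset might be serialized for a fastText-style supervised classifier. The `__label__` prefix is fastText's default label marker; the helper name `to_fasttext_line` and the whitespace normalization are assumptions for illustration and are not taken from the organizers' training script.

```python
def to_fasttext_line(review, category):
    """Format one (review, category) pair as a fastText supervised-training line."""
    if category not in (0, 1):
        raise ValueError(f"category must be 0 or 1, got {category!r}")
    # fastText reads one example per line, so collapse newlines and repeated spaces.
    text = " ".join(review.split())
    return f"__label__{category} {text}"


# Hypothetical rows, for illustration only.
rows = [
    ("A moving, beautifully shot film.", 1),
    ("Two hours I will\nnever get back.", 0),
]
lines = [to_fasttext_line(review, category) for review, category in rows]
print("\n".join(lines))
```

Since the model and its training procedure are fixed, every gain has to come from which rows reach this step and how they are labeled.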

Data description

Input data

The input file is a dataset of 10k movie review texts. Noise has been added to this dataset: label errors, texts taken from other datasets, augmentations, etc. The dataset contains two columns:
- review: a text (str) which can contain a few sentences
- category: an int label (positive: 1, negative: 0)
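One cheap first pass over noisy data of this kind is to remove exact duplicates and to set aside reviews that appear with contradictory labels. The sketch below is one possible approach, not a recommended recipe: the function name, the case and whitespace normalization, and the choice to drop conflicting rows (rather than relabel them by hand) are all assumptions.

```python
from collections import defaultdict


def _normalize(review):
    """Lowercase and collapse whitespace so near-identical texts compare equal."""
    return " ".join(review.lower().split())


def drop_duplicates_and_conflicts(rows):
    """Keep the first copy of each review; drop reviews seen with both labels."""
    labels_by_text = defaultdict(set)
    for review, category in rows:
        labels_by_text[_normalize(review)].add(category)

    cleaned, seen = [], set()
    for review, category in rows:
        key = _normalize(review)
        if len(labels_by_text[key]) > 1:
            continue  # same text labeled both 0 and 1: likely a label error
        if key in seen:
            continue  # duplicate of a row already kept
        seen.add(key)
        cleaned.append((review, category))
    return cleaned


# Hypothetical rows, for illustration only.
rows = [
    ("Great movie!", 1),
    ("great  movie!", 1),           # duplicate up to case and spacing
    ("Dull and predictable.", 0),
    ("Dull and predictable.", 1),   # same text, contradictory label
    ("A charming surprise.", 1),
]
cleaned = drop_duplicates_and_conflicts(rows)
print(cleaned)
```

This only catches label errors that happen to show up as conflicting copies; mislabeled reviews that occur once, and texts from other domains, need other checks.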

Output data

The output file y has the same format as the input file. It should contain a curated, augmented, improved version of the input data, with at most 20k samples. If you submit a dataset with more than 20k samples, you will get an accuracy of 0.
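Because an oversized file scores 0, it can be worth checking a submission locally before sending it. The following is a minimal sketch that assumes the file is a CSV whose header is exactly `review,category`; the real file layout (delimiter, extra index column, header order) may differ, and `validate_submission` is a hypothetical helper, not part of the provided material.

```python
import csv
import io

MAX_SAMPLES = 20_000  # limit stated in the challenge description
EXPECTED_COLUMNS = ["review", "category"]  # assumed header, in this order


def validate_submission(csv_text):
    """Return a list of problems; an empty list means every check passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]

    problems = []
    n_rows = 0
    for n_rows, row in enumerate(reader, start=1):
        if not (row["review"] or "").strip():
            problems.append(f"row {n_rows}: empty review")
        if row["category"] not in ("0", "1"):
            problems.append(
                f"row {n_rows}: category must be 0 or 1, got {row['category']!r}"
            )
    if n_rows > MAX_SAMPLES:
        problems.append(f"{n_rows} samples exceeds the limit of {MAX_SAMPLES}")
    return problems


# Hypothetical file contents, for illustration only.
good = "review,category\nLoved every minute.,1\nA tedious mess.,0\n"
bad = "review,category\n,1\nFine I guess.,2\n"
print(validate_submission(good))
print(validate_submission(bad))
```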

Test data

The test data contains 10k movie reviews. You can find here a sample of 10 examples of the test set.

Evaluation metric

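We use a custom metric file to compute an accuracy from your training dataset: the fixed model is trained on your submission, and its accuracy is measured on the hidden test set. The training script is provided in the supplementary material.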
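The custom metric file itself is not reproduced here. As a reminder of what the final number means, the sketch below computes plain classification accuracy, the fraction of test reviews whose predicted label matches the true label; the function name and the example labels are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions are correct.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(score)  # 0.75
```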

Benchmark description

Since the model is trained during the evaluation phase, a simple benchmark solution is to submit the given dataset unchanged!



The challenge provider


Kili Technology is the leading Data-Centric AI platform to iterate and ship your AI projects to production faster. Kili is an industrial-grade, collaborative, simple application that enables you to create and manage training data.