Data Centric Movie Reviews Sentiment Classification
by Kili Technology
Warning: in this challenge, your submission will be sent to Kili's external API and a whole deep learning model will be trained on the submitted data. The total processing time is around 60s, but might be higher depending on your internet speed. Be patient!
As machine learning models go into production, practitioners often realize that the data matters more than the model. Until now, AI research has focused more on models than on data. However, rather than spending time building bigger architectures or testing new fancy models, it is often a better use of time to iterate on the datasets.
Andrew Ng sparked this shift from model-centric to data-centric AI a few months ago (see this YouTube video). Kili embodies this shift by providing engineers with the tools to work on the data.
Iterating on a dataset can mean: cleaning label errors, cleaning domain errors, preprocessing the data, identifying hidden sub-classifications, carefully augmenting the data on which the model performs worst, generating new data, sourcing new data, etc.
To our knowledge, this challenge is the first NLP data-centric competition, after the first computer vision challenge this summer.
Organizations are starting to pick up this movement by using tools like Kili Technology to iterate on and better understand their datasets, which helps them achieve the expected business performance.
The goal of the challenge is to predict the sentiment (positive or negative) of movie reviews. The interest of the challenge lies in the training pipeline being kept fixed.
You won't be able to choose the model, nor will you have to create complex ensembles of models that add no real value.
Instead, you will have to select, augment, clean, generate, source, etc. a training dataset, starting from a given dataset. In fact, you are allowed to give anything to the model.
To allow you to iterate fast on your experiments, we provide you with the training script, which uses a rather simple model, fastText.
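For local experiments, it can help to know that fastText's supervised mode reads plain text files in which each line starts with a `__label__` prefix followed by the text. The sketch below converts (review, category) pairs into that format. The function names are hypothetical, and the provided training script may prepare the data differently, so treat this as an illustration rather than the script's actual preprocessing.

```python
def to_fasttext_line(text, label):
    """Format one example as a fastText supervised training line.

    Newlines and repeated whitespace are collapsed because fastText
    treats every line of the training file as a separate example.
    """
    flat = " ".join(text.split())
    return f"__label__{label} {flat}"


def to_fasttext_lines(rows):
    """Convert an iterable of (review, category) pairs to fastText lines."""
    return [to_fasttext_line(text, label) for text, label in rows]


example_rows = [
    ("A warm,\nfunny film.", 1),
    ("Dull   and overlong.", 0),
]
for line in to_fasttext_lines(example_rows):
    print(line)
```

With the `fasttext` Python package installed, a file made of such lines could be passed to `fasttext.train_supervised(input=...)` to get a rough local score before submitting.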
Your submissions (i.e., the training dataset) will be used to train the model on our servers, which will then be tested on a hidden test set of movie reviews.
We reveal a small fraction of this test set (10 texts) to give a sense of the test data distribution.
The input file is a dataset of 10k movie review texts.
Noise has been added to this dataset: label errors, texts taken from other datasets, augmentations, etc.
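One cheap first pass over this kind of noise is to look for reviews that appear several times, possibly with contradictory labels. The sketch below is only an assumption about what a useful check might look like: it groups reviews by a lowercased, whitespace-normalized key, keeps one copy of each consistently labeled text, and sets aside texts whose copies disagree. It does not detect out-of-domain texts or label errors on unique reviews.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so near-identical copies match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def split_duplicates(rows):
    """Return (deduplicated rows, texts whose duplicates disagree on the label)."""
    labels_by_key = defaultdict(set)
    first_seen = {}
    for text, label in rows:
        key = normalize(text)
        labels_by_key[key].add(label)
        first_seen.setdefault(key, text)  # keep the first spelling encountered

    clean, conflicts = [], []
    for key, labels in labels_by_key.items():
        if len(labels) == 1:
            clean.append((first_seen[key], next(iter(labels))))
        else:
            conflicts.append(first_seen[key])
    return clean, conflicts


noisy_rows = [
    ("Great movie!", 1),
    ("great  movie!", 1),  # duplicate, same label
    ("Boring.", 0),
    ("boring.", 1),        # duplicate, conflicting label
]
clean_rows, conflicting_texts = split_duplicates(noisy_rows)
print(clean_rows)
print(conflicting_texts)
```

Conflicting texts are returned separately rather than dropped, so they can be relabeled by hand.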
The dataset contains two columns:
- review: a text (str), which can contain a few sentences
- category: the label (int), with positive = 1 and negative = 0
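As an illustration of this schema, the following sketch reads such a file and checks each row. It assumes the data is a comma-separated file with a header row containing `review` and `category`; the actual file format and the `load_reviews` helper are assumptions, not something the challenge specifies.

```python
import csv
import io

EXPECTED_COLUMNS = ["review", "category"]  # column names from the description


def load_reviews(handle):
    """Read (review, category) pairs from an open CSV handle, validating labels."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    rows = []
    for row in reader:
        text = (row["review"] or "").strip()
        label = (row["category"] or "").strip()
        if label not in {"0", "1"}:
            raise ValueError(
                f"line {reader.line_num}: category must be 0 or 1, got {label!r}"
            )
        rows.append((text, int(label)))
    return rows


# In-memory stand-in for the real input file.
sample = io.StringIO(
    'review,category\n"A warm, funny film.",1\n"Dull and overlong.",0\n'
)
reviews = load_reviews(sample)
print(reviews)
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")`.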
The output file y has the same format as the input file. It should contain a curated, augmented, improved version of the data of the input file. The output file should contain at most 20k samples. If you submit a dataset with more than 20k points, you will get an accuracy of 0.
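Because an oversized submission scores an accuracy of 0, a quick check before uploading can save a wasted attempt. The sketch below is a hypothetical pre-submission check, not part of the official tooling: it verifies the 20k-sample limit stated above and that every label is 0 or 1.

```python
MAX_SAMPLES = 20_000  # limit stated in the challenge description


def check_submission(rows, max_samples=MAX_SAMPLES):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if len(rows) > max_samples:
        problems.append(
            f"{len(rows)} samples exceeds the limit of {max_samples}"
        )
    bad_labels = [i for i, (_, label) in enumerate(rows) if label not in (0, 1)]
    if bad_labels:
        problems.append(
            f"{len(bad_labels)} rows have a category other than 0 or 1 "
            f"(first at index {bad_labels[0]})"
        )
    return problems


ok_rows = [("A warm, funny film.", 1), ("Dull and overlong.", 0)]
too_many = [("Same review.", 1)] * (MAX_SAMPLES + 1)

print(check_submission(ok_rows))
print(check_submission(too_many))
```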
The test data contains 10k movie reviews. You can find here a sample of 10 examples of the test set.
We use a custom metric file to compute an accuracy from your training dataset. We provide the training script in the supplementary material.
Since the model is trained as part of the evaluation phase, a simple benchmark solution is to submit the given dataset unchanged!
Files are accessible when logged in and registered to the challenge
The challenge provider
Kili Technology is the leading Data-Centric AI platform to iterate and ship your AI projects to production faster. Kili is an industrial-grade, collaborative, simple application that enables you to create and manage training data.