Challenge Data

Interpreting neural network predictions for multi-label classification of music catalogs.
by Mewo


Started on Jan. 4, 2021

Challenge context

Mewo is a catalog management service aimed at production music libraries. We offer a unified and consolidated platform to store, browse and distribute musical catalogs, from data importation to showcase deployment; our platform is widely used in France by leading broadcasters, advertising companies and record labels. Our solution features a Music Information Retrieval system for audio-based recommendation and automatic tagging, based on a pipeline of signal processing algorithms and deep neural networks.

Our auto-tagging models output numerical predictions (in the interval (0,1)) for each label they have been trained on; for our end use, however, a decision must be made on whether or not to associate a track with a tag. This can be done by simple thresholding (i.e. comparing the numerical prediction to a fixed value, e.g. 0.5), but more elaborate algorithms could be used to increase the quality of the labelling.
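
As an illustration, the fixed thresholding described above can be sketched as follows. The function name and the toy prediction values are hypothetical and are not taken from the challenge data.

```python
import numpy as np

def threshold_decisions(predictions, threshold=0.5):
    """Return 1 where a prediction is strictly above the threshold, else 0."""
    predictions = np.asarray(predictions, dtype=float)
    return (predictions > threshold).astype(int)

# Toy predictions for 2 tracks and 3 tags (made-up values).
preds = np.array([[0.91, 0.12, 0.55],
                  [0.30, 0.75, 0.49]])
decisions = threshold_decisions(preds)
```

Each row of `decisions` is the binary tagging of one track; the same threshold is applied to every tag.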

This is the aim of the challenge: find the best discrimination method to convert numerical predictions into effective decisions, exploiting the intrinsic information of our dataset.

Challenge goals

More precisely, our thesaurus comprises a few hundred tags (e.g. blues-rock, electric-guitar, happy), grouped into classes (Genres, Instruments or Moods) that are in turn partitioned into categories (genres-orchestral, instruments-brass, mood-dramatic, etc.). Each audio track in our database may be tagged with one or more labels from each class, so auto-tagging is a multi-label classification problem. We can train neural networks to learn from audio features and generate numerical predictions that minimise the binary cross-entropy with respect to the one-hot encoded labelling of the dataset.

On the other hand, displaying the tagging on our front-end requires a discrete, tag-wise labelling, so a further interpretation step is needed to convert the predictions into decisions; at that stage, we can use more suitable metrics to evaluate the quality of the tagging. We want the participants of the challenge to optimise this decision problem, leveraging all the information available from the ground truth and the global predictions to design a selection algorithm that produces the most consistent labelling. In other words, the task is to build a multi-label classifier that receives as input the predictions generated by our neural networks for all tags and their categories.

Our suggested benchmark is a column-wise thresholding (see details below); this strategy uses neither the categorical predictions nor the possible correlations between tags. For example, a more row-oriented approach (for each track, select a tag based on its prediction value relative to the predictions for the other tags) or a hierarchical strategy (decide on categories first, then choose tags among the selected categories) may improve the final classifications.
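
The two alternative strategies mentioned above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names, the `ratio` rule for the row-oriented variant, the 0.5 cut-offs and the `tag_to_cat` index array (mapping each tag column to the column of its parent category) are not part of the challenge material.

```python
import numpy as np

def row_relative_decisions(tag_preds, ratio=0.8):
    """Row-oriented rule: keep the tags scoring at least `ratio` times the track's best tag."""
    tag_preds = np.asarray(tag_preds, dtype=float)
    row_max = tag_preds.max(axis=1, keepdims=True)
    return (tag_preds >= ratio * row_max).astype(int)

def hierarchical_decisions(tag_preds, cat_preds, tag_to_cat,
                           cat_threshold=0.5, tag_threshold=0.5):
    """Hierarchical rule: keep a tag only if its parent category is selected first."""
    tag_preds = np.asarray(tag_preds, dtype=float)
    cat_preds = np.asarray(cat_preds, dtype=float)
    cat_selected = cat_preds > cat_threshold        # shape (n_tracks, n_categories)
    parent_selected = cat_selected[:, tag_to_cat]   # shape (n_tracks, n_tags)
    return ((tag_preds > tag_threshold) & parent_selected).astype(int)

# Toy example: 3 tags, the first two in category 0 and the last in category 1.
tag_to_cat = np.array([0, 0, 1])
row_based = row_relative_decisions([[0.9, 0.8, 0.1],
                                    [0.2, 0.1, 0.05]])
hier_based = hierarchical_decisions([[0.8, 0.6, 0.7]], [[0.9, 0.2]], tag_to_cat)
```

In the toy example, the row-oriented rule still selects a tag for the second track even though all of its predictions are below 0.5, while the hierarchical rule drops the third tag because its category was not selected.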

Data description

For both the input and output variables, each row of data corresponds to a unique audio track in our database. The first column is ChallengeID and contains an integer indexing the track in the dataset.

Input variables

The input variables for the challenge are the outputs generated by our three neural networks, one for each class, each producing parallel tag-wise and category-wise labellings. There is one column for each tag and each category; in each row, the value is the output of the corresponding neural network for the given label, in (0,1).

Because our labels are naturally partitioned by class and by type (tag or category), the columns are not in full alphabetical order but are instead grouped by type and organized as follows:

  • ChallengeID;
  • 90 tag-genres;
  • 112 tag-instruments;
  • 46 tag-moods;
  • 18 category-genres;
  • 15 category-instruments;
  • 8 category-moods.
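
The layout above amounts to 1 + 90 + 112 + 46 + 18 + 15 + 8 = 290 columns, i.e. 248 tag columns and 41 category columns after ChallengeID. A minimal sketch of how the column groups could be addressed by position is given below; the helper name and the use of Python slices are illustrative choices, and it assumes the columns appear exactly in the order listed.

```python
# Group sizes copied from the list above, in column order.
GROUP_SIZES = [
    ("tag-genres", 90),
    ("tag-instruments", 112),
    ("tag-moods", 46),
    ("category-genres", 18),
    ("category-instruments", 15),
    ("category-moods", 8),
]

def group_slices(sizes, offset=1):
    """Map each group name to its column slice; `offset` skips the ChallengeID column."""
    slices = {}
    start = offset
    for name, size in sizes:
        slices[name] = slice(start, start + size)
        start += size
    return slices

SLICES = group_slices(GROUP_SIZES)
n_tags = sum(size for name, size in GROUP_SIZES if name.startswith("tag-"))
n_categories = sum(size for name, size in GROUP_SIZES if name.startswith("category-"))
```

With the input loaded as a 2-D array `X`, `X[:, SLICES["tag-moods"]]` would then select the 46 mood-tag predictions.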

Output variables

The output variables follow a similar structure, with the tag columns only; each cell contains either a 0 or a 1 indicating whether the track carries the label in the ground truth tagging. In other words, each line is the one-hot encoded tagging of the corresponding track over the 248 tags, in the following order:

  • ChallengeID;
  • 90 tag-genres;
  • 112 tag-instruments;
  • 46 tag-moods.

Annex files

  • labels.csv -- A three-line CSV containing the category/class relations of the labels. The header line is the same as for the output variables, with one column for each tag. The two other lines relate each tag to its category and to its class (Genres, Instruments or Moods).

  • train_y.category.csv -- A data CSV containing the category-wise one-hot encoding of the training set. We provide it for convenience, as it can easily be reconstructed by pairing the train_y data with the relations deduced from the labels.csv file. The columns are organized as follows:

    • ChallengeID;
    • 18 category-genres;
    • 15 category-instruments;
    • 8 category-moods.
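
The reconstruction mentioned above could be sketched as follows, under the assumption that a category is marked 1 for a track whenever at least one of its tags is marked 1 (the exact rule is not stated here). The function name and the toy labels are hypothetical, and the tag-to-category relation is passed in as a plain array rather than parsed from labels.csv.

```python
import numpy as np

def categories_from_tags(tag_labels, tag_categories, category_order):
    """Build category-wise 0/1 labels: 1 if any tag of the category is set (assumed rule)."""
    tag_labels = np.asarray(tag_labels)
    tag_categories = np.asarray(tag_categories)
    out = np.zeros((tag_labels.shape[0], len(category_order)), dtype=int)
    for j, category in enumerate(category_order):
        members = tag_categories == category   # boolean mask over the tag columns
        out[:, j] = tag_labels[:, members].any(axis=1)
    return out

# Toy data: 3 tracks, 3 tags belonging to 2 made-up categories.
tag_categories = ["cat-a", "cat-a", "cat-b"]
tag_labels = [[1, 0, 0],
              [0, 0, 1],
              [0, 0, 0]]
category_labels = categories_from_tags(tag_labels, tag_categories, ["cat-a", "cat-b"])
```

Comparing such a reconstruction against train_y.category.csv is a quick way to check that the relations in labels.csv have been read correctly.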

Benchmark description


The evaluation metric is the weighted F1-score over the tag-wise labelling.
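
For local experiments, the metric can be approximated with the sketch below. It assumes that "weighted" means the per-tag F1-scores are averaged with weights proportional to the number of positive ground truth examples of each tag, which is the convention of scikit-learn's f1_score with average="weighted"; the official scoring script may differ.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-tag F1-scores (columns are tags, rows are tracks)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0)
    fn = (y_true & ~y_pred).sum(axis=0)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); a tag with no positives and no predictions scores 0.
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    support = y_true.sum(axis=0)
    if support.sum() == 0:
        return 0.0
    return float((f1 * support).sum() / support.sum())

# Toy example: 4 tracks, 2 tags (made-up values).
y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 1]]
score = weighted_f1(y_true, y_pred)
```

With this weighting, frequent tags influence the score more than rare ones.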


The evaluation benchmark is obtained via a simple column-wise thresholding strategy: we treat each column as a binary labelling problem and choose, for each tag, the threshold that maximises the corresponding F1-score.
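
A possible implementation of this column-wise search is sketched below. The candidate thresholds (the distinct prediction values of each column), the ">=" decision rule and the function names are assumptions made for illustration; the benchmark's actual search procedure is not specified here.

```python
import numpy as np

def binary_f1(y_true, y_pred):
    """F1-score of one binary column, defined as 0 when there is nothing to count."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(y_true, scores):
    """Try every distinct score as a threshold; ties keep the lowest one."""
    scores = np.asarray(scores, dtype=float)
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        f1 = binary_f1(y_true, scores >= t)
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

def fit_thresholds(Y, P):
    """One threshold per tag column, each fitted independently of the others."""
    Y, P = np.asarray(Y), np.asarray(P, dtype=float)
    return np.array([best_threshold(Y[:, j], P[:, j])[0] for j in range(P.shape[1])])

# Toy training data: 4 tracks, 2 tags (made-up values).
Y = np.array([[0, 1], [0, 0], [1, 0], [1, 1]])
P = np.array([[0.1, 0.9], [0.4, 0.2], [0.35, 0.3], [0.8, 0.6]])
thresholds = fit_thresholds(Y, P)
decisions = (P >= thresholds).astype(int)
```

Thresholds fitted this way are tuned to the training set, so it is worth checking them on held-out tracks before applying them to new predictions.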


This strategy yields a weighted F1-score of 54% on the train set and 46% on the test set. This is a 13-point improvement on both sets over a simple constant thresholding at 0.5, which yields 41% and 33% on the train and test sets, respectively. However, this strategy remains purely column-wise, so we believe there is still room for improvement.



The challenge provider


Production music catalogue management.