Making Sense of Sounds Data Challenge

Can machine systems categorise sound events like a human?

Humans (with no hearing impairment) use sound in everyday life constantly to interpret their surrounding environment, refocus their attention, detect anomalies and communicate through language and vocal emotional expressions. They are able to identify a large number of sounds, e.g., the call of a bird, the noise of an engine, the cry of a baby, the sound of a string instrument. They are also capable of generalising from past experience to new sounds, e.g. recognising a dulcimer or a kora as a musical instrument despite having never heard this instrument before in their life. The Making Sense of Sounds data challenge calls for machine systems to attempt to replicate this human ability.

Task description

The task is to classify audio data as belonging to one of five broad categories, which were derived from human classification. In a psychological experiment at the University of Salford, participants were asked to categorise 60 sound types, chosen so as to represent the most commonly used search terms on Freesound.org. Five principal categories were identified by correspondence analysis and hierarchical cluster analysis of the human data:


Within each class the data for the task consists of varying sound types, e.g., different animals in the ‘Nature’ category or different instruments in the ‘Music’ category such as ‘guitar’ and ‘mandolin’. Most of the sound types are represented by several instances themselves, coming from different recordings, e.g. different guitars. The machine classifier is therefore forced to reproduce a human capability to be successful: Humans are able to identify a hitherto unheard animal sound as belonging to an animal based upon previously established schemas, and a hitherto unheard musical instrument as a musical instrument, etc.

The data set was randomly split per category in a development set, for which the class labels are provided and a held-out evaluation set, for which the audio data will published later and the labels only after the challenge is completed. Since the allocation of specific sound types to the development and evaluation sets was (pseudo-) random, the resulting sets are (intentionally) not balanced in this respect, e.g., it is not guaranteed that the number of samples for each sound types is proportionally equal in the development set and the evaluation set or even that the sound type is represented at all in both data sets.

We recommend, of course, cross-validation on the development set, but do not suggest a split of the data set into folds. Given the nature of the data, this might be a task not to be underestimated on its own. The use of external data and data augmentation is permitted (see Rules section below).

The challenge results will be announced at the 2018 DCASE (Detection and Classification of Acoustic Scenes and Events) workshop.

Important dates:

Challenge announcement and development data set release: 8. Aug. 2018
Evaluation data set release: 1. Oct. 2018
Submission open: 1. Oct. 2018
Submission closing: 30. Oct. 2018
Challenge results announcement: 19/20. Nov. 2018


Audio data

The audio files were taken from Freesound data base, the ESC-50 dataset and the Cambridge-MT Multitrack Download Library.

The development dataset consists of 1500 audio files divided into the five categories, each containing 300 files. The number of different sound types within each category is not balanced. The evaluation dataset consists of 500 audio files, 100 files per category. All files have an identical format: single-channel 44.1 kHz, 16-bit .wav files. All files are exactly 5 seconds long, but may feature periods of silence, i.e. a given extract may feature a sound that is shorter than the duration of the audio file.

It should be assumed that all files in this challenge are provided under the licence
CC-BY-NC 4.0 (Creative Commons, Attribution Noncommercial). This is the most restrictive licence of any file in the dataset, though some were also provided under CC0 and CC-BY. A complete listing of the exact licences and author attributions will be released at the close of the challenge. Attribution information is being withheld until then to prevent the gathering of any classification data based on the original file names.


The class labels are provided with the Logsheet_Development.csv text file. Each entry contains the broad class, the sound type and the file name separated by commas e.g., a line might read like this:



Development data set: Here on figshare.

Evaluation data set: Will be published later.


The system output file to be submitted should be a CSV text file. For every file in the evaluation set, there should be an entry (line) consisting of the file name and the estimated class label separated by a comma. The output file should not contain any sound type labels. A line might read like this:


Details about the submission system will be published later.


The challenge will use average accuracy as its performance measure, that is, the number of correctly classified items (files) per class divided by the total number of items (files) per class and then averaged over all classes to arrive at a single number.


Development Data Set:

External Data:

Pre-trained Networks/Classifiers:

Evaluation Data: