Making Sense of Sounds Data Challenge

Can machine systems categorise sound events like a human?

Update II: Labels of the evaluation set are now available!

Humans (with no hearing impairment) use sound in everyday life constantly to interpret their surrounding environment, refocus their attention, detect anomalies and communicate through language and vocal emotional expressions. They are able to identify a large number of sounds, e.g., the call of a bird, the noise of an engine, the cry of a baby, the sound of a string instrument. They are also capable of generalising from past experience to new sounds, e.g. recognising a dulcimer or a kora as a musical instrument despite having never heard this instrument before in their life. The Making Sense of Sounds data challenge calls for machine systems to attempt to replicate this human ability.

Task description

The task is to classify audio data as belonging to one of five broad categories, which were derived from human classification. In a psychological experiment at the University of Salford, participants were asked to categorise 60 sound types, chosen so as to represent the most commonly used search terms on Freesound.org. Five principal categories were identified by correspondence analysis and hierarchical cluster analysis of the human data:


Within each class the data for the task consists of varying sound types, e.g., different animals in the ‘Nature’ category or different instruments in the ‘Music’ category such as ‘guitar’ and ‘mandolin’. Most of the sound types are represented by several instances themselves, coming from different recordings, e.g. different guitars. The machine classifier is therefore forced to reproduce a human capability to be successful: Humans are able to identify a hitherto unheard animal sound as belonging to an animal based upon previously established schemas, and a hitherto unheard musical instrument as a musical instrument, etc.

The data set was randomly split per category into a development set, for which the class labels are provided, and a held-out evaluation set, for which the audio data will be published later and the labels only after the challenge is completed. Since the allocation of specific sound types to the development and evaluation sets was (pseudo-)random, the resulting sets are (intentionally) not balanced in this respect: it is not guaranteed that each sound type has proportionally the same number of samples in the development set and the evaluation set, or even that a sound type is represented in both sets at all.

We recommend, of course, cross-validation on the development set, but do not suggest a split of the data set into folds. Given the nature of the data, this might be a task not to be underestimated on its own. The use of external data and data augmentation is permitted (see Rules section below).
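As one possible way to approach the fold split, the sketch below keeps all files of a sound type together in one fold, so that validation files come from sound types not seen in training. This is an illustration, not the organisers' procedure: the function name `grouped_folds`, the greedy balancing and the `(broad class, sound type, file name)` tuple layout are assumptions.

```python
from collections import defaultdict


def grouped_folds(entries, n_folds=5):
    """Split (broad_class, sound_type, file_name) entries into folds.

    All files sharing a broad class and sound type end up in the same fold,
    so a validation fold only contains sound types unseen during training.
    """
    # Collect the files belonging to each (class, sound type) group.
    groups = defaultdict(list)
    for broad_class, sound_type, file_name in entries:
        groups[(broad_class, sound_type)].append(file_name)

    folds = [[] for _ in range(n_folds)]
    # For every class, track how many of its files each fold already holds.
    load = defaultdict(lambda: [0] * n_folds)

    # Greedy balancing: place the largest groups first, each into the fold
    # that currently holds the fewest files of that class.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for (broad_class, _sound_type), files in ordered:
        counts = load[broad_class]
        target = counts.index(min(counts))
        folds[target].extend(files)
        counts[target] += len(files)
    return folds
```

Because the number of sound types per category is not balanced, the folds will generally differ in size; the greedy step only limits that imbalance.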

The challenge results will be announced at the 2018 DCASE (Detection and Classification of Acoustic Scenes and Events) workshop.

Important dates:

Challenge announcement and development data set release: 8. Aug. 2018
Evaluation data set release: 1. Oct. 2018
Submission open: 1. Oct. 2018
Submission closing: 5. Nov. 2018 (anywhere on earth; extended from 30. Oct. 2018)
Challenge results announcement: 19/20. Nov. 2018


Audio data

The audio files were taken from the Freesound database, the ESC-50 dataset and the Cambridge-MT Multitrack Download Library.

The development dataset consists of 1500 audio files divided into the five categories, each containing 300 files. The number of different sound types within each category is not balanced. The evaluation dataset consists of 500 audio files, 100 files per category. All files have an identical format: single-channel 44.1 kHz, 16-bit .wav files. All files are exactly 5 seconds long, but may feature periods of silence, i.e. a given extract may feature a sound that is shorter than the duration of the audio file.
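The stated file format can be verified with Python's standard `wave` module. The following is a minimal sketch; the function name `check_format` and the tolerance on the duration are assumptions made for illustration.

```python
import wave


def check_format(path, expected_rate=44100, expected_seconds=5.0):
    """Return True if the file is mono, 16-bit, 44.1 kHz and 5 seconds long."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
        rate = wav.getframerate()
        frames = wav.getnframes()
    duration = frames / rate
    return (
        channels == 1
        and sample_width == 2
        and rate == expected_rate
        and abs(duration - expected_seconds) < 1e-3
    )
```

A check like this only confirms the container format; it says nothing about how much of the 5 seconds is silence.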

It should be assumed that all files in this challenge are provided under the licence CC-BY-NC 4.0 (Creative Commons, Attribution-NonCommercial). This is the most restrictive licence of any file in the dataset, though some were also provided under CC0 and CC-BY. A complete listing of the exact licences and author attributions will be released at the close of the challenge. Attribution information is being withheld until then to prevent the gathering of any classification data based on the original file names.


The class labels are provided with the Logsheet_Development.csv text file. Each entry contains the broad class, the sound type and the file name separated by commas e.g., a line might read like this:



8/8/2018: Development data set released.

1/10/2018: Evaluation data set released.

27/11/2018: Copyright attribution for all files released.

29/11/2018: Evaluation data labels released.

Development & Evaluation data sets: Here on figshare.

Submission (closed)

The system output file to be submitted should be a CSV text file. For every file in the evaluation set, there should be an entry (line) consisting of the file name and the estimated class label separated by a comma. The output file should not contain any sound type labels. A line might read like this:


Submission will be via email. The address and other details will be announced with the opening of the submission system on 1. Oct. 2018. An extended abstract of up to two pages will be required with a short description of the submitted system(s). There will be no formatting requirements.

Up to four systems can be submitted per team.
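For each submitted system, an output file in the format described above (file name and estimated class label, comma-separated, no sound type) could be written as in this sketch. The function name `write_submission` and the file names used in the usage note are hypothetical.

```python
import csv


def write_submission(predictions, path):
    """Write one 'file_name,class_label' line per evaluation file.

    predictions: iterable of (file_name, class_label) pairs.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for file_name, label in predictions:
            # Only two fields per line: no sound type label is written.
            writer.writerow([file_name, label])
```

A call such as `write_submission([("clip_a.wav", "Music"), ("clip_b.wav", "Nature")], "system_1.csv")` would produce one line per file; the clip and output names here are placeholders.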

There will be a ranking according to the error metric described below. It will be made public in a talk or poster at the DCASE 2018 workshop and published together with the extended abstracts here on the Making Sense of Sounds website.

Submission system (closed)

Please send the output file as attachment via email to: msos.submissions2018@gmail.com

(Do not use this submission email address for questions; use the general mail address displayed at the bottom of the page instead.)

Per team up to four systems can be submitted. Please submit each system with an email on its own.

The body of the email should contain the following information:


System id: UUMagicRec_1
Author names: Jane Engineer [1,2], John Scientist [2]
Authors’ affiliations:
[1] University of Nonexistence, Nowhere
[2] The Unseen University, Somewhere

As a second attachment add the required extended abstract outlining your method. Please use only PDF format! If you submit several systems, a single abstract is sufficient if it covers all systems. There are no requirements concerning the form, but of course you should give your abstract a meaningful title, mention all authors and their affiliations and provide at least one contact email address.

Receipt of the submission will be acknowledged via automated return mail. If we encounter problems with your submitted file, we will contact you at the address from which the submission was sent.


The challenge will use average accuracy as its performance measure, that is, the number of correctly classified items (files) per class divided by the total number of items (files) per class and then averaged over all classes to arrive at a single number.
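The measure described above (per-class accuracy averaged over classes) can be sketched as follows. This is an illustrative implementation, not the organisers' scoring script; the function name `average_accuracy` is an assumption.

```python
from collections import defaultdict


def average_accuracy(true_labels, predicted_labels):
    """Return (average accuracy, per-class accuracy dict)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    # Accuracy per class: correctly classified files / files in that class.
    per_class = {label: correct[label] / total[label] for label in total}
    # Unweighted mean over classes gives the single summary number.
    return sum(per_class.values()) / len(per_class), per_class
```

For example, with true labels `["Music", "Music", "Nature", "Nature", "Nature", "Nature"]` and predictions `["Music", "Nature", "Nature", "Nature", "Nature", "Music"]`, the per-class accuracies are 0.5 and 0.75 and the average is 0.625. When every class has the same number of files, as in the evaluation set, this value coincides with the overall fraction of correctly classified files.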


08/10/2018: We added a strong state-of-the-art deep learning baseline for comparison. The scores on the evaluation set are:

Average: 0.81

The baseline system is based on a VGG model with 8 convolutional layers, each with a filter size of 3x3. Batch normalisation is applied after each convolutional layer, followed by a rectifier nonlinearity. A global max pooling operation over the feature maps of the last convolutional layer then summarises them into a vector. Finally, a fully connected layer followed by a sigmoid or softmax nonlinearity generates the probabilities of the audio classes.
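To make the final stages of this description concrete, the sketch below shows only the classifier head in plain Python: global max pooling over feature maps, a fully connected layer and a softmax. The convolutional stack is omitted, and all function names, shapes and weights are assumptions for illustration rather than the baseline's actual code.

```python
import math


def global_max_pool(feature_maps):
    """Reduce each 2-D feature map (one per channel) to its maximum value."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]


def dense(vector, weights, biases):
    """Fully connected layer: one weight row and one bias per output class."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Turn logits into class probabilities that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_maps, weights, biases):
    """Pool the last-layer feature maps, then apply the dense layer and softmax."""
    return softmax(dense(global_max_pool(feature_maps), weights, biases))
```

Global max pooling makes the vector length depend only on the number of channels, not on the time-frequency size of the feature maps.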

Details will be published in an upcoming paper submitted to ICASSP 2019.



Development Data Set:

External Data:

Pre-trained Networks/Classifiers:

Evaluation Data:



21/11/2018: Twenty-three systems (including the baseline) from 12 teams were submitted, originating both from academia and industry.


Note that the results in the bar graph above are not rounded, whereas those in the official ranking table at the end of the page are. Systems that differ only in the decimal places of their percentage scores can safely be assumed not to differ significantly in performance.

Mean confusion matrix for all systems (n = 23)
Standard deviations of the confusion matrix for all systems (n = 23)

The standard deviation above is likely to be dependent on the magnitude of the mean. Accordingly the coefficient of variation (standard deviation divided by the mean) will give a more accurate picture of the variability across all systems:

Coefficients of variation of the confusion matrix for all systems (n = 23)
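The three summary matrices (mean, standard deviation, coefficient of variation) can be computed cell by cell from the individual confusion matrices, as in the sketch below. The function name `cell_statistics` is an assumption, and the population standard deviation is used here because the page does not state which variant was applied.

```python
import statistics


def cell_statistics(matrices):
    """Return (mean, std, cv) matrices computed cell-wise across systems.

    matrices: list of equally sized confusion matrices (lists of rows).
    """
    n_rows = len(matrices[0])
    n_cols = len(matrices[0][0])
    mean = [[0.0] * n_cols for _ in range(n_rows)]
    std = [[0.0] * n_cols for _ in range(n_rows)]
    cv = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            values = [matrix[i][j] for matrix in matrices]
            mu = statistics.mean(values)
            sigma = statistics.pstdev(values)  # assumption: population SD
            mean[i][j] = mu
            std[i][j] = sigma
            # Coefficient of variation: SD divided by mean (undefined at 0).
            cv[i][j] = sigma / mu if mu != 0 else float("nan")
    return mean, std, cv
```

Dividing by the mean puts cells with very different magnitudes, such as diagonal and off-diagonal entries, on a comparable scale.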


Click on the links in the system names in the table below for the confusion matrices of the individual systems and the abstracts describing the methods!


Rank | Authors | Affiliation | System | Average accuracy | Accuracy per category (five categories)
1 | Tianxiang Chen, Udit Gupta | Pindrop Security Inc. | CBIR_3 | 0.93 | 0.91, 1.00, 0.92, 0.95, 0.89
1 | Tianxiang Chen, Udit Gupta | Pindrop Security Inc. | CBIR_2 | 0.93 | 0.96, 0.97, 0.91, 0.90, 0.89
3 | Bongjun Kim | Northwestern University, USA | NU_BK_1 | 0.92 | 0.89, 1.00, 0.92, 0.89, 0.90
3 | Bongjun Kim | Northwestern University, USA | NU_BK_2 | 0.92 | 0.89, 1.00, 0.90, 0.91, 0.90
5 | Tianxiang Chen, Udit Gupta | Pindrop Security Inc. | CBIR_1 | 0.91 | 0.92, 0.97, 0.92, 0.87, 0.90
6 | Bongjun Kim | Northwestern University, USA | NU_BK_3 | 0.90 | 0.88, 0.99, 0.89, 0.85, 0.90
7 | Benjamin Elizalde, Abelino Jimenez, Bhiksha Raj | Carnegie Mellon University, USA | MLSP_ONTEMB_1 | 0.89 | 0.86, 0.99, 0.86, 0.89, 0.84
8 | Tianxiang Chen, Udit Gupta | Pindrop Security Inc. | CBIR_4 | 0.88 | 0.85, 1.00, 0.89, 0.81, 0.85
8 | Abelino Jimenez, Benjamin Elizalde, Bhiksha Raj | Carnegie Mellon University, USA | MLSP_ONTLAYER_1 | 0.88 | 0.83, 0.99, 0.82, 0.89, 0.86
10 | Aggelina Chatziagapi [1], Theodoros Giannakopoulos [1] | Behavioral Signals Technologies, INC | HCDA13 | 0.87 | 0.85, 0.98, 0.92, 0.79, 0.80
11 | Mansoor Rahimat Khan, Alexander Lerch, Hongzhao Guwalgiya, Siddharth Kumar Gururani, Ashis Pati | Georgia Tech Center for Music Technology | GTCMT_MAHSA | 0.84 | 0.80, 1.00, 0.77, 0.86, 0.76
12 | Patrice Guyot | IRIT, Université de Toulouse, CNRS, Toulouse, France | nevermind | 0.82 | 0.75, 0.97, 0.76, 0.91, 0.71
13 | Yin Cao | CVSSP, University of Surrey, UK | MSoS_baseline_1 | 0.81 | 0.70, 0.95, 0.81, 0.88, 0.70
13 | Souvic Chakraborty | IIT Kharagpur | Chakra_Souvic | 0.81 | 0.81, 0.99, 0.75, 0.77, 0.71
15 | Souvic Chakraborty | IIT Kharagpur | Chakra_Souvic ensemble | 0.80 | 0.80, 1.00, 0.77, 0.75, 0.70
15 | Hadrien Jean | Ecole Normale Supérieure, Paris | H_NET001 | 0.80 | 0.98, 0.72, 0.80, 0.58, 0.94
17 | Rajdeep Mukherjee [1], Pradhumn Goyal [1], Dipyaman Banerjee [2], Kuntal Dey [2], Pawan Goyal [1] | [1] Indian Institute of Technology, Kharagpur, India [2] IBM Research - India, New Delhi, India | Mukherjee_Goyal_1 | 0.77 | 0.69, 0.93, 0.76, 0.78, 0.69
18 | Rajdeep Mukherjee [1], Pradhumn Goyal [1], Dipyaman Banerjee [2], Kuntal Dey [2], Pawan Goyal [1] | [1] Indian Institute of Technology, Kharagpur, India [2] IBM Research - India, New Delhi, India | Mukherjee_Goyal_2 | 0.74 | 0.66, 0.90, 0.77, 0.79, 0.62
19 | Rajdeep Mukherjee [1], Pradhumn Goyal [1], Dipyaman Banerjee [2], Kuntal Dey [2], Pawan Goyal [1] | [1] Indian Institute of Technology, Kharagpur, India [2] IBM Research - India, New Delhi, India | Mukherjee_Goyal_3 | 0.66 | 0.70, 0.88, 0.53, 0.77, 0.41
20 | Md Sultan Mahmud [1], Mohammed Yeasin [1], Faruk Ahmed [1], Rakib Al-Fahad [1], and Gavin M. Bidelman [2, 3, 4] | [1] Department of Electrical & Computer Engineering, University of Memphis, Memphis, TN, USA [2] School of Communication Sciences & Disorders, University of Memphis, Memphis, TN, USA [3] Institute for Intelligent Systems, University of Memphis, Memphis, TN, USA [4] University of Tennessee Health Sciences Center, Department of Anatomy and Neurobiology, Memphis, TN, USA | CVPIALAB_ACNL | 0.58 | 0.52, 0.84, 0.52, 0.65, 0.39
21 | Rajdeep Mukherjee [1], Pradhumn Goyal [1], Dipyaman Banerjee [2], Kuntal Dey [2], Pawan Goyal [1] | [1] Indian Institute of Technology, Kharagpur, India [2] IBM Research - India, New Delhi, India | Mukherjee_Goyal_4 | 0.56 | 0.59, 0.64, 0.47, 0.67, 0.44
22 | Patrice Guyot | IRIT, Université de Toulouse, CNRS, Toulouse, France | simplemind | 0.50 | 0.46, 0.77, 0.43, 0.63, 0.23
23 | Stavros Ntalampiras | University of Milan, Italy | DAG_HMM | 0.45 | 0.46, 0.31, 0.66, 0.25, 0.59