--

 

Making Sense of Sounds Data Challenge

Can machine systems categorise sound events like a human?

Update: Extension of submission deadline to 5. November 2018.

Humans (with no hearing impairment) use sound in everyday life constantly to interpret their surrounding environment, refocus their attention, detect anomalies and communicate through language and vocal emotional expressions. They are able to identify a large number of sounds, e.g., the call of a bird, the noise of an engine, the cry of a baby, the sound of a string instrument. They are also capable of generalising from past experience to new sounds, e.g. recognising a dulcimer or a kora as a musical instrument despite having never heard this instrument before in their life. The Making Sense of Sounds data challenge calls for machine systems to attempt to replicate this human ability.

Task description

The task is to classify audio data as belonging to one of five broad categories, which were derived from human classification. In a psychological experiment at the University of Salford, participants were asked to categorise 60 sound types, chosen so as to represent the most commonly used search terms on Freesound.org. Five principal categories were identified by correspondence analysis and hierarchical cluster analysis of the human data:

Nature
Human
Music
Effects
Urban

Within each class the data for the task consists of varying sound types, e.g., different animals in the ‘Nature’ category or different instruments in the ‘Music’ category such as ‘guitar’ and ‘mandolin’. Most of the sound types are represented by several instances themselves, coming from different recordings, e.g. different guitars. The machine classifier is therefore forced to reproduce a human capability to be successful: Humans are able to identify a hitherto unheard animal sound as belonging to an animal based upon previously established schemas, and a hitherto unheard musical instrument as a musical instrument, etc.

The data set was randomly split per category in a development set, for which the class labels are provided and a held-out evaluation set, for which the audio data will published later and the labels only after the challenge is completed. Since the allocation of specific sound types to the development and evaluation sets was (pseudo-) random, the resulting sets are (intentionally) not balanced in this respect, e.g., it is not guaranteed that the number of samples for each sound type is proportionally equal in the development set and the evaluation set or even that the sound type is represented at all in both data sets.

We recommend, of course, cross-validation on the development set, but do not suggest a split of the data set into folds. Given the nature of the data, this might be a task not to be underestimated on its own. The use of external data and data augmentation is permitted (see Rules section below).

The challenge results will be announced at the 2018 DCASE (Detection and Classification of Acoustic Scenes and Events) workshop.

Important dates:

Challenge announcement and development data set release: 8. Aug. 2018
Evaluation data set release: 1. Oct. 2018
Submission open: 1. Oct. 2018
Submission closing: 30. Oct. 2018 5. Nov. 2018 (anywhere on earth)
Challenge results announcement: 19/20. Nov. 2018

Dataset

Audio data

The audio files were taken from Freesound data base, the ESC-50 dataset and the Cambridge-MT Multitrack Download Library.

The development dataset consists of 1500 audio files divided into the five categories, each containing 300 files. The number of different sound types within each category is not balanced. The evaluation dataset consists of 500 audio files, 100 files per category. All files have an identical format: single-channel 44.1 kHz, 16-bit .wav files. All files are exactly 5 seconds long, but may feature periods of silence, i.e. a given extract may feature a sound that is shorter than the duration of the audio file.

It should be assumed that all files in this challenge are provided under the licence
CC-BY-NC 4.0 (Creative Commons, Attribution Noncommercial). This is the most restrictive licence of any file in the dataset, though some were also provided under CC0 and CC-BY. A complete listing of the exact licences and author attributions will be released at the close of the challenge. Attribution information is being withheld until then to prevent the gathering of any classification data based on the original file names.

Labels

The class labels are provided with the Logsheet_Development.csv text file. Each entry contains the broad class, the sound type and the file name separated by commas e.g., a line might read like this:

Effects,Beep,TN7.wav

Download

1/10/2018: Evaluation data set released.

Development & Evaluation data sets: Here on figshare.

Submission

The system output file to be submitted should be a CSV text file. For every file in the evaluation set, there should be an entry (line) consisting of the file name and the estimated class label separated by a comma. The output file should not contain any sound type labels. A line might read like this:

94Q.wav,Human

Submission will be via email. The address and other details will be announced with the opening of the submission system on 1. Oct. 2018. An extended abstract of up to two pages will be required with a short description of the submitted system(s). There will be no formatting requirements.

Up to four systems can be submitted per team.

There will be a ranking according to the error metric described below. It will be made public in a talk or poster at the DCASE 2018 workshop and published together with the extended abstracts here on the Making Sense of Sounds website.

Submission system opened

Please send the output file as attachment via email to: msos.submissions2018@gmail.com

(Do no use this submission email address for questions but the general mail address displayed at the bottom of the page)

Per team up to four systems can be submitted. Please submit each system with an email on its own.

The body of the email should contain the following information:

Example:

System id: UUMagicRec_1
Author names: Jane Engineer [1,2], John Scientist [2]
Authors’ affiliations:
[1] University of Nonexistence, Nowhere
[2] The Unseen University, Somewhere

As a second attachment add the required extended abstract outlining your method. Please use only PDF format! If you submit several systems, a single abstract is sufficient if it covers all systems. There are no requirements concerning the form, but of course you should give your abstract a meaningful title, mention all authors and their affiliations and provide at least one contact email address.

The reception of the submission will be acknowledged via automated return mail. If we encounter problems with your submitted file, we will contact you at the address from which the submission was sent.

Evaluation

The challenge will use average accuracy as its performance measure, that is, the number of correctly classified items (files) per class divided by the total number of items (files) per class and then averaged over all classes to arrive at a single number.

Baseline

08/10/2018: We added a strong state-of-the-art deep learning baseline for comparison. The scores on the evaluation set are:

Average: 0.81

The baseline system is based on a VGG model with 8 convolutional layers, the filter size of each convolutional layer is 3x3. Batch normalisation is applied after each convolutional layer followed by a rectifier. Then a global max pooling operation on the feature maps of the last convolutional layer is utilised to summarise the feature maps to a vector. Finally, a fully connected layer is followed by a sigmoid or softmax nonlinearity to generate the probabilities of the audio classes.

Details will be published in an upcoming paper submitted to ICASSP 2019.

 

Rules

Development Data Set:

External Data:

Pre-trained Networks/Classifiers:

Evaluation Data:

Contact

MSoS.challenge@gmail.com