Iterative deep neural networks for speaker-independent binaural blind speech separation
Q. Liu, Y. Xu, P. Coleman, P. Jackson and W. Wang
We propose an iterative deep neural network (DNN)-based binaural source separation scheme for recovering two concurrent speech signals in a room environment. Besides the commonly used spectral features, the DNN also takes non-linearly wrapped binaural spatial features as input, which are refined iteratively using parameters estimated from the DNN output via a feedback loop. Different DNN structures have been tested, including a classic multilayer perceptron regression architecture as well as a new hybrid network with both convolutional and densely-connected layers. Objective evaluations in terms of PESQ and STOI showed consistent improvement over baseline methods using traditional binaural features, especially when the hybrid DNN architecture was employed. In addition, our proposed scheme is robust to mismatches between the training and testing data.
Audio Tagging with Connectionist Temporal Classification Model Using Sequential Labelled Data
Y. Hou, Q. Kong and S. Li
Labelling audio events is a common task when building audio datasets, but it is known to be very time-consuming. Reducing the complexity of labelling audio events will not only save annotation effort but also make it easier to increase the size of audio datasets. This poster presents a new type of acoustic event label, Sequential Labelled Data (SLD), in which the presence of each audio event is labelled in order of occurrence in time. To utilize SLD in audio tagging, a Convolutional Recurrent Neural Network trained with a Connectionist Temporal Classification objective function (CRNN-CTC) is proposed. Experiments show that CRNN-CTC performs well in audio tagging.
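As a rough illustration of how a CTC output relates to sequential labels, the sketch below collapses a frame-level path into an event sequence by merging consecutive repeats and dropping blanks, which is the standard CTC collapsing rule. The blank symbol, the event names and the helper `tags_from_sequence` are assumptions made for this example and are not taken from the poster.

```python
BLANK = "-"  # hypothetical blank symbol, chosen for this example only


def ctc_collapse(path, blank=BLANK):
    """Collapse a frame-level CTC path: merge consecutive repeats, drop blanks."""
    collapsed = []
    previous = None
    for symbol in path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if symbol != previous and symbol != blank:
            collapsed.append(symbol)
        previous = symbol
    return collapsed


def tags_from_sequence(sequence):
    """Reduce an ordered event sequence to clip-level tags (hypothetical helper)."""
    return sorted(set(sequence))


# Made-up frame-wise path for a clip containing dog, dog, car in that order.
frames = ["-", "dog", "dog", "-", "dog", "car", "car", "-"]
events = ctc_collapse(frames)
print(events)
print(tags_from_sequence(events))
```

A blank between two identical symbols is what lets the same event appear twice in a row in the collapsed sequence.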
Audio Set classification with attention model: A probabilistic perspective
Exploiting sparsity in array optimisation, source separation and tracking
M. Barnard and Wenwu Wang
Beamforming has been studied extensively, especially in the areas of smart antennas, microphone arrays and hydrophone arrays. Recent work in this area has focussed on the optimal design of arrays, including sparse compressive sensing methods that aim to reduce the number of sensors in the array whilst retaining the desired response. We take the opposite approach: given an array with arbitrarily missing sensors, we use sparse optimisation methods to improve the response of this array. This problem particularly pertains to situations such as underwater hydrophone arrays, where there is a high failure rate of individual sensors and the replacement of sensors is extremely difficult. We integrate the configuration of the array with failed sensors into the convex optimisation problem as an additional constraint. This produces a weighting on the remaining sensors that improves the response of the damaged array, in particular reducing sidelobe levels.
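To make the idea of a damaged array concrete, the sketch below evaluates the far-field response of a uniform linear array when some sensor weights are forced to zero. It only evaluates the response; it does not perform the convex optimisation described above. The uniform linear geometry, the half-wavelength spacing and the function names are assumptions made for this example.

```python
import cmath
import math


def array_response(weights, angle_deg, spacing=0.5):
    """Magnitude of the far-field response of a uniform linear array.

    `spacing` is the sensor spacing in wavelengths (assumed 0.5 here).
    """
    phase_step = 2 * math.pi * spacing * math.sin(math.radians(angle_deg))
    total = sum(w * cmath.exp(1j * n * phase_step) for n, w in enumerate(weights))
    return abs(total)


def mask_failed(weights, failed):
    """Zero the weights of failed sensors (indices are hypothetical)."""
    return [0.0 if n in failed else w for n, w in enumerate(weights)]


intact = [1.0] * 8                  # uniform weighting on 8 sensors
damaged = mask_failed(intact, {2, 5})  # sensors 2 and 5 have failed

# Compare the two responses over a coarse grid of arrival angles.
for angle in range(-90, 91, 30):
    print(angle, round(array_response(intact, angle), 3),
          round(array_response(damaged, angle), 3))
```

In an optimisation setting, the zeroed entries would act as fixed constraints while the remaining weights are chosen to shape the response.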
Searching sound-effects using timbre
A. Pearce, T. Brookes, and R. Mason
Typically, much sound library content is annotated with tags according to the sound source/object (scream, car-crash, orchestra, etc.). The ability to search through these libraries could be greatly enhanced by allowing users to filter results using timbral descriptors (e.g. bright, soft, rough). As part of the AudioCommons project, the Institute of Sound Recording is developing automatic timbral annotation tools. In this demonstration, timbral metadata for hardness, depth, brightness, and roughness has been calculated using our perceptual modelling algorithms. Two of these attributes can then be selected to project the analysed sounds onto a two-dimensional space.
Listener Adaptive Transaural for Immersive 3D audio reproduction
Loudspeaker arrays allow a sound field to be controlled accurately. One of their uses is to reproduce binaural audio by controlling the radiated pressure at the ears of one or more listeners, which is also known as Transaural audio. Transaural audio works by creating “virtual headphones” at the position of the listeners’ ears, which makes the technique quite dependent on the listener’s position. To avoid this, a computer vision system can be used to modify the output of the Transaural control algorithm so that the virtual headphones are always locked to the listener’s position. This technology has been implemented in a series of listener-adaptive soundbars developed in the S3A project. These soundbars will be demonstrated as an alternative to headphones for 3D audio reproduction.
Object-based Production Tools
A. Franck and L. Remaggi
We present tools for object-based production implemented as digital audio workstation plugins. They enable the integration of most of the components researched and developed within the S3A project – from creation, production, object-based mastering and intelligent metering to reproduction rendering – into one delivery chain. In this way we demonstrate how the different aspects of the S3A approach – for instance a rich set of object types, extensive use of semantic metadata, and multiple rendering methods – work together to enable novel object-based audio experiences.
Overview and comparison of musical source separation approaches and remixing the results
R. Kim and M. Helal
We used the SiSEC 2018 music database to benchmark a representative algorithm from each of several families of source separation methods. The blind (unsupervised) methods include a tensor decomposition method, Non-negative Matrix Factorisation; a multivariate analysis method that recovers latent variables from mixtures, Independent Component Analysis; and a signal processing method, the Degenerate Unmixing Estimation Technique (DUET), using k-means and GMM clustering on 1-D and 2-D histograms. The supervised machine learning method is a Recurrent Neural Network. We will demonstrate the original mixture and sources against the separated sources for these algorithms, along with a Signal to Distortion Ratio (SDR) comparison. In addition, a re-mixing use case will be demonstrated, in which users can create a binaural auditory scene using the separated sources as objects. It can be shown that, with an adequate re-mixing strategy in terms of level and location, the degradations from imperfect source separation algorithms can be perceptually acceptable.
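For readers unfamiliar with the metric, the sketch below computes a simplified Signal to Distortion Ratio as the energy of the reference source over the energy of the estimation error, in decibels. This is an assumption-laden simplification: evaluation campaigns such as SiSEC typically use the BSS Eval definition, which also decomposes the error into interference and artefact terms. The function name and the sample values are made up for illustration.

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    if signal_energy == 0:
        raise ValueError("reference signal has zero energy")
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)


# Made-up samples: the estimate is the reference scaled by 0.9.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 2))
```

Higher values indicate that the separated source is closer to the reference.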
End-to-end audio source separation using multi-resolution convolutional auto-encoders
E. M. Grais
Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multi-resolution convolutional neural network that works on the time-domain signals to determine appropriate multi-resolution features for separating the singing-voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.
Sentimental Audio Memories demonstrator
T. Duel and D. Frohlich
SAM revolves around designing new means to support sentimental audio capture. We propose an audio recorder and a tangible database with a twist: incorporating GPS tracking, serendipity and machine learning into the practice of recording, archiving and listening back.
The Turning Forest: an Immersive VR Experience
The Turning Forest is a virtual reality fairytale written by Shelley Silas, directed by Oscar Raby and produced by VRTOV, with 3D sound by BBC Research and Development in collaboration with the S3A project.
Virtually (Re)Presenting Place (vRSP) project
vRSP is developing outputs which are exemplars of a wholly new genre of artwork, designed for virtual space but based on notable and rare historic places for which music will be written to integrate musical and social memory, and exploit unique architectural and acoustic environments. Live performances of newly composed works will be captured and disseminated to viewers and listeners as VR content through the application of a series of novel production techniques. These will enhance and address specific difficulties presented by current technologies, as described below, by applying interdisciplinary knowledge combining research from human perception, neuroscience, musical composition, animation, visual surface-mapping techniques and visual signal processing, stereoscopic 360-degree video and three-dimensional audio recording and reproduction methods. The network draws together experienced practitioners from all fields to inform the creation of a new work that will portray the intrinsic qualities of the historic performance space through site-specific live performance and a combination of contemporary video- and animation-based techniques for VR generation.
Due to the size of the rooms, please see the timings for your group
The Autumn Forest: 3D audio experience: Audio Visual Lab (09BB00)
In this demo, the object-based approach is exploited to reproduce the VR-based production over a 22.1-channel configuration using the Surrey Sound Sphere.
Object-based Reverberation Editor GUI: Audio Visual Lab (09BB00)
This demonstration is a version of the S3A Reverb Object for immersive object-based audio production. It showcases a library of different room data that is used to reconstruct an impulse response for reverberation in 3D. The rooms are loaded via a drop-down, and parameters in the UI are updated to show the room’s perceptual quality. These parameters can be edited in real time to improve or change the quality of the loaded room. The editor is integrated into a digital audio workstation (DAW), in which its state can be saved and used in specific DAW sessions. The parameters are low-level and, together with panning, are intended to reshape segments of the impulse response: Early, Attack, Level and Decay.
Immersive audio using orchestrated personal devices: Living Lab (12BB01)
It’s possible to create great listening experiences with expensive spatial audio systems that use many high-quality loudspeakers in carefully specified positions, but it’s hard to produce the same level of immersion in a normal living room. However, it’s likely that there are a number of devices in the living room that are capable of producing sound: mobile phones, tablets, laptops, smart speakers, and so on. Media device orchestration is the concept of using an ad hoc set of connected devices to augment the reproduction of a media experience. We’re using orchestrated devices to produce immersive spatial audio at home.