Dataset Overview

We present "Tragic Talkers", an audio-visual dataset consisting of excerpts from the "Romeo and Juliet" drama, captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-person conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays. Additionally, we provide voice activity labels, 2D face bounding boxes for each camera view, 2D pose detection keypoints, 3D tracking data of the actors' mouths, and dialogue transcriptions.

The scenes were captured at the Centre for Vision, Speech & Signal Processing (CVSSP) of the University of Surrey (UK) with the aid of two identical Audio-Visual Array (AVA) Rigs. Each AVA Rig is a custom device consisting of a 16-element microphone array and 11 cameras fixed on a flat Perspex sheet. For more information, please refer to the paper (see below) or contact the authors.
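As a rough illustration of how the released modalities might be traversed programmatically, the minimal Python sketch below collects the per-view videos, microphone-array audio, and label files for one sequence. The directory and file names used here (sequence folders, "video/", "audio/", "labels/", etc.) are illustrative assumptions, not the documented structure of the release; please consult the dataset itself for the actual layout.

```python
# Minimal sketch of gathering the assets of one Tragic Talkers sequence.
# NOTE: the folder/file naming below is an illustrative assumption,
# not the official layout of the released data.
from pathlib import Path

DATASET_ROOT = Path("TragicTalkers")  # hypothetical root folder
NUM_VIEWS = 22                        # 11 cameras per AVA Rig, 2 rigs
NUM_ARRAYS = 2                        # one 16-element microphone array per rig


def list_sequence_assets(sequence_name: str) -> dict:
    """Collect per-view videos, array audio, and labels for one sequence."""
    seq_dir = DATASET_ROOT / sequence_name
    return {
        "videos": sorted((seq_dir / "video").glob("cam*.mp4")),    # up to 22 views
        "audio": sorted((seq_dir / "audio").glob("array*.wav")),   # 2 x 16-channel arrays
        "voice_activity": seq_dir / "labels" / "voice_activity.csv",
        "face_boxes": sorted((seq_dir / "labels").glob("faces_cam*.json")),
        "mouth_track_3d": seq_dir / "labels" / "mouth_3d.csv",
        "transcription": seq_dir / "labels" / "transcript.txt",
    }


if __name__ == "__main__":
    assets = list_sequence_assets("monologue_01")  # hypothetical sequence name
    print(f"{len(assets['videos'])} camera views, {len(assets['audio'])} array recordings")
```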



Paper

This is the author’s version of the paper. It is posted here for your personal use. This paper is published under a Creative Commons Attribution (CC-BY) license. The definitive version was published in the ACM Digital Library, https://doi.org/10.1145/3565516.3565522



Tragic Talkers paper






License

The dataset is free for research use only.

This agreement must be confirmed by a senior representative of your organisation. To access and use this data you agree to the following conditions:

The copyright of the TragicTalkers dataset is owned by the Centre for Vision, Speech and Signal Processing, University of Surrey, UK. The data should not be redistributed. Permission is hereby granted to use the TragicTalkers dataset for academic purposes only, provided that it is referenced in publications related to its use as follows:

D. Berghi, M. Volino and P. J. B. Jackson, "Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research," European Conference on Visual Media Production (CVMP), 2022, doi: 10.1145/3565516.3565522.

    @inproceedings{Berghi:2022:TragicTalkers,
        AUTHOR    = "Berghi, Davide and Volino, Marco and Jackson, Philip J. B.",
        TITLE     = "Tragic {T}alkers: A {S}hakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research",
        BOOKTITLE = "European Conference on Visual Media Production (CVMP)",
        PUBLISHER = "Association for Computing Machinery",
        YEAR      = "2022",
        DOI       = "10.1145/3565516.3565522"
    }
To request access to the TragicTalkers dataset, or for other queries, please contact: davide.berghi@surrey.ac.uk (or davide.berghi@gmail.com)


Acknowledgments

This work is supported by InnovateUK (105168) ‘Polymersive: Immersive video production tools for studio and live events’ and a PhD studentship from the Doctoral College of the University of Surrey. Thanks to actors Phoebe Salem and Mason Stickland for their time and availability; to Mohd Azri Mohd Izhar, Hansung Kim, and Charles Malleson for their contribution to the audio-visual recordings; to Umar Marikkar for developing utility scripts that support easy access to the data, e.g., data loaders and audio-visual feature extractors; and to Alexander Todd for helping with the laser cutting of the AVA Rigs.