Workshop Proceedings

Proceedings on the ACM Digital Library (free access)

Program (Fri. 26th Oct.)

09:00 - 09:10 Welcome and Opening Remarks

09:10 - 09:55 Invited Keynote 1 - Multimodal Fusion Strategies: Human vs. Machine

09:55 - 10:10 An Audio-Visual Method for Room Boundary Estimation and Material Recognition

10:10 - 10:25 A Deep Learning-based Stress Detection Algorithm with Speech Signal

10:25 - 10:30 (Demo) Listener-Adaptive Immersive Audio with Soundbars

10:30 - 11:00 Coffee Break with Demo

11:00 - 11:45 Invited Keynote 2 - Spatial Audio on the Web - Create, Compress, and Render

11:45 - 12:00 Generation Method for Immersive Bullet-Time Video Using an Omnidirectional Camera in VR Platform

12:00 - 12:15 Audio-Visual Attention Networks for Emotion Recognition

12:15 - 12:30 Towards Realistic Immersive Audiovisual Simulations for Hearing Research

12:30 - 12:35 Closing Remarks

Invited Speakers

Prof. Hanseok Ko (Korea University, co-Chair of ICASSP 2018, South Korea)

Title: Multimodal Fusion Strategies: Human vs. Machine

Abstract: A two-hour movie, or a short clip taken from it, is intended to capture and present a meaningful (or significant) story to be recognized and understood by a human audience. What if we substitute the human audience with an intelligent machine or robot capable of capturing and processing the semantic information carried by the audio and video cues in the video? Using both auditory and visual means, the human brain processes the audio (sound, speech) and video (background scene, moving objects, written characters) modalities to extract spatial and temporal semantic information that is contextually complementary and robust. Smart machines equipped with audiovisual multisensors (e.g., CCTV with cameras and microphones) should be capable of achieving the same task. An appropriate fusion strategy combining the audio and visual information is a key component in developing such artificial general intelligence (AGI) systems. This talk reviews the challenges of current video analytics schemes and explores various sensor fusion techniques for combining audio-visual information cues in video content analytics.
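
As a rough, self-contained point of reference for the fusion strategies surveyed in the talk, the sketch below shows one common approach, weighted late fusion, where per-class scores from an audio classifier and a video classifier are combined into a joint decision. The class labels, scores, and weight are illustrative assumptions, not material from the talk.

```typescript
// Weighted late fusion: combine per-class scores from an audio model
// and a video model into a single fused score per class.
type Scores = Record<string, number>;

function lateFusion(audio: Scores, video: Scores, audioWeight = 0.4): Scores {
  const fused: Scores = {};
  for (const label of Object.keys(audio)) {
    // Linear combination of the two modality scores for each class.
    fused[label] = audioWeight * audio[label] + (1 - audioWeight) * (video[label] ?? 0);
  }
  return fused;
}

// Example: the modalities disagree; the fused decision follows the
// more strongly weighted, more confident visual cue.
const audioScores = { scream: 0.7, applause: 0.3 };
const videoScores = { scream: 0.2, applause: 0.8 };
console.log(lateFusion(audioScores, videoScores)); // { scream: 0.4, applause: 0.6 }
```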

Bio: Hanseok Ko is Professor of Electrical and Computer Engineering and Director of the Machine Learning Institute at Korea University. He received a B.S. degree from Carnegie Mellon University in 1982, an M.S. degree from Johns Hopkins University, and a Ph.D. in ECE from the Catholic University of America, in 1986 and 1992 respectively. He joined the ECE faculty of Korea University in 1995. He was a visiting professor at the Center for Language and Speech Processing, JHU, in 2001 and at the CS Department, University of Maryland, in 2009. He has been credited as the main developer of the core audio/speech interface for Hyundai-Kia Motors. He served as Director of the STW-KU Intelligent Signal Processing Research Center, sponsored by Samsung from 2008 to 2013, engaging in research on CCTV multimodal technologies addressing image and video analytics. He served as General Organizing Chair of IEEE AVSS 2014, Program Chair of IEEE Multisensor Fusion and Integration in 2008 and 2017, and co-General Organizing Chair of IEEE ICASSP 2018 in Calgary. He was a founding member of the JCN journal and Editor of the SJW and E-Bridge journals. He is currently serving as Guest Editor for a special issue of the Sensors journal on multisensor fusion strategies. He was awarded the Research Excellence Award by Maeil Business in 2006. He is a Fellow of the IET, with research interests in audio-video signal processing and machine learning for video analytics and human-machine interfaces.

Dr. Jan Skoglund (Chrome Media, Google, USA)

Title: Spatial Audio on the Web - Create, Compress, and Render

Abstract: The recent surge of VR and AR has spawned an interest in spatial audio beyond its traditional delivery over loudspeakers in, e.g., home theater environments, toward headphone delivery on, e.g., mobile devices. In this talk we'll discuss a web-based approach to spatial audio, covering the creation of real-time spatial audio directly in the browser, data compression of the audio for immersive media, and efficient binaural rendering.
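
As a rough illustration of the in-browser rendering piece, the sketch below uses the standard Web Audio API to binauralize a single source with HRTF panning. The oscillator source and the source position are placeholders, not details from the talk.

```typescript
// Minimal binaural rendering in the browser with the Web Audio API.
const ctx = new AudioContext();

// Any AudioNode can act as the source; an oscillator stands in here
// for decoded immersive-audio content.
const source = ctx.createOscillator();
source.frequency.value = 440;

// A PannerNode with the HRTF model performs the binaural rendering
// for headphone playback.
const panner = ctx.createPanner();
panner.panningModel = "HRTF";
panner.distanceModel = "inverse";

// Place the source one meter to the listener's left.
panner.positionX.value = -1;
panner.positionY.value = 0;
panner.positionZ.value = 0;

source.connect(panner).connect(ctx.destination);
source.start();
```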

Bio: Jan Skoglund received his Ph.D. degree from Chalmers University of Technology, Sweden. From 1999 to 2000, he worked on low bit rate speech coding at AT&T Labs-Research, Florham Park, NJ. He was with Global IP Solutions (GIPS), San Francisco, CA, from 2000 to 2011, working on speech and audio processing tailored for packet-switched networks. GIPS' audio and video technology was deployed by, e.g., IBM, Google, Yahoo, WebEx, Skype, and Samsung. Since the acquisition of GIPS in 2011, he has been part of Chrome at Google, LLC. He leads a team in San Francisco, CA, developing speech and audio signal processing components for capture, real-time communication, storage, and rendering.