We introduce the first approach to solve the challenging problem of unsupervised 4D visual scene understanding for complex dynamic scenes with multiple interacting people from multi-view video. Our approach simultaneously estimates a detailed model that includes a per-pixel semantically and temporally coherent reconstruction, together with instance-level segmentation exploiting photo-consistency, semantic and motion information. We further leverage recent advances in 3D pose estimation to constrain the joint semantic instance segmentation and 4D temporally coherent reconstruction. This enables per-person semantic instance segmentation of multiple interacting people in complex dynamic scenes. Extensive evaluation of the joint visual scene understanding framework against state-of-the-art methods on challenging indoor and outdoor sequences demonstrates a significant (≈40%) improvement in semantic segmentation, reconstruction and scene flow accuracy.


U4D: Unsupervised 4D Dynamic Scene Understanding
Armin Mustafa, Chris Russell and Adrian Hilton
ICCV 2019


Data used in this work can be found in the CVSSP 3D Data Repository.


This research was supported by the Royal Academy of Engineering Research Fellowship RF-201718-17177, and the European Commission and EPSRC Platform Grant on Audio-Visual Media Research EP/P022529.