BlockGAN

Learning 3D Object-aware Scene Representations from Unlabelled Images

Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, Niloy Mitra

Neural Information Processing Systems (NeurIPS) 2020

ABSTRACT

We present BlockGAN, an image generative model that learns object-aware 3D scene representations directly from unlabelled 2D images. Current work on scene representation learning either ignores scene background or treats the whole scene as one object. Meanwhile, work that considers scene compositionality treats scene objects only as image patches or 2D layers with alpha maps. Inspired by the computer graphics pipeline, we design BlockGAN to learn to first generate 3D features of background and foreground objects, then combine them into 3D features for the whole scene, and finally render them into realistic images. This allows BlockGAN to reason over occlusion and interactions between objects’ appearance, such as shadows and lighting, and provides control over each object’s 3D pose and identity, while maintaining image realism. BlockGAN is trained end-to-end, using only unlabelled single images, without the need for 3D geometry, pose labels, object masks, or multiple views of the same scene. Our experiments show that using explicit 3D features to represent objects allows BlockGAN to learn disentangled representations both in terms of objects (foreground and background) and their properties (pose and identity).

METHOD OVERVIEW

Instead of learning 2D layers of objects and combining them with alpha compositing, BlockGAN learns to generate 3D object features and to combine them into deep 3D scene features that are projected and rendered as 2D images. This process closely resembles the traditional computer graphics pipeline, where scenes are modelled in 3D, enabling reasoning over occlusion and interactions between objects’ appearance, such as shadows or highlights. We can also add new objects to the generated image by introducing more 3D object features into the 3D scene features, even when BlockGAN was trained with scenes containing fewer objects.

DIAGRAM_3D_cropped.gif
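To make the pipeline concrete, below is a minimal PyTorch sketch: per-object noise vectors are decoded into 3D feature volumes, posed in a shared scene coordinate frame, combined into scene features, and projected and rendered to a 2D image. All module names, layer choices and sizes (z_dim, ch, grid) are illustrative assumptions rather than the paper's exact architecture; an element-wise maximum stands in for the scene composer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockGANSketch(nn.Module):
    """Minimal sketch of the generator pipeline; names and sizes are
    illustrative assumptions, not the authors' exact architecture."""

    def __init__(self, z_dim=90, ch=64, grid=16):
        super().__init__()
        self.ch, self.grid = ch, grid
        # Per-object generators: noise vector -> 3D feature volume.
        self.g_bg = nn.Linear(z_dim, ch * grid ** 3)
        self.g_fg = nn.Linear(z_dim, ch * grid ** 3)
        # Learned projection: fold the depth axis into channels, then 1x1 conv.
        self.project = nn.Conv2d(ch * grid, ch, kernel_size=1)
        # 2D rendering network producing an RGB image.
        self.render = nn.Sequential(
            nn.Upsample(scale_factor=4, mode='bilinear'),
            nn.Conv2d(ch, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    @staticmethod
    def pose(vol, theta):
        # Resample a feature volume under an (n, 3, 4) affine pose matrix.
        grid = F.affine_grid(theta, list(vol.shape), align_corners=False)
        return F.grid_sample(vol, grid, align_corners=False)

    def forward(self, z_bg, z_fg, theta_bg, theta_fg):
        n, c, g = z_bg.shape[0], self.ch, self.grid
        vol_bg = self.pose(self.g_bg(z_bg).view(n, c, g, g, g), theta_bg)
        vol_fg = self.pose(self.g_fg(z_fg).view(n, c, g, g, g), theta_fg)
        scene = torch.maximum(vol_bg, vol_fg)   # combine into scene features
        flat = scene.flatten(1, 2)              # (n, c*depth, h, w)
        return self.render(self.project(flat))  # project and render to RGB
```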

During training, we randomly sample both the noise vectors z and poses θ. At test time, each object’s pose can be manipulated using 3D transformations applied directly to the object’s deep 3D features. BlockGAN is trained end-to-end with a GAN loss on unlabelled 2D images alone, without the need for any labels, such as poses, 3D shapes, multi-view inputs, masks, or geometry priors like shape templates, symmetry or smoothness terms.

DIAGRAM_testTrainCropped.gif
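As a sketch of this test-time manipulation, a pose edit amounts to resampling an object's deep 3D feature volume under a rigid transformation, using the same affine resampling as in the pipeline sketch above. The rotate_object_features helper and its axis convention are assumptions for illustration:

```python
import math
import torch
import torch.nn.functional as F

def rotate_object_features(vol, angle_deg):
    """Rotate a 3D feature volume of shape (n, c, d, h, w) about the up axis.

    Sketch only: the choice of the vertical axis and the resampling
    settings are assumptions, not the paper's exact implementation.
    """
    a = math.radians(angle_deg)
    # 3x4 affine matrix for a rotation about the vertical (y) axis.
    rot = torch.tensor([[ math.cos(a), 0.0, math.sin(a), 0.0],
                        [ 0.0,         1.0, 0.0,         0.0],
                        [-math.sin(a), 0.0, math.cos(a), 0.0]])
    theta = rot.repeat(vol.shape[0], 1, 1)  # one pose per batch element
    grid = F.affine_grid(theta, list(vol.shape), align_corners=False)
    return F.grid_sample(vol, grid, align_corners=False)

# Sweep an object's azimuth at test time without retraining:
# for angle in range(0, 360, 10):
#     vol_fg_rotated = rotate_object_features(vol_fg, angle)
```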

RESULTS

BlockGAN learns to disentangle different objects within a scene: foreground from background, and multiple foreground objects from each other, despite being trained only with unlabelled images. This enables smooth manipulation of each object’s pose θ and identity z. More importantly, since BlockGAN combines deep object features into scene features, changes in an object’s properties also influence the background, e.g., an object’s shadows and highlights adapt to the object’s movement.
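As a concrete illustration, an object's identity can be changed smoothly by interpolating its noise vector while its pose θ is held fixed. The z_dim value and the generator interface below are hypothetical stand-ins for a trained model:

```python
import torch

z_dim = 90  # illustrative noise dimensionality (assumption)
z_a, z_b = torch.randn(1, z_dim), torch.randn(1, z_dim)

# Blend two identities while holding the object's pose theta fixed.
for t in torch.linspace(0.0, 1.0, steps=8):
    z_fg = (1 - t) * z_a + t * z_b
    # img = generator(z_bg, z_fg, theta_bg, theta_fg)  # hypothetical interface
```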

Synthetic Two Cars

Syn-Car_Two_rotation.gif
Syn-Car-Two_transX.gif
Syn-Car-Two_transZ.gif
Syn-Car-Two_Rotation-z0.gif
Syn-Car-Two_Rotation-z1.gif

Synthetic Chairs

Syn-Chair-rotation.gif
Syn-Chair-transZ.gif
Syn-Chair-transX.gif
Syn-Chair-rotation-Z0.gif
Syn-Chair-rotation-Z1.gif

Real Cars

Car_transX.gif
Car_z0.gif

Since objects are disentangled in BlockGAN’s scene representation, we can manipulate them separately. Here we apply spatial manipulations that were not part of the similarity transformation used during training, such as horizontal stretching, or slicing and combining different foreground objects.

Car_StretchX.gif
Car_StretchY.gif
Car_slice.gif
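A hedged sketch of these out-of-distribution edits: both reduce to simple operations on an object's feature volume before scene composition. The helper names are hypothetical; the stretch reuses affine resampling with a non-uniform scale, and the slice concatenates halves of two objects' volumes along the width axis.

```python
import torch
import torch.nn.functional as F

def stretch_object_features(vol, sx=1.5):
    # Horizontal stretch by factor sx: sample the (n, c, d, h, w) volume on
    # a grid whose x axis is compressed by 1/sx, widening the object.
    theta = torch.tensor([[1.0 / sx, 0.0, 0.0, 0.0],
                          [0.0,      1.0, 0.0, 0.0],
                          [0.0,      0.0, 1.0, 0.0]]).repeat(vol.shape[0], 1, 1)
    grid = F.affine_grid(theta, list(vol.shape), align_corners=False)
    return F.grid_sample(vol, grid, align_corners=False)

def slice_and_combine(vol_a, vol_b):
    # Hybrid object: left half of one object's 3D features, right half of
    # another's, concatenated along the width axis.
    w = vol_a.shape[-1]
    return torch.cat([vol_a[..., : w // 2], vol_b[..., w // 2:]], dim=-1)
```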

The 3D object features learnt by BlockGAN can also be reused to add more objects to the scene at test time. Here we use BlockGAN trained on datasets with only one background and one foreground object, and show that more foreground objects of the same category can be added to the same scene to create novel scenes with multiple foreground objects. This shows that BlockGAN has learnt 3D object representations that can be reused and manipulated intuitively, instead of merely memorising training images.

Car_add.gif
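A minimal sketch of this composition, assuming the scene composer is an element-wise maximum over feature volumes; pose_features, g_fg, project and render in the usage comment are hypothetical stand-ins:

```python
import torch

def compose_scene(vol_bg, fg_vols):
    # Fold any number of posed foreground feature volumes into the
    # background features, even if training only ever saw one foreground
    # object. Element-wise maximum is assumed as the composition operator.
    scene = vol_bg
    for vol in fg_vols:
        scene = torch.maximum(scene, vol)
    return scene

# Hypothetical usage: three cars, each with its own identity and pose.
# fg_vols = [pose_features(g_fg(z_i), theta_i) for z_i, theta_i in samples]
# image = render(project(compose_scene(vol_bg, fg_vols)))
```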