HoloGAN: Unsupervised learning of 3D representations from natural images

Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, Yong-Liang Yang


We propose a novel generative adversarial network (GAN) for the task of unsupervised learning of 3D representations from natural images. Most generative models rely on 2D kernels to generate images and make few assumptions about the 3D world. These models therefore tend to create blurry images or artefacts in tasks that require a strong 3D understanding, such as novel-view synthesis. HoloGAN instead learns a 3Drepresentation of the world, and to render this representation in a realistic manner. Unlike other GANs, HoloGAN provides explicit control over the pose of generated objects through rigid-body transformations of the learnt 3D features. Our experiments show that using explicit 3D features enables HoloGAN to disentangle 3D pose and identity, which is further decomposed into shape and appearance, while still being able to generate images with similar or higher visual quality than other generative models. HoloGAN can be trained end-to-end from unlabelled 2D images only. Particularly, we do not require pose labels, 3D shapes, or multiple views of the same objects. This shows that HoloGAN is the first generative model that learns 3D representations from natural images in an entirely unsupervised manner.



To learn 3D representations from 2D images without labels, HoloGAN extends traditional unconditional GANs by introducing a strong inductive bias about the 3D world into the generator network. Specifically, HoloGAN generates images by learning a 3D representation of the world and to render it realistically such that it fools the discriminator. View manipulation therefore can be achieved by directly applying 3D rigid-body transformations to the learnt 3D features. In other words, the images created by the generator are a view-dependent mapping from a learnt 3D representation to the 2D image space. This is different from other GANs which learn to map a noise vector z directly to 2D features to generate images.


HoloGAN further decomposes identity into shapes (controlled by 3D features) and apperance (controlled by 2D features). We sample two latent codes, z1 and z2, and feed them through HoloGAN. This shows that by using 3D convolutions to learn 3D representations and 2D convolutions to learn shading, HoloGAN learns to separate shape from appearance directly from unlabelled images, allowing separate manipulation of these factors.



Separating pose and identity


Separating shapes and appearance