Building Facades to Normal Maps: Adversarial Learning from Single View Images

Project page for our paper "Building Facades to Normal Maps: Adversarial Learning from Single View Images", accepted at CRV 2021.


Authors: Mukul Khanna*, Tanu Sharma*, Ayyappa Swamy Thatavarthy, K. Madhava Krishna


Abstract

Surface normal estimation is an essential component of several computer and robot vision pipelines. While this problem has been extensively studied, most approaches are geared towards indoor scenes and often rely on multiple modalities (depth, multiple views) for accurate estimation of normal maps. Outdoor scenes pose a greater challenge as they exhibit significant lighting variation, often contain occluders, and structures like building facades are often riddled with numerous windows and protrusions. Conventional supervised learning schemes excel in indoor scenes, but do not exhibit competitive performance when trained and deployed in outdoor environments. Furthermore, they are not geared towards real-time inference. To tackle these challenges, we present an adversarial learning scheme that regularizes the output normal maps from a neural network to appear more realistic, by using a small number of precisely annotated examples. Our method presents a lightweight and simpler architecture, while improving performance by at least 2x to 4x across most metrics.


Dataset

Custom Synthia dataset with plane instance annotations and normal maps.

Overview

The Synthia dataset includes RGB images, depth maps, camera poses, and semantic segmentation maps for sequences of synthetic city scenes. Because Synthia does not provide plane instance segmentation or normal maps, we manually annotated the plane instances for a subset of the dataset (Synthia-Summer-Seq-04). For each annotated plane, we randomly choose three pixels on the plane and convert them to 3D points in the camera frame using the corresponding depth map and camera matrix. We then use these three 3D points to obtain the normal and depth values for that plane.
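The plane-fitting step described above can be sketched as follows. This is a minimal illustration, not the authors' annotation code: the function names (`backproject`, `plane_from_points`, `plane_depth`) are hypothetical, the camera is assumed to be a pinhole model with a 3x3 intrinsic matrix `K`, the depth map is assumed to store z-depth along the optical axis, and the choice to orient the normal towards the camera is also an assumption.

```python
import numpy as np


def backproject(pixels, depth, K):
    """Lift (u, v) pixels to 3D camera-frame points (assumes z-depth, pinhole K)."""
    pixels = np.asarray(pixels, dtype=float)
    u, v = pixels[:, 0], pixels[:, 1]
    # Depth maps are indexed as [row, column] = [v, u].
    z = depth[v.astype(int), u.astype(int)]
    # Rays through each pixel, scaled so that their z component is 1.
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)])
    return (rays * z).T  # shape (N, 3)


def plane_from_points(p0, p1, p2):
    """Return unit normal n and offset d of the plane n . X = d through three points."""
    n = np.cross(p1 - p0, p2 - p0)
    length = np.linalg.norm(n)
    if length < 1e-9:
        raise ValueError("the three points are (nearly) collinear")
    n = n / length
    d = float(n @ p0)
    # Assumed convention: flip the normal so that it points towards the camera.
    if d > 0:
        n, d = -n, -d
    return n, d


def plane_depth(pixels, n, d, K):
    """Z-depth of the plane n . X = d along the rays through the given pixels."""
    pixels = np.asarray(pixels, dtype=float)
    ones = np.ones(len(pixels))
    rays = np.linalg.inv(K) @ np.stack([pixels[:, 0], pixels[:, 1], ones])
    return d / (n @ rays)
```

In use, one would sample three pixels inside a plane instance mask, call `backproject` on them, pass the resulting points to `plane_from_points`, and then call `plane_depth` on every pixel of the mask to fill in that plane's depth values. Since three random pixels can be nearly collinear, resampling when `plane_from_points` raises an error is a reasonable safeguard.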

The dataset comprises 2020 city scene RGB images with corresponding normal maps and segmentation masks, each of size 1280 x 760, split into 1620 training and 400 testing samples.

Resources

Acknowledgement

We thank Shivaan Sehgal and Sidhant Subramanian for preparing the manual annotations.