VLocNet & VLocNet++ Demo

Multi-Task Neural Networks for Visual Localization, Odometry and Semantic Segmentation

This demo contains models trained on the DeepLoc dataset. Select a model to load from the drop-down box below and click on an image in the carousel to see the live results. To learn more about the network architectures and the approach employed, please see the How Does it Work section below.

Note: No pre-computation is performed for these images. They are treated as a fresh upload with every click. Inference time may vary depending on the current server load and the number of users.

Please Select a Model:

Selected Model

VLocNet++

Input Image
Segmented Image

  • Background

  • Sky

  • Road

  • Sidewalk

  • Grass

  • Vegetation

  • Building

  • Poles

  • Dynamic

  • Other

How does it work?

VLocNet Architecture

Network architecture

VLocNet is a new convolutional neural network architecture for 6-DoF global pose regression and odometry estimation from consecutive monocular images. Our multitask model incorporates hard parameter sharing, making it compact and enabling real-time inference on a consumer-grade GPU. We propose a novel Geometric Consistency Loss function that utilizes auxiliary learning to leverage relative pose information during training, thereby constraining the search space to obtain consistent pose estimates.
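To make the idea concrete, below is a minimal PyTorch-style sketch of a geometric consistency term: the relative motion implied by the global pose predictions of two consecutive frames is constrained to agree with the ground-truth relative pose. The function names, the (w, x, y, z) quaternion convention, and the plain L1 weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def quat_conj(q):
    # conjugate (inverse for unit quaternions) in (w, x, y, z) order
    return torch.cat([q[..., :1], -q[..., 1:]], dim=-1)

def quat_mul(a, b):
    # Hamilton product of quaternions in (w, x, y, z) order
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw*bw - ax*bx - ay*by - az*bz,
        aw*bx + ax*bw + ay*bz - az*by,
        aw*by - ax*bz + ay*bw + az*bx,
        aw*bz + ax*by - ay*bx + az*bw,
    ], dim=-1)

def geometric_consistency_loss(x_t, q_t, x_prev, q_prev, x_rel_gt, q_rel_gt):
    """Penalize disagreement between the relative motion implied by two
    consecutive global pose predictions and the ground-truth relative pose.
    x_*: (B, 3) translations, q_*: (B, 4) unit quaternions."""
    # relative translation implied by the two global predictions;
    # a full SE(3) treatment would rotate this into the frame of I_{t-1}
    x_rel_pred = x_t - x_prev
    # relative rotation implied by the two global predictions
    q_rel_pred = quat_mul(quat_conj(q_prev), q_t)
    loss_x = F.l1_loss(x_rel_pred, x_rel_gt)
    loss_q = F.l1_loss(F.normalize(q_rel_pred, dim=-1), q_rel_gt)
    return loss_x + loss_q
```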

We evaluate our proposed VLocNet on the challenging Microsoft 7-Scenes benchmark and the Cambridge Landmarks dataset, and show that even our single-task model exceeds the performance of state-of-the-art deep architectures for global localization, while achieving competitive performance for visual odometry estimation. Furthermore, we present extensive experimental evaluations utilizing our proposed Geometric Consistency Loss that show the effectiveness of auxiliary learning and demonstrate that our model is the first deep learning technique to be on par with, and in some cases outperform, state-of-the-art SIFT-based approaches.

VLocNet++ Architecture

Network architecture

The VLocNet++ model jointly estimates the global pose, odometry and semantic segmentation from consecutive monocular images. We build upon the recently introduced VLocNet architecture and propose key improvements to encode geometric and structural constraints into the pose regression network: we incorporate information from previous timesteps to accumulate motion-specific information, and we adaptively fuse semantic features based on the activations in the region using our proposed adaptive fusion scheme. We also propose a novel self-supervised aggregation technique based on differential warping that improves the segmentation accuracy and reduces the training time by half. Our architecture consists of four CNN streams: a global pose regression stream, a semantic segmentation stream, and a Siamese-type double stream for visual odometry estimation.
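As a rough illustration of the adaptive fusion idea, the sketch below weights semantic feature maps by per-location activations computed from both streams before merging them into the pose regression stream. The module name, the 1x1-convolution gating, and the residual wiring are assumptions made for this example, not the published design.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Illustrative adaptive fusion: semantic features are gated by
    weights computed from the activations of both streams."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv mapping the concatenated streams to per-location,
        # per-channel fusion weights
        self.weight_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, pose_feat, sem_feat):
        # pose_feat, sem_feat: (B, C, H, W) feature maps
        w = torch.sigmoid(self.weight_conv(torch.cat([pose_feat, sem_feat], dim=1)))
        # inject semantic features where the learned weights are high
        return pose_feat + w * sem_feat
```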

Given a pair of consecutive monocular images \(I_{t-1}, I_{t} \in \mathbb{R}^{\rho}\), the pose regression stream in our network predicts the global pose \(\mathbf{p}_{t}=[\mathbf{x}_{t}, \mathbf{q}_{t}]\) for image \(I_{t}\), where \(\mathbf{x} \in \mathbb{R}^3\) denotes the translation and \(\mathbf{q} \in \mathbb{R}^4\) denotes the rotation in quaternion representation. The semantic stream predicts a pixel-wise segmentation mask \(M_t\) mapping each pixel \(u\) to one of the \(\mathit{C}\) semantic classes, and the odometry stream predicts the relative motion \(\mathbf{p}_{t, t-1}=[\mathbf{x}_{t, t-1}, \mathbf{q}_{t, t-1}]\) between consecutive input frames. \(z_{t-1}\) denotes the feature maps from the previous timestep. Extensive evaluations on the Microsoft 7-Scenes dataset and our newly introduced DeepLoc dataset demonstrate that our approach sets the state of the art, while being more than 60.5% faster and simultaneously performing multiple tasks. For more information on the approach, please refer to the arXiv submission.
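A common way to train such a pose regression head is to balance the translation and rotation terms with learnable weights, and the papers describe a learnable weighting along these lines. The sketch below assumes a log-variance formulation with L1 norms; these are illustrative choices rather than the exact published loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseLoss(nn.Module):
    """Sketch of a global pose loss with learnable weighting between
    the translation and rotation terms (assumed form)."""
    def __init__(self):
        super().__init__()
        self.s_x = nn.Parameter(torch.zeros(()))  # log-variance, translation
        self.s_q = nn.Parameter(torch.zeros(()))  # log-variance, rotation

    def forward(self, x_pred, q_pred, x_gt, q_gt):
        loss_x = F.l1_loss(x_pred, x_gt)
        # normalize the predicted quaternion before comparison
        loss_q = F.l1_loss(F.normalize(q_pred, dim=-1), q_gt)
        return (loss_x * torch.exp(-self.s_x) + self.s_x
                + loss_q * torch.exp(-self.s_q) + self.s_q)
```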

DeepLoc Dataset



We introduce a large-scale urban outdoor localization dataset collected around the university campus using our autonomous robot platform equipped with a ZED stereo camera mounted facing the front of the robot. RGB and depth images were captured at a resolution of 1280 x 720 pixels at 20 Hz. The dataset currently comprises one scene spanning an area of 110 x 130 m, which the robot traverses multiple times with different driving patterns. We plan to add more scenes to this dataset as the project progresses. We use our LiDAR-based SLAM system with sub-centimeter and sub-degree accuracy to compute the pose labels that we provide as ground truth. Poses in our dataset are spaced approximately 0.5 m apart, which is twice as dense as other relocalization datasets.

Furthermore, for each image we provide pixel-wise semantic segmentation annotations for ten categories: Background, Sky, Road, Sidewalk, Grass, Vegetation, Building, Poles & Fences, Dynamic and Void. To the best of our knowledge, this is the first dataset for which images tagged with localization poses and semantic segmentation labels are provided for an entire scene with multiple loops. We divided the dataset into train and test splits such that the train set comprises seven loops with alternating driving styles, amounting to 2737 images, while the test set comprises three loops with a total of 1173 images. Our dataset also contains global GPS/INS data and LiDAR measurements.
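For readers who want to work with the data programmatically, the following is a minimal loading sketch. The directory layout and file names are assumptions made for illustration; please consult the dataset download for its actual format.

```python
from pathlib import Path
import numpy as np
from PIL import Image

# the ten DeepLoc semantic categories, indexed by class ID
CLASSES = ["Background", "Sky", "Road", "Sidewalk", "Grass",
           "Vegetation", "Building", "Poles & Fences", "Dynamic", "Void"]

def load_sample(root, frame_id):
    """Load one frame; file layout below is a hypothetical example."""
    root = Path(root)
    rgb = np.asarray(Image.open(root / "rgb" / f"{frame_id}.png"))      # (720, 1280, 3)
    mask = np.asarray(Image.open(root / "labels" / f"{frame_id}.png"))  # class ID per pixel
    pose = np.loadtxt(root / "poses" / f"{frame_id}.txt")               # [x, y, z, qw, qx, qy, qz]
    return rgb, mask, pose
```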

This dataset can be very challenging for vision-based applications such as global localization, camera relocalization, semantic segmentation, visual odometry and loop closure detection, as it contains substantial lighting and weather changes, repeating structures, and reflective and transparent glass buildings. The 3D model built by our SLAM system can be visualized below.

License Agreement


The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the DeepLoc dataset, please consider citing the papers mentioned in the Publications section below.



DeepLocCross Dataset



We introduce the challenging DeepLocCross dataset, which we captured using our robotic platform equipped with a ZED stereo camera mounted facing the front of the robot. We collected the data around a highly dynamic road segment spanning an area of 158 x 90 m, which contains a tram line as well as multiple pedestrian crossings and road intersections. Similar to the DeepLoc dataset, the DeepLocCross dataset contains RGB-D stereo images captured at 1280 x 720 pixels at a rate of 20 Hz. The ground-truth pose labels are generated using the LiDAR-based SLAM system from here. In addition to the 6-DoF localization poses of the robot, the dataset contains tracked detections of the observable dynamic objects. Each tracked object is annotated with a unique track ID, spatial coordinates, velocity and orientation angle. Furthermore, as the dataset contains multiple pedestrian crossings, we provide labels at each intersection indicating whether it is safe to cross.

This dataset consists of seven training sequences with a total of 2264 images, and three testing sequences with a total of 930 images. The dynamic nature of the environment in which the dataset was captured renders the tasks of localization and visual odometry estimation extremely challenging, due to the varying weather conditions, the presence of shadows, and the motion blur caused by the movement of the robot platform. Furthermore, the presence of multiple dynamic objects often results in partial and full occlusions of the informative regions of the image. Moreover, the presence of repeated structures renders the pose estimation task more challenging. Overall, this dataset covers a wide range of perception-related tasks such as loop closure detection, semantic segmentation, visual odometry estimation, global localization, scene flow estimation and behavior prediction.
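To give a concrete picture of the per-frame annotations described above, here is an illustrative record structure; the field names, types and units are assumptions for this sketch, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TrackedObject:
    """Hypothetical record for one dynamic-object detection in a frame."""
    track_id: int                 # unique ID, stable across frames
    position: Tuple[float, float]  # spatial coordinates in meters
    velocity: Tuple[float, float]  # velocity vector in m/s
    orientation: float            # heading angle in radians

@dataclass
class CrossingLabel:
    """Hypothetical per-intersection crossing-safety annotation."""
    intersection_id: int
    safe_to_cross: bool
```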

License Agreement


The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the DeepLocCross dataset, please consider citing the papers mentioned in the Publications section below.

Qualitative Results

Microsoft 7-Scenes Dataset

VLocNet++ achieves state-of-the-art performance on the Microsoft 7-Scenes dataset. We show qualitative results by visualizing the predicted poses with reference to the ground truth. The predicted poses are shown in red and the ground-truth poses are shown in yellow. Note that we only use the 3D model for visualization; our approach does not require it for localization. Click on the "In Tab" button to load the scene in the embedded viewer and the "New Tab" button to open the viewer in a new window. You can use the mouse pointer to rotate the scene, the mouse wheel to zoom into the scene, shift+click to pan through the scene, or ctrl+click to tilt the scene.

Chess

Fire

Heads

Office

Pumpkin

Red-kitchen

Stairs

DeepLoc Dataset

VLocNet++ achieves state-of-the-art performance on the challenging DeepLoc dataset, which was collected outdoors on the university campus. The predicted poses are shown in red and the ground-truth poses are shown in yellow. Note that we only use the 3D model for visualization; our approach does not require it for localization. Click on the "In Tab" button to load the scene in the embedded viewer and the "New Tab" button to open the viewer in a new window. You can use the mouse pointer to rotate the scene, the mouse wheel to zoom into the scene, shift+click to pan through the scene, or ctrl+click to tilt the scene.

DeepLoc

Videos


Publications

Noha Radwan, Abhinav Valada, Wolfram Burgard
VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry
IEEE Robotics and Automation Letters (RA-L), 3(4):4407-4414, 2018.

(Pdf) (Bibtex)


Abhinav Valada, Noha Radwan, Wolfram Burgard
Deep Auxiliary Learning for Visual Localization and Odometry
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 2018.

(Pdf) (Bibtex)

People