Alignment has long been considered by many to be a pillar of computer vision. Over the past decades we have made great progress with systems that find a set of sparse correspondences between an image and a template model. Using such systems involves a sequence of steps: object detection, then classification, sparse landmark registration, and finally 3D model registration, which gives us the per-pixel correspondences between the image and the object of interest. This not only results in a complex system, but also makes it difficult to distil the information that is encapsulated in the different stages.
In this presentation I will talk about an end-to-end trainable method that not only runs in real time but also directly yields a dense registration of the object at hand. Because of its generic nature, it can easily be applied to a range of objects such as human faces and ears, and even to deformable objects with large pose variability such as human bodies. Finally, I will discuss possible future extensions of this work that aim to build a more robust system using ideas from domain adaptation.