This paper addresses object detection and 6DoF pose estimation from a sequence of RGB images. Our deep learning-based approach is trained solely on synthetic, non-textured 3D CAD models and has no access to images from the target domain. The image sequence is used to obtain a sparse 3D reconstruction of the scene via Structure from Motion. The domain gap is closed by relying on the intuition that geometric edges are the only prominent features that can be extracted from both the 3D models and the sparse reconstructions. Based on this assumption, we develop a domain-invariant data preparation scheme and 3DKeypointNet, a neural network for detecting 3D keypoints in sparse and noisy point clouds. The final pose is estimated with RANSAC and a scale-aware point cloud alignment method. The proposed method is evaluated on the T-LESS dataset and compared to other methods trained on synthetic data. The results indicate the potential of our approach despite the entire pipeline being trained on synthetic data alone.
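The scale-aware alignment mentioned above can be illustrated with the classical Umeyama method, which recovers a similarity transform (scale, rotation, translation) between corresponding point sets; this NumPy sketch is our own illustrative assumption, not the paper's actual implementation, and in practice it would be run on the inlier correspondences selected by RANSAC.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Estimate the similarity transform (s, R, t) with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns scale s, rotation matrix R (3x3), translation t (3,).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Reflection correction keeps R a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given keypoint correspondences between the CAD model and the sparse SfM reconstruction (whose metric scale is unknown), such a solver can be embedded in a RANSAC loop: minimal sets of correspondences are sampled, the transform is estimated, and the hypothesis with the most inliers yields the final 6DoF pose and scale.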