Recent advances in visual object tracking have concentrated mainly on RGB data. With the availability of depth sensors that capture spatial information, trackers that exploit RGBD data could improve tracking performance. This work investigates how state-of-the-art RGB trackers can be adapted to benefit from the additional depth information. Specifically, it proposes a framework consisting of two components: a self-supervised pretraining part and a supervised part. In the self-supervised pretraining component, domain transfer learning serves as the pretext task: the general idea is to learn one modality from the other, predicting depth from RGB and foreground-background segmentation from depth. In the supervised component, we analyze different preprocessing techniques, RGB-pretrained networks for the depth domain, fusion layers, and backbone architectures. Empirical results show that the framework's supervised component outperforms its RGB counterpart when trained on the same images. However, the scarcity of RGBD datasets causes distribution mismatches between training and validation data. Self-supervised pretraining yields a significant improvement over training from scratch by exploiting unlabeled RGBD data, effectively mitigating this data scarcity.
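To make the pretext task concrete, the following is a minimal PyTorch sketch of the RGB-to-depth direction of the domain transfer: a small encoder-decoder regresses the sensor's depth map from the RGB frame, so every unlabeled RGBD pair provides free supervision. The architecture, loss, and hyperparameters here are illustrative placeholders, not the framework's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative cross-modal pretext model: predict a 1-channel depth map
# from a 3-channel RGB image. Layer sizes are placeholders.
class RGBToDepth(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))

model = RGBToDepth()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# One pretraining step on an unlabeled RGBD pair: the depth channel itself
# is the supervision signal, so no human annotation is required.
rgb = torch.rand(8, 3, 128, 128)    # batch of RGB frames (dummy data)
depth = torch.rand(8, 1, 128, 128)  # aligned depth maps from the sensor
loss = loss_fn(model(rgb), depth)
loss.backward()
optimizer.step()
```

The reverse direction described in the abstract (foreground-background segmentation from depth) would follow the same pattern with the input and target modalities swapped and a segmentation loss in place of the regression loss.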