Grounding natural language to 3D scenes is an essential research topic for many upcoming interactive robotic agents and AR/VR applications. In recent years, there have been tremendous breakthroughs in segmenting objects in images from language. However, these methods and datasets are restricted to 2D views, where the 3D extent of an object and its surrounding environment are incompletely modelled. This limitation hinders applications where it is critical to understand the complete 3D context and the physical size of objects, e.g., interacting with objects in indoor scenes. In this dissertation, we explore deep-learning-based methods for text-driven scene understanding on RGB-D data.
First, we introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, which learns a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of the target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expressions directly in 3D.
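To make the fusion idea concrete, the following is a minimal PyTorch sketch, not the actual ScanRefer implementation: per-proposal geometric features are concatenated with a projected sentence embedding and scored, so that the highest-scoring proposal yields the target bounding box. All module names, dimensions, and tensor layouts here are illustrative assumptions.

# Hypothetical sketch of language-to-proposal fusion; not the released ScanRefer code.
import torch
import torch.nn as nn

class FusionScorer(nn.Module):
    def __init__(self, proposal_dim=128, lang_dim=256, hidden_dim=128):
        super().__init__()
        # Project the sentence embedding into the proposal feature space.
        self.lang_proj = nn.Linear(lang_dim, proposal_dim)
        # Score how well each fused proposal-language descriptor matches.
        self.fuse = nn.Sequential(
            nn.Linear(proposal_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, proposal_feats, sentence_emb):
        # proposal_feats: (B, K, proposal_dim) features of K 3D object proposals
        # sentence_emb:   (B, lang_dim) encoded free-form description
        lang = self.lang_proj(sentence_emb).unsqueeze(1)        # (B, 1, D)
        lang = lang.expand(-1, proposal_feats.size(1), -1)      # (B, K, D)
        fused = torch.cat([proposal_feats, lang], dim=-1)       # (B, K, 2D)
        return self.fuse(fused).squeeze(-1)                     # (B, K) matching scores

At inference time, the proposal with the highest score would be selected and its predicted 3D bounding box returned as the localization result.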
Then, we introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is a set of object bounding boxes along with natural language descriptions of the underlying objects. To address the joint 3D object detection and description problem, we propose Scan2Cap, an end-to-end trained method that detects objects in the input scene and describes them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. Our method effectively localizes and describes 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin.
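As an illustration of the attention mechanism described above, here is a hedged sketch of a single decoding step in the spirit of Scan2Cap, not its actual implementation: the decoder attends over features of objects in the local context of the target and predicts the next caption token. All class and parameter names are hypothetical.

# Hypothetical single-step attentive caption decoder; dimensions are illustrative.
import torch
import torch.nn as nn

class AttentiveDecoderStep(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256, vocab_size=4000):
        super().__init__()
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)
        self.gru = nn.GRUCell(feat_dim, hidden_dim)
        self.vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_feats, hidden):
        # context_feats: (B, N, feat_dim) features of objects near the target
        # hidden:        (B, hidden_dim) current decoder state
        h = hidden.unsqueeze(1).expand(-1, context_feats.size(1), -1)
        scores = self.attn(torch.cat([h, context_feats], dim=-1))  # (B, N, 1)
        weights = torch.softmax(scores, dim=1)                     # attention over context
        attended = (weights * context_feats).sum(dim=1)            # (B, feat_dim)
        hidden = self.gru(attended, hidden)                        # update decoder state
        return self.vocab(hidden), hidden                          # next-token logits, new state

Running this step repeatedly, feeding each new hidden state back in, would produce a description token by token while the attention weights shift over the relevant objects in the local context.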
Recent work on dense captioning and visual grounding in 3D has achieved impressive results. Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. Moreover, how to discriminatively describe objects in complex 3D environments has not been fully studied. To address these challenges, we present D3Net, an end-to-end neural speaker-listener architecture that can detect, describe, and discriminate. Our method unifies dense captioning and visual grounding in 3D in a self-critical manner. It outperforms state-of-the-art methods in both tasks on the ScanRefer dataset, surpassing the previous best 3D dense captioning method by a significant margin.
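The self-critical speaker-listener idea can be sketched as follows; this is a conceptual outline under assumed interfaces, not the D3Net implementation. A speaker samples a caption for a detected object, a listener tries to ground that caption back to the scene, and the grounding outcome serves as a reward in a self-critical policy-gradient update. The speaker, listener, and their methods are hypothetical placeholders.

# Conceptual self-critical speaker-listener training step (hypothetical interfaces).
import torch

def self_critical_step(speaker, listener, scene_feats, target_box_idx):
    # Sampled caption (exploration) with per-token log-probabilities.
    sampled_tokens, log_probs = speaker.sample(scene_feats, target_box_idx)
    # Greedily decoded caption serves as the self-critical baseline.
    with torch.no_grad():
        greedy_tokens = speaker.greedy(scene_feats, target_box_idx)

    # Reward: how well the listener grounds each caption back to the correct object.
    reward_sample = listener.grounding_score(scene_feats, sampled_tokens, target_box_idx)
    reward_greedy = listener.grounding_score(scene_feats, greedy_tokens, target_box_idx)

    # Advantage relative to the greedy baseline; REINFORCE-style loss.
    advantage = (reward_sample - reward_greedy).detach()
    loss = -(advantage * log_probs.sum(dim=-1)).mean()
    return loss

The key design choice reflected here is that captions are rewarded for being discriminative: a caption only earns a high reward if the listener can use it to pick out the correct object among the detected candidates.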
Finally, we discuss the limitations of our research and potential future directions.