This study introduces an innovative approach that offers incremental and scalable solutions for constructing open set, instance-level 3D scene representations, leading to an open world understanding of 3D environments. Current methodologies in open vocabulary 3D scene understanding are predominantly non-incremental, requiring pre-constructed 3D scenes, and they rely on learning per-point feature vectors, creating scalability issues for many practical use cases. Moreover, their efficacy in contextualizing and responding to complex queries is considerably limited. The proposed method addresses these limitations by leveraging 2D foundation models to incrementally construct instance-level 3D scene representations. It efficiently tracks and aggregates corresponding instance-level details (such as masks, feature vectors, names, and captions) from 2D foundation models into 3D space. Furthermore, our work introduces fusion schemes for feature vectors that effectively integrate contextual information, significantly enhancing performance on complex queries. Additionally, this study explores methods to effectively utilize large language models for robust automatic annotation and complex spatial reasoning tasks over the constructed open set 3D scene. The proposed method is evaluated on the ScanNet [4, 41] and Replica [44] datasets; both quantitative and qualitative results demonstrate zero-shot generalization capabilities that exceed current state-of-the-art methods in open world 3D scene understanding tasks.
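To make the incremental construction concrete, the following is a minimal sketch of how per-frame 2D instance detections could be tracked and fused into instance-level 3D representations. It assumes IoU-style matching on back-projected mask points and normalized feature averaging; these are illustrative choices, not necessarily the exact scheme used in this work, and the inputs (`point_ids`, CLIP-style `feature` vectors) are hypothetical.

```python
# Illustrative sketch: incrementally fuse per-frame 2D instance features
# into 3D instances. Each detection is (point_ids, feature), where point_ids
# are global 3D point indices covered by the back-projected 2D mask and
# feature is an embedding from a 2D foundation model (assumed inputs).

import numpy as np

class Instance3D:
    """A single open-set 3D instance aggregated across frames."""
    def __init__(self, point_ids, feature):
        self.point_ids = set(point_ids)                    # 3D points seen so far
        self.feature = feature / np.linalg.norm(feature)   # running fused feature
        self.num_obs = 1                                    # observations fused

    def overlap(self, point_ids):
        """IoU proxy between the instance's points and a new observation."""
        point_ids = set(point_ids)
        inter = len(self.point_ids & point_ids)
        union = len(self.point_ids | point_ids)
        return inter / union if union else 0.0

    def fuse(self, point_ids, feature):
        """Merge a new 2D observation: union geometry, average features."""
        self.point_ids |= set(point_ids)
        # Weighted running mean retains context from earlier views.
        fused = self.num_obs * self.feature + feature / np.linalg.norm(feature)
        self.feature = fused / np.linalg.norm(fused)
        self.num_obs += 1

def integrate_frame(instances, frame_detections, iou_thresh=0.25):
    """Match each 2D detection to an existing 3D instance or start a new one."""
    for point_ids, feature in frame_detections:
        best, best_iou = None, 0.0
        for inst in instances:
            iou = inst.overlap(point_ids)
            if iou > best_iou:
                best, best_iou = inst, iou
        if best is not None and best_iou >= iou_thresh:
            best.fuse(point_ids, feature)
        else:
            instances.append(Instance3D(point_ids, feature))
    return instances
```

Because each frame only updates the matched instances, the representation grows with the number of discovered objects rather than the number of 3D points, which is the scalability property the abstract refers to.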