Scene understanding and communication are two fundamental goals for intelligent agents. In this dissertation, we address scene understanding by estimating geometry, semantics, and points of interest from single images using deep learning models. We also demonstrate the potential of learning-based components in a hybrid SLAM system. We then turn to problems at the intersection of vision and language: we generate scene descriptions without paired image-caption training data and enable user-agent interaction in natural language.