Nearly all real-world image understanding problems in computer vision are inherently ambiguous. Predictive systems, however, often do not model this ambiguity and ignore the possibility that a given problem can have more than one valid outcome. This leads to sub-par performance on ambiguous tasks, as the model has to account for all possibilities with a single answer. We identify three typical sources of confusion that make such tasks impossible to solve optimally with one unique prediction.
First, the fact that we are using 2D images means that the input to our models is a non-reversible projection of the 3D real world. Information about sizes, distances and shapes is therefore lost and has to be inferred by the algorithm from cues such as shading, texture or relative size. Sometimes these ambiguities cannot be resolved from the image context alone: objects of interest may be occluded, or too small to provide sufficient detail for solving the task accurately. The second source of ambiguity is the annotations in our datasets. In supervised or semi-supervised learning scenarios, predictive systems are typically trained on labeled data. However, labeling large amounts of data is tedious and expensive. As a result, data is often not fully annotated, and annotations can be wrong or inconsistent when they come from unreliable sources such as non-experts, crowd-sourcing or automatic collection from the internet. Finally, and most importantly, the problem itself can be ambiguous. Predicting the future, for example, is inherently uncertain and in most cases cannot be precisely foreseen.
In this dissertation we describe two principled and general approaches to dealing with ambiguity. First, we elaborate on a method that allows the algorithm to predict multiple answers instead of a single one. This is a pragmatic way of dealing with ambiguity: instead of committing to one exclusive outcome for a given problem, we enable the system to list several possibilities. This approach has already proven useful in several exciting downstream applications. We show that the general framework of multiple hypothesis prediction can be applied to image classification, segmentation, future prediction, conditional image generation, reinforcement learning, object pose estimation and robotic grasp point estimation.
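The core idea behind predicting multiple hypotheses can be sketched as a relaxed winner-takes-all meta-loss: the hypothesis closest to the target receives most of the training signal, while the remaining hypotheses share a small weight so they are not starved of gradient. The sketch below is illustrative only; the function name `oracle_loss`, the squared-error base loss and the relaxation parameter `eps` are assumptions, not the exact formulation used in the dissertation.

```python
import numpy as np

def oracle_loss(hypotheses, target, eps=0.05):
    """Relaxed winner-takes-all loss over M hypotheses (illustrative sketch).

    hypotheses : list of arrays, the M predictions of the model
    target     : array, the single observed ground-truth outcome
    eps        : small weight shared by the non-winning hypotheses

    The best-matching hypothesis gets weight (1 - eps); the others
    split eps equally, so every head keeps receiving some gradient.
    """
    errors = np.array([np.sum((h - target) ** 2) for h in hypotheses])
    m = len(hypotheses)
    if m == 1:
        return float(errors[0])
    weights = np.full(m, eps / (m - 1))
    weights[np.argmin(errors)] = 1.0 - eps
    return float(np.dot(weights, errors))
```

In an ambiguous setting, each hypothesis is free to specialize on one plausible outcome: only the hypothesis nearest the observed target is strongly penalized, so the others are not pulled toward an averaged, blurry compromise answer.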
The second part describes an alternative way to deal with uncertain predictions. Human perception can often provide additional information about a task or application that an intelligent system might not have recognized. Building on the paradigm of human-machine interaction, we show how enabling interaction between the system and a user can improve predictions, using semantic segmentation as an example. We describe a novel guiding mechanism that integrates seamlessly into the system and shows great potential for several further applications beyond the demonstrated tasks.