Supervised and Unsupervised Methods for Vision-Based Object Detection, Counting and 3D Reconstruction

Haeni, Nicolai
2023-11-28
2023-11-28
2023-06
https://hdl.handle.net/11299/258747
University of Minnesota Ph.D. dissertation. June 2023. Major: Computer Science. Advisor: Volkan Isler. 1 computer file (PDF); xii, 176 pages.

The introduction of AlexNet in 2012 marked a significant turning point in computer vision and artificial intelligence research. Before AlexNet, neural networks were deemed impractical due to the computational and memory requirements needed to train them. However, AlexNet showed that neural networks are a practical solution for image classification tasks by outperforming the prior state-of-the-art (SOTA) on the ImageNet challenge by 10.8%. This breakthrough paved the way for developing more complex neural networks, leading to remarkable advances in computer vision, natural language processing, and speech recognition. A decade later, foundational models, like Meta's Segment Anything Model (SAM) and DINO v2, show remarkable generalization capabilities in feature extraction and instance segmentation. As these models are trained on billions of labeled images, they can significantly improve other applications beyond their original purpose through the power of transfer learning. However, to achieve top-of-the-line performance in specialized domains, such as precision agriculture or robotics, task-specific datasets are still necessary for fine-tuning these models. Unfortunately, acquiring sufficient amounts of data for fine-tuning is often prohibitively expensive, limiting the application of state-of-the-art models and impeding scientific research. For example, images for fruit detection often contain hundreds of small fruits per image, requiring roughly 30 minutes of labeling effort per image. Without established benchmark datasets, researchers often rely on small sets of highly correlated samples for training and testing. In this thesis, we approach the problem of algorithm design in limited data settings from two angles:

1. For well-established problems such as object detection, counting, and segmentation, we analyze the performance of deep learning-based methods in the context of fruit detection and counting. We first compare them to traditional methods and quantify performance gains. We also release a new large benchmark dataset to the community.
2. In the second part, we propose new algorithms for two 3D computer vision problems: novel view synthesis (NVS) and single-view 3D reconstruction. We propose a new algorithm for novel view synthesis that uses only two images per object for training but generalizes to arbitrary views. For single-view 3D reconstruction, we propose a method that leverages equivariant feature extraction for joint object pose estimation and 3D shape reconstruction.

Our solutions require less labeled data by leveraging techniques such as self-supervised learning, cyclic consistency, and equivariant neural networks, making them more practical for applications in which data collection is difficult or expensive.

In the first part of this thesis, we focus on the challenge of close-up fruit inspection and automated data collection in outdoor fruit orchard environments. More specifically, we investigate the problem of visual servoing, where the objective is to accurately position a sensor mounted on a robotic manipulator with respect to a target fruit. We propose a learning-free approach that employs image-based visual servoing techniques and traditional feature descriptors. Our method leverages computationally inexpensive feature tracking and demonstrates that the resulting system can converge effectively even under significant environmental influences, such as strong wind. Our work in this area showcases the effectiveness of learning-free approaches in overcoming data scarcity in real-world environments.

The second part of this thesis is dedicated to developing and analyzing fruit detection and counting algorithms for fruit yield mapping.
In this part, we propose learning-based fruit detection and counting methods and analyze their performance improvements compared to learning-free baselines. We individually test the detection and counting modules and perform an extensive evaluation as part of the overall yield mapping pipeline. Our analysis reveals that learning-based methods increase performance on counting clustered fruits but struggle on the fruit detection task compared to a model created through human feedback. The proposed yield estimation method, which combines the best-performing fruit detection and counting methods into a single pipeline, achieves 98% accuracy relative to the ground-truth yield, outperforming all existing state-of-the-art baselines in the literature. Our work in this part of the thesis highlights the benefits and challenges of supervised learning-based methods in limited data contexts such as fruit yield mapping. We also release a new benchmark dataset to the community to facilitate further research in this field. It is gratifying to note that the dataset has been downloaded over 40,000 times, and our algorithms have already been incorporated into a commercialization effort.

In the third part of this dissertation, we investigate two 3D computer vision problems: novel view synthesis and single-view 3D reconstruction. For novel view synthesis, we present a new category-specific model with 50× better data efficiency for training without compromising performance. We introduce Continuous Object Representation Networks (CORN), a conditional architecture that captures the geometry and appearance of an input image, mapping it to a consistent 3D scene representation. CORN can be trained with only two source images per object, leveraging a neural renderer. CORN does not require ground truth 3D models or target view supervision and instead uses cyclic consistency between the two input views for supervision.
Nevertheless, it performs remarkably well on complex tasks such as novel view synthesis and single-view 3D reconstruction, matching the state-of-the-art approaches that rely on direct supervision.

In the last problem, we tackle the challenge of 3D reconstruction with limited real-world data. Existing methods rely on pre-canonicalized datasets, which hinders training on diverse, unaligned shapes. In contrast, we propose a technique that simultaneously learns pose estimation and 3D reconstruction from unaligned shapes using shape canonicalization. Our approach outperforms existing methods on synthetic and real-world data without requiring test time optimization or ground truth camera poses. Our approach naturally generalizes to real-world data by training solely on synthetic shapes, reducing the need for large-scale datasets with 3D ground truth.

In summary, this thesis proposes novel approaches to address the challenges of limited task-specific data availability in various application domains, including precision agriculture, novel view synthesis, and 3D reconstruction. The proposed algorithms demonstrate how data collection and algorithm design can be leveraged to overcome data scarcity in real-world environments. Moreover, the thesis contributes a new benchmark dataset to the community, facilitating further research in the field. The findings of this work showcase the potential of AI technology to solve real-world problems even with limited data availability while maintaining or surpassing the current state-of-the-art.

en
Thesis or Dissertation