Jafarian, Yasamin
2024-01-05
2024-01-05
2023-09
https://hdl.handle.net/11299/259731
University of Minnesota Ph.D. dissertation. September 2023. Major: Computer Science. Advisor: Hyun Soo Park. 1 computer file (PDF); xi, 83 pages.

One of the ongoing challenges in computer vision and graphics is to model realistic dressed humans in 3D. These models enable various applications in interactive entertainment, immersive technologies, and online shopping. To obtain such realistic dressed human models, most existing approaches rely heavily on sophisticated devices such as dense arrays of cameras (30 to 500 cameras), which measure the appearance and 3D geometry of humans. These systems, however, are expensive and require heavy onsite instrumentation, which fundamentally limits their everyday use. In my thesis, I argue that it is possible to obtain the 3D geometry of clothed humans using affordable and widely accessible devices such as a cell phone camera. Capturing 3D geometry, however, inherently requires multi-view images; consequently, reconstructing 3D humans from a single camera alone is fundamentally impossible. To overcome this limitation, recent work has explored machine learning approaches in which a model learns to infer human 3D geometry from a single image. Learning such a model requires ground truth data comprising pairs of 2D images and corresponding 3D human models (e.g., meshes) that span diverse person identities, poses, appearances, and motions. Because of the hardware requirement (i.e., a multi-camera system), collecting such data from a large population at the scale of millions of people is challenging, and the resulting lack of data degrades the performance of 3D human reconstruction on real-world imagery. In this thesis, I develop geometry-aware self-supervised approaches that enable learning of human 3D geometry without 3D ground truth data.

Self-supervised depth via 3D geometric consistency: Surface normals and depth are strongly correlated, i.e., the surface normals are determined by the first-order spatial derivatives of the depth. We use this geometric relation as self-supervision on social media dance videos, where millions of videos spanning diverse identities, poses, appearances, and motions are readily available. Furthermore, we introduce a novel self-supervised learning approach that leverages local transformations. These transformations warp the predicted local geometry of a person from one image to another image captured at a different time. By incorporating temporal consistency and surface normal cues, we achieve high-fidelity estimation of the 3D geometry of dressed humans from single-view images and videos.

Self-supervised dense correspondences via local isometry: To effectively retexture a garment in an image, a dense correspondence map between the garment image and the texture space must account not only for the transformation of the garment caused by body movement and clothing fit but also for the intricate 3D surface geometry of the garment. To retexture a garment in a physically plausible way, we employ a geometry-aware technique that estimates a correspondence map between the garment region in an image and its corresponding texture space. This dense correspondence map is designed to maintain isometry with respect to the underlying 3D surface by using the 3D surface normals predicted from the image. With this approach, we capture the intrinsic geometry of the garment in a self-supervised manner, eliminating the need for ground truth annotations of the correspondences. Moreover, our method can be readily extended to predict temporally coherent dense correspondence maps, ensuring consistency across frames by correlating per-frame image features.

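The depth and surface normal relation described under geometric consistency can be illustrated with a minimal sketch. It is not the dissertation's implementation: the orthographic camera model, the sign convention for the normals, and the function names `normals_from_depth` and `normal_consistency_loss` are all assumptions made for illustration.

```python
import numpy as np


def normals_from_depth(depth, pixel_size=1.0):
    """Estimate unit surface normals from a depth map by finite differences.

    Assumption (illustrative only): an orthographic camera, so the surface
    is z = depth[y, x] and an unnormalized normal is (-dz/dx, -dz/dy, 1).
    """
    depth = np.asarray(depth, dtype=float)
    # np.gradient returns the derivative along axis 0 (rows, y) first,
    # then along axis 1 (columns, x).
    dz_dy, dz_dx = np.gradient(depth, pixel_size)
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals


def normal_consistency_loss(pred_depth, pred_normals):
    """Mean (1 - cosine similarity) between depth-derived and predicted normals.

    Hypothetical loss: it is zero when the normals implied by the depth map
    agree everywhere with the separately predicted unit normals.
    """
    derived = normals_from_depth(pred_depth)
    cosine = np.sum(derived * np.asarray(pred_normals, dtype=float), axis=-1)
    return float(np.mean(1.0 - cosine))


if __name__ == "__main__":
    # A tilted plane has the same normal at every pixel.
    ys, xs = np.mgrid[0:8, 0:10]
    plane = 0.5 * xs - 0.25 * ys + 3.0
    n = normals_from_depth(plane)
    print(n[0, 0], normal_consistency_loss(plane, n))
```

In a learning setting, a term of this kind would let a predicted normal map supervise a predicted depth map (or vice versa) without 3D ground truth; a perspective camera would additionally require the intrinsics when differentiating depth.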
Self-supervised garment warping via pose and local isometry: The way clothes drape on a body varies with body shape and pose. In virtual try-on applications, where the objective is to transfer a specific garment from one person in an image to another individual in a different pose, it becomes crucial to learn a geometry-aware deformation. We propose an end-to-end trainable, geometry-aware garment warping method that estimates a dense correspondence map between the reference garment and the target image using two self-supervision signals: human pose information and surface normals. We ensure that the 3D geometry of the garment is correlated with the underlying human pose and shape, allowing for consistent correspondence across varying body shapes and postures. Furthermore, we preserve the garment's surface geometry in the estimated dense correspondences by incorporating the concept of isometry via surface normals. We introduce a novel dataset of approximately 50,000 images sourced from YouTube dance tutorials, capturing a wide range of body shapes and poses. Our method synthesizes a high-fidelity image of the garment, complete with realistic wrinkles and folds, in the desired pose. Our approach also offers the flexibility to adapt to different lighting conditions and garment textures.

en
Learning Geometric Representation of Dressed Humans
Thesis or Dissertation