Learning 3D Human Geometry and Appearance via Sparse Multiview Images

Published Date

2024-06

Type

Thesis or Dissertation

Abstract

Humans are arguably the most interesting subjects in computer vision. Modeling 3D human geometry from images captured by highly sophisticated production-level camera systems (10-100 cameras with precise calibration) enables a number of applications, e.g., telepresence, virtual try-on, and motion analysis. Despite their production-level quality, the difficulty of system deployment and the extremely high cost put such setups beyond the reach of most people. On the other hand, as camera-equipped smartphones become an integral part of our everyday lives, capturing priceless moments, social videos voluntarily recorded by multiple viewers watching the same scene, e.g., friends simultaneously filming a street busker, provide a new form of visual input accessible to everyone. My research question is whether it is possible to model humans from social videos at high quality, as if they were captured by a production-level setup. Enabling this would open a new opportunity to model 3D human geometry from in-the-wild data. The main characteristics of such videos are that they are sparse multiview by nature and, in general, not spatially calibrated. These characteristics pose an unprecedented challenge because existing multiview 3D reconstruction approaches do not apply: due to the sparse multiview camera setting, the overlap between social cameras is very limited, making 3D photometric matching difficult, and due to the lack of calibration, existing geometric triangulation does not apply. To date, there is no principled way to integrate multiview social images. In order to reconstruct 3D humans from social videos, I leverage the complementary relationship between 3D geometry and learning, in which each can help the other.

(1) Multiview geometry → learning (Part 1): I design a framework that learns dense keypoint mappings (i.e., correspondences between human pixels and a canonical 3D body surface, agnostic to identities, views, and poses) from unlabeled sparse multiview images with minimal overlap. The key insight is to leverage multiview geometric consistency as a self-supervision signal by enforcing the epipolar constraint on corresponding pixels (those mapped to the same location on the 3D body surface) from different views. I demonstrate that the method outperforms existing methods, including non-differentiable bootstrapping, in terms of keypoint accuracy, multiview consistency, and 3D reconstruction accuracy.

(2) Learning → 3D geometry (Part 2): I develop a learning-based 3D reconstruction method that integrates visual cues from multiview images without spatial calibration and estimates a unified 3D human geometry. The key idea is to treat the commonly observed human body as a semantic calibration target and to use the pre-learned dense keypoint mappings to semantically align visual features from multiview images on a canonical 3D body surface, where the features are fused to predict 3D human body shape and pose. I demonstrate that this calibration-free multiview fusion method reliably reconstructs 3D body pose and shape, outperforming state-of-the-art single-view methods with post-hoc multiview fusion, particularly in the presence of non-trivial occlusion, and achieving accuracy comparable to multiview methods that require calibration.

Given reconstructed 3D human geometry, I further establish an approach to create geometry-anchored, animatable 3D head avatars with photo-realistic appearance from sparse inputs per user, e.g., just a few selfies casually taken from different views with a smartphone (Part 3). The core of this approach is to learn a universal model from a variety of identities across a range of expressions that encodes the generic characteristics of animatable head avatars; this model serves as a prior that can be adapted to a new subject given only a few images. I demonstrate that this approach produces compelling results and outperforms existing state-of-the-art methods for few-shot avatar adaptation, paving the way for more efficient and personalized avatar creation.

To facilitate modeling 3D human geometry and appearance, I create a large multiview dataset of human body expressions (Part 4). 107 synchronized HD cameras are used to capture 772 distinctive subjects, spanning gender, ethnicity, age, and physical condition, performing predefined actions. From the dense multiview image streams, I reconstruct high-fidelity body expressions using 3D mesh models, which allows view-specific appearance to be represented in their canonical atlas. I demonstrate that the dataset is highly effective for learning and reconstructing a complete human model and is complementary to existing datasets of human body expressions with limited views and subjects, such as the MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio datasets.

In summary, this thesis presents three closely related methods for learning human geometry and appearance from sparse multiview images. Their inputs and outputs are linked in a chain, 2D dense keypoints → 3D geometry → appearance, with the output of one serving as the input to the next. In addition, it introduces a large multiview dataset of human body expressions to facilitate this goal.
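The self-supervision signal in Part 1 rests on a standard result from multiview geometry: if two pixels in different views map to the same point on the canonical body surface, they must satisfy the epipolar constraint x2ᵀ F x1 = 0, and the residual of that constraint can be penalized as a training loss. The sketch below illustrates this idea only; it is not the thesis code, and the fundamental matrix, the point arrays, and the function name `epipolar_loss` are assumptions made for the example.

```python
import numpy as np

def epipolar_loss(F, pts1, pts2):
    """Mean symmetric epipolar distance for putative correspondences.

    F    : (3, 3) fundamental matrix mapping view-1 points to
           epipolar lines in view 2
    pts1 : (N, 2) pixel coordinates in view 1
    pts2 : (N, 2) corresponding pixel coordinates in view 2
    """
    n = len(pts1)
    # Lift to homogeneous coordinates
    x1 = np.hstack([pts1, np.ones((n, 1))])  # (N, 3)
    x2 = np.hstack([pts2, np.ones((n, 1))])  # (N, 3)

    Fx1 = x1 @ F.T    # epipolar lines in view 2, one per row
    Ftx2 = x2 @ F     # epipolar lines in view 1, one per row

    # Algebraic residual x2^T F x1 for each correspondence
    num = np.sum(x2 * Fx1, axis=1) ** 2
    # Normalize by the line gradients to get a geometric distance
    denom = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return np.mean(num / denom)
```

In a learned-correspondence setting, `pts1` and `pts2` would be pixels that the network maps to the same canonical surface location; driving this loss to zero enforces the multiview consistency described above without any keypoint labels.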

Description

University of Minnesota Ph.D. dissertation. June 2024. Major: Computer Science. Advisor: Hyun Soo Park. 1 computer file (PDF); xix, 132 pages.

Suggested citation

Yu, Zhixuan. (2024). Learning 3D Human Geometry and Appearance via Sparse Multiview Images. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/269607.
