Metaverse in the Wild: Modeling, Adapting, and Rendering of 3D Human Avatars from a Single Camera






Thesis or Dissertation


The metaverse is poised to enter our daily lives as a new form of social media. One positive application is tele-presence, which lets users interact with others through photorealistic 3D avatars using AR/VR headsets. Such tele-presence requires high-fidelity 3D avatars that depict fine-grained appearance, e.g., pores, hair, and facial wrinkles, from any viewpoint. Previous works have used multiview camera systems to generate 3D avatars, which enables measuring the appearance and 3D geometry of a subject. Deploying such large camera systems in everyday environments, however, is often difficult in practice because they require camera infrastructure with precisely controlled lighting. In this dissertation, I develop a computational model that learns from data to reconstruct, from a single camera, a 3D human avatar whose quality is equivalent to that obtained from a multi-camera system.

The main challenge in learning to reconstruct a 3D avatar from a single camera is the lack of 3D ground truth data. The distribution of human geometry and appearance is extremely diverse, depending on a number of parameters such as identity, shape (slim vs. fat), pose, apparel style, viewpoint, and illumination. A data-driven model must learn from data that spans such diversity, yet no such data exists to date. I address this challenge by developing a set of self-supervised algorithms that learn a generalizable visual representation of dynamic humans in order to reconstruct a 3D avatar from a single camera; to adapt the 3D avatar to unconstrained environments; and to render the fine-grained appearance of the 3D avatar.

[Learning to reconstruct a 3D avatar from a single view image.] Large 3D ground truth data are required to learn a visual representation that describes the geometry and appearance of dynamic humans. I collect a large corpus of training data from a number of people using a multi-camera system, which allows measuring a human with minimal occlusion. 107 synchronized HD cameras capture 772 subjects across gender, ethnicity, age, and garment style with assorted body poses. From the multiview image streams, I reconstruct 3D mesh models to represent human geometry and appearance without missing parts. By learning from the images and the reconstruction results, the AI model can generate a complete 3D avatar from a single view image.

[Learning to adapt the learned 3D avatar to general unconstrained scenes.] The quality of the learned 3D avatar often degrades when the visual statistics of the testing data deviate substantially from those of the training data, e.g., the lighting in the controlled lab environment (training) is very different from that of the unconstrained outside environment (testing). To mitigate this domain mismatch, I introduce a new learning algorithm that adapts the learned 3D avatars to unconstrained scenes by enforcing spatial and temporal appearance consistency, i.e., the appearance of the generated 3D avatar should be consistent with the one observed in the image of the unconstrained scene and with the one generated at the previous time step. Applying these consistency constraints to a short sequence of testing images makes it possible to refine the visual representation without any 3D ground truth data, allowing high-fidelity 3D avatars to be generated anywhere.

[Learning to render fine-grained appearance of the 3D avatars from diverse people.] High-quality geometry is the main requirement for fine-grained appearance rendering of a 3D avatar. However, the learned visual representation is designed to reconstruct such geometry only for a limited number of people (e.g., a single subject), because 3D ground truth data does not exist for subjects outside the training data. I bypass this problem by introducing a pose transfer network that learns to render fine-grained appearance without high-quality geometry. Specifically, a pose encoder encodes the pose information from a 3D body model that represents the coarse surface geometry of general undressed humans, and an appearance decoder generates the fine-grained appearance (sharp 2D silhouette and detailed local texture) that reflects the encoded body pose for a specific subject seen in a single image. I further embed a 3D motion representation into the encoder in the form of temporal derivatives of the 3D body models observed from a video, which allows the decoder to improve physical plausibility by rendering motion-dependent texture, i.e., wrinkles and shading on the clothing that are induced by human movements. Eliminating the requirement for high-quality geometry gives the rendering model strong generalization to any person from a single image or video.

In the experiments, I demonstrate that the reconstructed 3D avatar is accurate and temporally smooth; the learned visual representation generalizes well to diverse scenes and people; and the rendering results of the 3D avatars are photorealistic compared to previous 3D human modeling and rendering methods. Beyond social tele-presence, the learned human visual representation also enables various other applications: I apply it to creating bullet-time effects, image relighting, virtual navigation of a 3D scene with people, motion transfer, and video generation from a still image.
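The abstract mentions a motion representation built from temporal derivatives of 3D body models observed from a video. As a rough illustration only, and not the dissertation's actual implementation, the sketch below approximates such derivatives with backward finite differences over per-frame 3D points. The function name `temporal_derivatives`, the `Frame` type, the zero velocity assigned to the first frame, and the `fps` parameter are all assumptions introduced for this example.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Frame = List[Vec3]  # hypothetical layout: one 3D point per body-model vertex or joint


def temporal_derivatives(frames: List[Frame], fps: float) -> List[Frame]:
    """Approximate per-point velocity with backward finite differences.

    Assumption: the first frame has no predecessor, so its velocity is zero.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    velocities: List[Frame] = []
    for t, frame in enumerate(frames):
        if t == 0:
            velocities.append([(0.0, 0.0, 0.0) for _ in frame])
            continue
        prev = frames[t - 1]
        if len(prev) != len(frame):
            raise ValueError("all frames must have the same number of points")
        # (current - previous) / dt, where dt = 1 / fps
        velocities.append([
            tuple((c - p) * fps for c, p in zip(cur, old))
            for cur, old in zip(frame, prev)
        ])
    return velocities
```

In a pipeline like the one described, such per-frame velocities could be concatenated with the pose features before encoding; how the dissertation actually combines them is not specified in this abstract.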


University of Minnesota Ph.D. dissertation. 2022. Major: Computer Science. Advisor: Hyun Soo Park. 1 computer file (PDF); 202 pages.


Suggested citation

Yoon, Jae Shin. (2022). Metaverse in the Wild: Modeling, Adapting, and Rendering of 3D Human Avatars from a Single Camera. Retrieved from the University Digital Conservancy,
