Title
Evaluating Robotic Manipulation with Depth Data and Pretraining
Authors
Hawver, Mason; Diaz, Ryan; Cui, Hanchen
Published Date
2024-12-05
Type
Presentation
Poster
Video or Animation
Abstract
Visual imitation learning (IL) has been approached through end-to-end learning and pre-training methods. While pre-training on large datasets like ImageNet improves sample efficiency, the resulting encoders often struggle with out-of-distribution (OOD) data and are typically frozen rather than updated alongside the policy. Recent studies suggest that multi-modal pre-training can enhance the robustness of downstream policies. In this project, we propose a novel approach to pre-training on in-distribution robotic manipulation datasets, integrating multi-modal sensor data and task-specific objectives to improve robustness. Our goal is to train a simulated robot to perform contact-rich tasks, such as T-push, rearrangement, three-piece assembly, and coffee assembly, and to compare our method with existing approaches.
We will collect a multi-modal dataset using the Robosuite simulator and augment it with demonstrations generated via the MimicGen framework. A Vision Transformer (ViT) will be trained using self-supervised learning to process masked multi-modal inputs, including RGB and depth images, force-torque sensor readings, and proprioceptive data. The resulting latent embeddings will serve as inputs for policy learning, implemented through behavior cloning with recurrent neural networks (BC-RNN) and diffusion policy learning. We will evaluate our method against other pre-trained visual encoders, measuring task success rates and robustness to distributional shifts. Our work aims to demonstrate the effectiveness of multi-modal pre-training in enhancing the performance and generalization of robotic manipulation policies.
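Since the record itself contains no code, the following is a rough, self-contained PyTorch sketch of the pipeline the abstract describes: an MAE-style masked multi-modal encoder feeding a BC-RNN head. Every module name, tensor shape, and hyperparameter below is an illustrative assumption, not taken from the project's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalMaskedEncoder(nn.Module):
    """Embed RGB-D patches plus force-torque and proprioceptive vectors as
    tokens, mask a random subset of patch tokens, and reconstruct them."""
    def __init__(self, patch_dim=16 * 16 * 4, ft_dim=6, proprio_dim=7, d_model=256):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)    # RGB+depth = 4 channels
        self.ft_embed = nn.Linear(ft_dim, d_model)          # 6-axis force-torque
        self.proprio_embed = nn.Linear(proprio_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode_head = nn.Linear(d_model, patch_dim)

    def forward(self, patches, ft, proprio, mask_ratio=0.5):
        # patches: (B, N, patch_dim); ft: (B, ft_dim); proprio: (B, proprio_dim)
        tok = self.patch_embed(patches)                           # (B, N, D)
        B, N, D = tok.shape
        mask = torch.rand(B, N, device=tok.device) < mask_ratio   # True = masked
        tok = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, D), tok)
        extra = torch.stack([self.ft_embed(ft), self.proprio_embed(proprio)], dim=1)
        z = self.encoder(torch.cat([tok, extra], dim=1))          # (B, N+2, D)
        recon = self.decode_head(z[:, :N])
        loss = F.mse_loss(recon[mask], patches[mask])             # score masked patches only
        return loss, z.mean(dim=1)                                # pooled latent (B, D)

class BCRNNPolicy(nn.Module):
    """Toy BC-RNN head: map a sequence of pooled latents to an action sequence."""
    def __init__(self, d_model=256, action_dim=7):
        super().__init__()
        self.rnn = nn.LSTM(d_model, 128, batch_first=True)
        self.head = nn.Linear(128, action_dim)

    def forward(self, latent_seq):             # (B, T, d_model)
        h, _ = self.rnn(latent_seq)
        return self.head(h)                    # (B, T, action_dim)

# Toy usage with random stand-ins for a 14x14 grid of 16x16 RGB-D patches,
# a 6-axis force-torque reading, and 7-DoF proprioception.
enc, policy = MultiModalMaskedEncoder(), BCRNNPolicy()
patches = torch.randn(8, 196, 16 * 16 * 4)
ft, proprio = torch.randn(8, 6), torch.randn(8, 7)
recon_loss, latent = enc(patches, ft, proprio)     # self-supervised pretraining loss
recon_loss.backward()                              # one pretraining step (optimizer omitted)
actions = policy(latent.detach().unsqueeze(1))     # (8, 1, 7): one-step "sequence"
```

In the actual project, the latents would come from frames of Robosuite/MimicGen demonstrations rather than random tensors, and the same embeddings could equally feed a diffusion-policy head in place of the LSTM.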
Keywords: Visual imitation learning, multi-modal pre-training, robotic manipulation, Vision Transformer, self-supervised learning, contact-rich tasks.
Description
This UROP submission consists of a proposal outlining the project's goals, a video visualization of the collected data, a poster of the results we achieved, and a video walking through those results.
Funding information
This research was supported by the Undergraduate Research Opportunities Program (UROP).
Suggested citation
Hawver, Mason; Diaz, Ryan; Cui, Hanchen. (2024). Evaluating Robotic Manipulation with Depth Data and Pretraining. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/269714.