Browsing by Subject "Vision Transformer"
Now showing 1 - 1 of 1
Item: Evaluating Robotic Manipulation with Depth Data and Pretraining (2024-12-05)
Authors: Hawver, Mason; Diaz, Ryan; Cui, Hanchen

Abstract
Visual imitation learning (IL) has been approached through end-to-end learning and pre-training methods. While pre-training on large datasets such as ImageNet improves sample efficiency, it often struggles with out-of-distribution (OOD) data and fails to update the encoder alongside the policy. Recent studies suggest that multi-modal pre-training can enhance the robustness of downstream policies. In this project, we propose a novel approach to pre-training on in-distribution robotic manipulation datasets, integrating multi-modal sensor data and task-specific objectives to improve robustness. Our goal is to train a simulated robot to perform contact-rich tasks, such as T-push, rearrangement, three-piece assembly, and coffee assembly, and to compare our method with existing approaches. We will collect a multi-modal dataset using the Robosuite simulator and augment it with demonstrations generated via the MimicGen framework. A Vision Transformer (ViT) will be trained with self-supervised learning to process masked multi-modal inputs, including RGB and depth images, force-torque sensor readings, and proprioceptive data. The resulting latent embeddings will serve as inputs for policy learning, implemented through behavior cloning with recurrent neural networks (BC-RNN) and diffusion policy learning. We will evaluate our method against other pre-trained visual encoders, measuring task success rates and robustness to distributional shifts. Our work aims to demonstrate the effectiveness of multi-modal pre-training in enhancing the performance and generalization of robotic manipulation policies.

Keywords: visual imitation learning, multi-modal pre-training, robotic manipulation, Vision Transformer, self-supervised learning, contact-rich tasks.
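The pre-training stage described in the abstract can be illustrated with a minimal sketch: RGB and depth patches are tokenized alongside force-torque and proprioceptive readings, a random subset of tokens is masked, and a Transformer encoder is trained to reconstruct the masked tokens, producing latent embeddings that a downstream policy could consume. All module names, dimensions, the masking ratio, and the reconstruction target below are illustrative assumptions written in PyTorch, not the authors' implementation.

```python
# Hypothetical sketch of masked multi-modal ViT pre-training; architecture
# details (token sizes, mask ratio, loss) are assumptions, not the paper's.
import torch
import torch.nn as nn


class MultiModalMaskedViT(nn.Module):
    """Tokenizes RGB/depth patches plus force-torque and proprioceptive
    readings, masks a fraction of tokens, and reconstructs them."""

    def __init__(self, img_size=84, patch=14, dim=256, layers=4,
                 ft_dim=6, proprio_dim=7, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        n_patches = (img_size // patch) ** 2
        # Patch embeddings for RGB (3-channel) and depth (1-channel) images.
        self.rgb_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.depth_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        # Linear tokenizers for the low-dimensional sensor streams.
        self.ft_embed = nn.Linear(ft_dim, dim)
        self.proprio_embed = nn.Linear(proprio_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, 2 * n_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.decoder = nn.Linear(dim, dim)  # reconstructs token embeddings

    def tokenize(self, rgb, depth, ft, proprio):
        rgb_t = self.rgb_embed(rgb).flatten(2).transpose(1, 2)
        depth_t = self.depth_embed(depth).flatten(2).transpose(1, 2)
        ft_t = self.ft_embed(ft).unsqueeze(1)
        proprio_t = self.proprio_embed(proprio).unsqueeze(1)
        return torch.cat([rgb_t, depth_t, ft_t, proprio_t], dim=1) + self.pos

    def forward(self, rgb, depth, ft, proprio):
        tokens = self.tokenize(rgb, depth, ft, proprio)
        B, N, D = tokens.shape
        # Randomly mask a fraction of tokens across all modalities.
        mask = torch.rand(B, N, device=tokens.device) < self.mask_ratio
        corrupted = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand(B, N, D), tokens)
        latent = self.encoder(corrupted)
        recon = self.decoder(latent)
        # Reconstruction loss computed only on the masked positions.
        loss = ((recon - tokens.detach()) ** 2)[mask].mean()
        return loss, latent  # latent embeddings would later feed the policy


if __name__ == "__main__":
    model = MultiModalMaskedViT()
    rgb = torch.randn(2, 3, 84, 84)
    depth = torch.randn(2, 1, 84, 84)
    ft = torch.randn(2, 6)         # force-torque reading
    proprio = torch.randn(2, 7)    # e.g. joint positions and gripper state
    loss, latent = model(rgb, depth, ft, proprio)
    loss.backward()
    print(loss.item(), latent.shape)
```

In a setup like the one the abstract outlines, the encoder trained this way would be frozen or fine-tuned, and its latent tokens passed to a BC-RNN or diffusion policy head; the specific interface between encoder and policy is not specified in the abstract.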