Learning graph-structured representations for robotic manipulation
Rezazadeh, Alireza
Ph.D. dissertation, University of Minnesota, August 2024.
Major: Electrical/Computer Engineering. Advisors: Changhyun Choi, Karthik Desingh.
1 computer file (PDF); ix, 111 pages.
https://hdl.handle.net/11299/270072
Keywords: Computer Vision; Deep Learning; Robotic Manipulation
Type: Thesis or Dissertation
Language: English

Abstract

Robotic manipulation is evolving to meet the demands of varied and unpredictable environments, moving beyond repetitive tasks performed in controlled industrial settings. Traditional approaches, which rely on extensive manual programming, struggle with scalability and adaptability in such conditions. As an alternative, data-driven robotic solutions learn manipulation skills from sensor observations, including RGB and depth camera data. Recent learning-based methods show promising results, with robots adapting their manipulation skills to new tasks and environments using sensory inputs, significantly enhancing their potential for operational scalability.

A crucial aspect of this learning process is representing the world as observed through sensor measurements. These measurements are often inherently structured, containing spatial and temporal information. For instance, an RGB camera collects a series of images that capture the composition of an environment, depicting objects and their relationships over time. A natural way to formalize this structured information is through graphs, where entities are abstracted as nodes and their pairwise relationships as edges. A graph representation can thus encode sensory information while preserving its inherent structure.

Obtaining effective graph representations from sensor data poses several challenges. Explicitly defining node and edge attributes is impractical for real-world robotic applications. For instance, using off-the-shelf models to detect object states in multi-object scenes does not scale, as it requires extensive ground-truth data for fine-tuning. Defining explicit relationships (edges) between elements (nodes) faces the same limitation. It is therefore essential to develop graph structures in which nodes and edges are learned from sensor observations without direct supervision.

This thesis explores learning graph-based representations of the world, derived from sensor observations, to improve the learning of robotic manipulation tasks. We first integrate sensor observations from multiple modalities (RGB, depth, and tactile) of a single in-hand object into a unified graph-based representation. This representation encapsulates multimodal information about the current state of the object, including the geometric structure of its surface obtained from depth and tactile data. The resulting object-centric graph abstracts the state of the object in real time, including its 3D position and orientation, which is essential for in-hand robotic manipulation.

We then extend from single-object to multi-object environments, addressing the challenge of representation learning for multiple objects from RGB camera data alone. Tasks such as goal-oriented pushing require the robot to predict the outcomes of its actions on the entire scene. By representing the environment as a graph in which nodes correspond to objects and edges to their relationships, we enable the robot to plan actions more efficiently by accounting for multi-object dynamics.
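To make the graph abstraction concrete, the sketch below illustrates, under assumed names and dimensions, how a multi-object scene might be held as per-object latent nodes with fully connected relational edges, and how one round of message passing could predict the effect of a candidate push on every object. It is an illustrative sketch only; the encoder, the "learned" weights, and all sizes are placeholders, not the models developed in this thesis.

# Minimal sketch of a graph scene representation and one relational dynamics step.
# All names, dimensions, and randomly initialized "learned" weights are assumptions.
import numpy as np

rng = np.random.default_rng(0)

NUM_OBJECTS = 4     # e.g. blocks on a tabletop
LATENT_DIM = 8      # per-object latent assumed to come from an RGB encoder
ACTION_DIM = 4      # e.g. push start (x, y) and push direction (dx, dy)

# Nodes: one latent vector per object; edges: all ordered pairs (fully connected).
nodes = rng.normal(size=(NUM_OBJECTS, LATENT_DIM))
edges = [(i, j) for i in range(NUM_OBJECTS) for j in range(NUM_OBJECTS) if i != j]

# Stand-ins for learned parameters of an edge (relation) model and a node (update) model.
W_edge = rng.normal(scale=0.1, size=(2 * LATENT_DIM, LATENT_DIM))
W_node = rng.normal(scale=0.1, size=(2 * LATENT_DIM + ACTION_DIM, LATENT_DIM))

def relu(x):
    return np.maximum(x, 0.0)

def predict_next_latents(nodes, action):
    """One message-passing step: aggregate pairwise messages, then update each node."""
    messages = np.zeros_like(nodes)
    for i, j in edges:
        # Message from sender j to receiver i, computed from both latents.
        pair = np.concatenate([nodes[i], nodes[j]])
        messages[i] += relu(pair @ W_edge)
    # Each object's next latent depends on its current latent, the aggregated
    # messages from the other objects, and the robot action.
    action_tiled = np.tile(action, (nodes.shape[0], 1))
    inp = np.concatenate([nodes, messages, action_tiled], axis=1)
    return nodes + inp @ W_node   # residual update

action = rng.normal(size=ACTION_DIM)           # a candidate push
next_nodes = predict_next_latents(nodes, action)
print(next_nodes.shape)                        # (4, 8): predicted per-object latents

Conditioning the node update on the action is what lets a planner score candidate pushes: rolling this step forward for different actions yields different predicted scene graphs, which can be compared against a goal.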
Our general approach for multi-object environments is to discover unsupervised object-centric representations from visual data and then learn the dynamics of multiple objects within these representations, structured as graphs. Learning dynamics over unsupervised object-centric representations avoids the need for direct supervision of object states, which is often impractical to collect in real-world settings, and instead leverages readily available visual observations. Finally, we address the challenge of making the learned object-centric representations and their dynamics invariant to changes in RGB camera viewpoint. In real-world settings, robots often need to operate under varying angles and perspectives. By learning representations and dynamics that are consistent across viewpoints, we enable the robot to interact with objects from multiple perspectives rather than being constrained to a single fixed camera view.
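The following sketch illustrates the viewpoint-consistency idea in its simplest form: the same scene is encoded from two camera viewpoints into per-object latents, objects are matched across views, and an alignment penalty measures how much the matched latents disagree. The encoder outputs, the greedy matching, and the squared-error penalty are assumptions made for illustration, not the objective used in the thesis.

# Minimal sketch of a cross-view consistency measure for object-centric latents.
# Latents, matching strategy, and loss are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(1)
NUM_OBJECTS, LATENT_DIM = 4, 8

# Stand-ins for object-centric latents produced by an (assumed) encoder for the
# same scene observed from two different camera viewpoints.
latents_view_a = rng.normal(size=(NUM_OBJECTS, LATENT_DIM))
latents_view_b = latents_view_a + 0.05 * rng.normal(size=(NUM_OBJECTS, LATENT_DIM))
latents_view_b = latents_view_b[rng.permutation(NUM_OBJECTS)]  # object order differs per view

def match_objects(a, b):
    """Greedy nearest-neighbour matching of objects across views by latent distance."""
    remaining = list(range(len(b)))
    pairs = []
    for i in range(len(a)):
        dists = [np.linalg.norm(a[i] - b[j]) for j in remaining]
        j = remaining.pop(int(np.argmin(dists)))
        pairs.append((i, j))
    return pairs

def view_consistency_loss(a, b):
    """Mean squared distance between matched per-object latents from the two views."""
    pairs = match_objects(a, b)
    return float(np.mean([np.sum((a[i] - b[j]) ** 2) for i, j in pairs]))

print(view_consistency_loss(latents_view_a, latents_view_b))  # small when the views agree

Minimizing such a penalty during training would push the encoder toward latents that describe the objects themselves rather than the camera pose, which is the property needed for manipulation from arbitrary viewpoints.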