Interface Issues in Robot Scrub Nurse Design

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA

BY

Amer Agovic

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Prof. Nikolaos Papanikolopoulos, Prof. Ahmed Tewfik

December, 2011

© Amer Agovic 2011
ALL RIGHTS RESERVED

Acknowledgements

On a personal level, the most important accomplishment would have to be the realization that reaching one's full potential can only happen with the help of others. I spent much of my time in graduate school in a state of emotional conflict. For the longest time I tried to avoid working for anything or anyone but myself. In hindsight, I realize that painful events outside my control shaped this attitude, which only makes things worse. It is within this context that I came to appreciate the professors and students mentioned here. Only with their help was I able to come this far.

First and foremost I would like to thank my advisors, Prof. Nikolaos Papanikolopoulos and Prof. Ahmed Tewfik. They both demonstrated a great deal of patience and allowed me to make up my mind even if it took a little longer. Prof. Tewfik was there during the time when I was still looking for a problem. In one session we came across medical robotics, and that changed everything. Prof. Nikos later provided the environment in which to grow further. Even after two years of fruitless publication attempts he had faith in me, until one day I finally published my first work. One of the most exciting points came during our application for RSN funding. The manner in which they approached the problem formulation and the presentation showed me that I was in the company of truly great people. From that moment on I tried to imitate what I saw, and it has been the most rewarding experience of all.

Prof. Samuel Levine and Prof. James Holte deserve a lot of the credit for my selection of the robot scrub nurse (RSN) as a problem area. I am truly grateful to know such inspiring individuals. As my teaching supervisor, Prof. Holte took time to find out what I was doing, and together we had numerous brainstorming sessions. His way of looking at problems epitomizes creative and out-of-the-box thinking. Tapping into his experience, I would eventually meet with Prof. Levine and learn about problems in the operating room. It is no exaggeration when I say that Dr. Levine has transformed my view of research and my place in it. I had the whimsical tendency to go off on tangents, but after meetings with Prof. Levine I would get a sense of how to focus and organize intellectual effort. I am convinced that without his help and advice none of my work would be noteworthy.

Other professors who enabled and supported me over the years include Prof. Peter Olver as a committee member, Prof. Maria Gini as my UROP supervisor, and Prof. Jaijeet Roychowdhury as my honors advisor. In addition, I learned a great deal from my teaching supervisors Prof. Robbins, Prof. Wang, and Prof. Higman. For their support in researching scrub nurse robots I am very grateful to the people at the Institute for Engineering in Medicine. Their foresight and support allowed me to pursue truly novel research.

Besides the professors, a number of students provided essential help and feedback. My brother Amrudin Agovic took time from his own research to proofread mine. He had me revise my first successful conference paper for six hours straight. It did not sit well with me then, but I sure appreciate it now. Alex Kossett helped me design the RSN gripper.
If I was arrogant enough to wonder why one would need mechanical engineers, then by the time Alex finished designing the gripper that notion had morphed into humility and an appreciation for delegating work. Other students who provided valuable experience and assistance include Robert Martin and Joe Levine. Thank you guys.

Dedication

This achievement would be meaningless without mentioning those people whose lives served as my compass. I dedicate this work to my entire family and especially to my parents. They faced adversity and smiled at it. They had less education but more wisdom than I have. Above everything, they were decent human beings. I would like to especially mention my uncle Tosum. My fascination with electricity started when he suggested a career as a refrigerator repairman ;-)

Abstract

The idea of a robot providing assistance in the medical field is very promising. Due to a myriad of unresolved issues, a full therapist robot is still out of our reach. In this study we approach the problem in an application area that is sufficiently constrained to allow us a certain degree of practical success. Concretely, we present a system description of a robotic scrub nurse (RSN) for microsurgery. We identify robot interfacing as a crucial problem to be solved. In that regard we examine how spoken language can be utilized, how vision can guide robot motion, and finally how humans and robots interact at the haptic level. For the haptic interface we have designed and evaluated a shape-conforming grasp mechanism. Such an approach is uncommon in the robotics industry, where per-task, adapter-style exchangeable grasping mechanisms are mounted on the robot for each run. While industrial robots deal with few instruments and must lift heavy weights, our problem domain contains hundreds of instruments and prohibits any type of crushing or cutting movement. Our results provide an integration study of the various components for assistant robots. Furthermore, we contribute novel insights into the design of grasp planning methods, visual tracking methods, and natural language dialogue systems.

Contents

Acknowledgements
Dedication
Abstract
List of Tables
List of Figures
1 Introduction
2 Background
2.1 Robots in Surgery
2.2 Acceptance of Robots in Surgery
2.2.1 Economic Conditions
2.2.2 Social Conditions
2.3 Natural Language Processing
2.4 Grasp Planning
2.5 Visual Servoing
2.5.1 Depth Recovery
2.5.2 Alignment of 3D Points
2.5.3 Selection of Good Features and Clustering
2.5.4 Structure from Motion
3 Overview
3.1 The Environment: Microsurgery in Otolaryngology
3.2 Instruments
3.3 Robot Platform
3.4 Motion Planning
3.5 Human-Robot Interface (HRI)
4 Haptic Interface
4.1 Introduction
4.1.1 Key Contributions
4.2 Gripper Design
4.3 Haptic Communication Signaling
4.4 Conclusions
5 Grasp Planning
5.1 Introduction
5.1.1 Key Contributions
5.2 Problem Statement
5.3 Proposed Method
5.3.1 Pose Recovery
5.3.2 Practical Consideration
5.4 Results
5.4.1 Pose Alignment Validation
5.4.2 Selection Strategy of Pairwise Features for Grasp Planning
5.4.3 Grasp Planning Results
5.5 Conclusion
6 Visual Tracking
6.1 Introduction
6.1.1 Key Contributions
6.2 Problem Statement
6.3 Proposed Method
6.3.1 Assumptions
6.3.2 Features
6.3.3 Depth Recovery
6.3.4 Pose Alignment
6.3.5 Selection of Good Features
6.3.6 Projective Reconstruction
6.4 Results
6.4.1 Experimental Setup
6.4.2 Depth Recovery Validation
6.4.3 Good Features to Track
6.4.4 Tracking Validation
6.5 Conclusion
7 Spoken Command Interface
7.1 Introduction
7.1.1 Key Contributions
7.2 The Environment
7.3 Setup
7.4 Parsing and Syntactic Analysis
7.5 Meaning and Semantic Analysis
7.6 Data
7.7 Evaluation
7.8 Conclusion
8 Summary
8.1 Our Contribution
8.2 Robot Scrub Nurse in Action
8.3 Remaining Issues
8.4 Future Work
References
Appendix A. Glossary and Acronyms
A.1 Glossary
A.2 Acronyms

List of Tables

6.1 Tracking error
6.2 Tracking error, continued
7.1 SPHINX-II/POCKETSPHINX recognition performance
7.2 Role of CYK parsing on recognition performance
7.3 Role of semantic analysis on recognition performance
A.1 Acronyms

List of Figures

1.1 Problem setting and interfaces
3.1 Main system components
3.2 Microsurgery operation
3.3 Instrument panel
3.4 Motion planning using potential fields
3.5 Nurse-surgeon handshake
4.1 Behavior model
4.2 Point contact vs. surface contact
4.3 Gripper assembly (right) and its computer model (left)
4.4 Gripper state transitions
4.5 Active state: jaw open, membrane inflated, and control law on
4.6 Detecting when to take the instrument
4.7 Detecting when to give the instrument
4.8 Valid instrument: grasping a knife
4.9 A mistake: grasping a finger
4.10 Transporting instrument: start, move and stop
5.1 Gripper sampling
5.2 Friction cone
5.3 Design of features
5.4 Pose recovery
5.5 Pairwise feature alignment planner
5.6 Case for penetration test
5.7 Selected targets of varying polygon count
5.8 Alignment error
5.9 Feature selection strategies
5.10 MDS projection of gripper and target in feature space
5.11 Quality of alignment vs. quality of grasp
5.12 An example of alignment grasps
5.13 An example of alignment grasps
5.14 An example of alignment grasps
6.1 Computer vision pipeline
6.2 Computer vision model acquisition
6.3 Recovering depth from three points
6.4 Feature correspondence (with and without radius rejection)
6.5 Clustering process (across frames and within cluster)
6.6 Feature clustering algorithm
6.7 Selected objects (baseline + tags)
6.8 Projection error
6.9 Manual inspection
6.10 SURF feature distribution (embedded in 2D using MDS)
6.11 SIFT feature distribution (embedded in 2D using MDS)
6.12 Well-behaved cluster (4/200 frames)
7.1 Speech pipeline
7.2 Incremental CYK parse snapshot
7.3 Incremental CYK algorithm
7.4 Robot task knowledge model
7.5 Gripper knowledge model
7.6 Visual context
7.7 Surgical workflow
7.8 Execution plan
7.9 Dialog manager
8.1 Initial arm position
8.2 Arm above tray before loading
8.3 Loading instrument from tray
8.4 Offering instrument to user
8.5 Taking instrument from user
8.6 Unloading instrument to tray

Chapter 1
Introduction

The field of medical robotics can be split into two segments depending on the interaction between operator and machine. The first segment includes robots that enable the operator. These robots stand between the operator and the task. One very prominent example is the DaVinci robot [1]. The other segment includes robots that assist the operator. Such assistant robots adhere to the philosophy that the operator knows best. The assumption here is that the operator performs the primary task almost perfectly; in that case, introducing an intermediary would only break the flow. One example would be the hand-eye coordination surgeons develop. By researching human-robot interface (HRI) problems for a Robot Scrub Nurse (RSN), our work can be placed in the segment of assistant robots. Interfacing for enabling robots simplifies to user interface (UI) design. The interfacing problem for assistant robots is more challenging. Because the robot interacts directly with human operators, the workspace is highly unstructured. A robot assistant has to demonstrate that it is safe for humans and that it reliably replicates the desired activity. Using industrial-grade technology and methods is usually not a good option.
Industrial settings are more controlled environments. Generally speaking, our work attempts to relax the constraints on the environment needed for robot deployment. Ultimately the environment should allow interacting robot applications in human workspaces.

Our study of interfacing issues is grounded in a real-world application. From that perspective this research tends to emphasize empirical and engineering methodologies. Our application of choice is an assistant robot which would serve in the operating room and automate responsibilities currently assumed by scrub nurse technicians. Our effort to design and develop an RSN system provides us with an abundant pool of problems. Above all other problems we have identified the human-robot interface as the most important. Therefore we give an overview of a Robot Scrub Nurse system but then dedicate greater detail to three interface modalities in particular: the haptic interface (grasp planning), the visual interface used to guide the robot, and the spoken command interface used to issue commands and query the state of the robot. The overall problem setting is best illustrated in Figure 1.1.

Figure 1.1: Problem setting and interfaces

What makes a robotic scrub nurse an attractive application is that we can still make certain simplifying assumptions (Figure 1.1). The ability to constrain the problem is essential. The activities a scrub nurse performs are repetitive, costly, and prone to errors. In microsurgery for Otolaryngology, which is the operating procedure that we have examined, the surgeon makes use of several hundred delicate and costly instruments. The surgeon, using an operating microscope, does not want to look away to get an instrument. It takes a surgeon time to get his/her mind oriented to the surgical field. Looking away delays the surgeon, allows fluids to fill the area of surgical interest, and sometimes requires moving tissue to get an appropriate look at the internal structures.

The surgeon-nurse interaction begins with the surgeon requesting an instrument by spoken word. The nurse uses his/her experience and training to identify it, picks it up without dropping it, moves it without hitting obstacles, and places it on the stretched-out palm of the surgeon by means of visual and haptic (touch) feedback. The nurse makes sure that the instrument does not cut the physician and does not violate sterility. This movement must occur in a timely manner. Sources of delay could come from difficulty in identifying the correct instrument, dropping it, or losing sterility by other means. A good, experienced nurse will handle the instruments with care and will be intimately familiar with the procedure so that she can proactively deliver the next instrument, even before it is requested. The scrub nurse, as the manager of instruments, also makes sure that no instruments are left inside the patient after the end of the procedure. Due to the repetitive nature of the job, scrub nurses usually rotate every 30-60 minutes. During the change-over the new nurse must be quickly briefed and the surgeon needs to wait. In general, scrub nurses have developed handbooks on how to behave to minimize risks such as forgetting instruments or violating sterility. Clearly a robot must replicate all of the above steps. In fact, there are a number of issues which characterize the overall field:

– A scrub nurse is expensive: the average compensation per year is $44,000 [2].
– There is a shortage: in 2008 a combined 91,500 were employed in US hospitals across all specialties [2].
– Monotonous: memorizing hundreds of almost-identical instruments is not very motivating, but forgetting one inside the patient is catastrophic.
– Procedural complexity: regulatory guidelines require that a scrub nurse be rotated periodically, and per case more than one nurse is needed (per shift, per specialty, per hospital).
– Point of delay: dropping an instrument leads to immediate delays and an overall cost increase.
– Point of delay: a nurse will delay an operation if unfamiliar with the instruments (sheer number), the procedure (specialty type), or the present situation (after rotation).

Given the set of aforementioned issues, we note that a small hospital which does not staff multiple scrub nurses is unable to perform an operation such as ear surgery. On the other hand, it is estimated that there are about 1 million PE tubes inserted per year in the US [3], so the demand for ear surgery is clearly increasing. A robot could potentially alleviate all of these issues by being easier to maintain, quicker to update with procedural changes, faster at responding to a surgeon, more accurate at managing instruments, and allowing more efficient and fulfilling use of human resources.

Within this application setting we show how different technologies are integrated into a whole system. We present our contributions in the research of haptics and grasp planning, visual servoing, and natural language processing. While all of these fields are more or less mature, none of them has been considered carefully for assistant robots like the RSN. To research haptic interfacing we designed and studied a novel shape-conforming grasp mechanism. Utilizing an inflatable membrane, our gripper is able to generate a soft grasp, which is essential for human interaction. Handing instruments to surgeons and taking them back involves domain-specific knowledge of what could be called haptic communication protocols. The exchange of instruments must be repeatable, safe for humans, and safe for the material. Besides using touch to actuate instrument exchange, we looked at the problem of localization. Localizing the instruments on the instrument tray is not as challenging as locating the hand of the surgeon. Complicating matters is the possibility of obstructions such as the microscopes used by the surgeons and the various hoses attached to instruments. Of all the sensing modalities, computer vision offers the most versatility. However, computer vision is still not robust enough for arbitrary tasks. In response, we studied how the environment of the task can be engineered. We looked at various geometric tag patterns and their fidelity in being tracked accurately and robustly. Finally, we examine the spoken command interface. In order for the robot assistant to be easily accepted into an environment which is well established and procedurally rigid, the technology being introduced must be as unobtrusive as possible. A speech recognition and synthesis system that understands natural human language would be optimal in that regard. Our work adapted an existing solution and measured its effectiveness in speech recognition tasks. To that end we made use of context-free grammars to constrain syntax and improve accuracy. As the final step of the spoken command problem, a dialogue management system is presented. In the rest of this document we first survey the existing literature.
After that we describe the application area for which the RSN is needed and present an overview of the various RSN components and problems. Next, the interfacing modalities are explored further. Within the haptics section we go into more detail in two areas: first we present our gripper design and high-level haptic signaling, then we cover the more abstract problem of grasp planning. These results are followed by our work on vision and speech. The final chapter summarizes and discusses our experience in designing, developing, and deploying the system.

Chapter 2
Background

Our review of existing work is split into several parts. First, a short review of robotics in surgery is given. A lot of the available work concerns enabling robots and discusses their performance in facilitating surgical procedures. We place emphasis on robot assistants and review work in that direction. The next section explores factors affecting the adoption of surgical robots. It is helpful to understand what drives and restrains research in medical robotics. Understanding why robots are so expensive, why only a few companies dominate, or how hospitals purchase medical robots should affect our research into robot interfaces. In that spirit we examine economic and social conditions which determine the acceptance rate of this new technology. In the remaining three sections we cover more technical work that supports our research on interfaces. We examine research in natural language processing with an emphasis on dialog management. After that we cover work on grasp planning and finally methods needed for computer vision.

2.1 Robots in Surgery

The field of surgical robotics is divided into two groups. The first type of robots consists of devices which stand directly between the surgeon and the patient. For example, in [1] the robot DaVinci has two components: a console, at which an operator sits, and a possibly remotely controlled instrument stand which works on the patient. Such systems make tele-operation possible and can amplify the ability of the operator (e.g., tremor reduction). The drawback of such systems, and sometimes the reason for their rejection by surgeons, is that they break hand-eye coordination.

The other type of robot, in which category the RSN falls, is called an assistive robot because it stands to the side and provides aid. Penelope is the first commercial scrub nurse robot that was recently developed [4]. This robot and the accompanying patent address almost all features that any similar robot needs to have, such as a spoken language command interface, path planning to deliver instruments, etc. One aspect that was not considered in this work is the grasping mechanism. Instead the authors use a simplistic gripper which only allows deployment in surgeries where the surgeon can actively look for the instrument and grasp it.

Development of a trauma care assistant is detailed in [5]. The aim was to develop a scrub robot to assist a tele-operating robot like DaVinci in environments where highly skilled support staff is unavailable. The most prominent challenge is the interface between the nurse robot, the instruments, and the target robot. Unlike our approach, the route by which they attempt to solve the problem is more mainstream in that they try to expand on methods already being used in industry. The end result is a flexible instrument adapter which is amenable to calibration and allows for compensation of errors.

In [6] an example of task knowledge modeling is presented.
The authors, while working towards a scrub nurse robot, have correctly identified that a prerequisite is the ability to observe and model the behavior of human scrub nurses. To that end a timed automaton model is proposed which is trained in an unsupervised fashion. This work is further expanded in [7], where the timed automaton is then used to analyze recorded patterns and to draw some specific recommendations about the ultimate design needs. An important aspect that this work raises is that of human adaptivity. This concept has to be incorporated into the robot design and is a focal point for our work.

Another example of the task modeling problem is the work in [8]. Here the authors look at a well-behaved type of surgery (laparoscopy) and devise a system which allows for logging of instrument usage. The analysis of such data is not only used to model scrub nurse behavior patterns but also to analyze and understand their nature more clearly. This work raises the interesting issue of record keeping, which would be necessary as part of a feedback system of any scrub nurse replacement.

When we think of assistive robots, an alternative viewpoint would be the direction taken by the authors in [9]. Their aim has been to create robots that care or give comfort. The idea of a robot acting as a therapist is powerful, but we believe that before it becomes reality, interface-related issues such as those examined here will need to be addressed.

2.2 Acceptance of Robots in Surgery

The proliferation of robots in surgery is not a straightforward matter. Clearly robots eliminate the need for human labor, and that is bound to face resistance. Furthermore, robots are expensive and there are only a few dominant manufacturers. It is useful to understand these circumstances. Our look at factors affecting robot acceptance was prompted by two developments during the research. During our first visit to the operating room we failed to mention to the staff the purpose of our visit, which was to observe them for purposes of automating away some of their responsibilities. Naturally, on our second visit, after they had learned more about it, their attitudes changed. At another time, closer to the end of the research, the office of technology commercialization (OTC) and a local robotics company expressed interest in our work. However, shortly thereafter they declined to pursue the idea further. Why this happened motivated our review in this section.

2.2.1 Economic Conditions

According to analysis by Frost & Sullivan [10], the market in surgical robots is divided into three tiers. Tier I is dominated by one or a few major players. For example, on the RSN side, Robotic Systems & Technologies Inc. holds a very broad patent. On the enabling robots side, Intuitive Surgical Inc. is the dominant player. Tier II is formed by developers of specialized tools catering to products marketed by Tier I players. Finally, Tier III holds external and existing robot companies that are trying to diversify into medical devices. The market as a whole is showing increasing revenues while capital investment is going down. This indicates that the market is not saturated but entry by new companies is made difficult. Possible product categories include the sale of systems, system accessories, and servicing.
As reported by the same analysts in [10, 11] and [12], major market obstacles include:

• High cost to market
• Regulatory and political factors
• Standards and procedures regulations
• Funding support from government
• Resistance from traditionalist surgeons

Private investment is harder to obtain because it requires large upfront capital without immediate results. In such cases funding by government agencies would be beneficial, but it varies from country to country, with Europe placing more importance on this market segment. In [13], the challenges of rehabilitation robots are discussed. The authors postulate that the field is considered orphaned because high capital investment inhibits faster development. As a way to address cost, the authors suggest a collaborative framework model that would utilize the internet as a cost-saving measure. On top of these issues the market introduces novel risks such as equal access to care or robotic failure. The technological drivers and benefits which provide incentive to manufacture and adopt this new technology include [10, 11, 12]:

• Better image visualization
• Shorter surgical time
• Enhanced surgery results
• Improved surgeon ergonomics
• New pathways like telesurgery
• Less pain
• Fewer complications
• Cost reduction
• Better cosmetic results
• Improved accuracy and efficiency
• Workforce efficiency (fewer staff needed)

The competitive factors of products determine their penetration of the market. The following attributes were identified as significant [10]:

• Patient safety
• Access to product
• Quality of material
• Ease of use
• Cost
• Uniqueness
• Strong customer support
• Familiar software usage

The market intelligence presented here played a role in our case. Both the OTC and the local company declined to pursue the idea any further because they could not identify a way to distinguish themselves from Tier I players and remain profitable.

2.2.2 Social Conditions

Our experience with the nursing staff was interesting enough to make us wonder about the social implications of a robot assistant. We deemed it important to investigate which factors would make the RSN more acceptable and looked at previous research to gain a better perspective.

Before robots in medicine, there was a time when robots in industry were not as common. The work in [14] is a bit older but might reveal how robots became so ubiquitous in industry. The favorable adoption factors listed for industrial robots include: a robot champion, experience with automation, existing automation, bad working conditions, and careful assessment of economic and alternative performance. A robot champion is a successful first implementation that shows promise. Interestingly, this study finds that the labor response was not as bad or as crucial as expected. At the same time the authors claim that robots do not always remove monotonous tasks but might introduce other repetitive actions. Robotization shifts the structure of work from people-centered to machine-centered. Finally, an impediment can be the lack of trained staff to work with a robot. Some problems with robotization include: difficulty running the system, selection of the robot model, long development periods, increased managerial effort, organizational resistance, the need for extensive training, and retaining trained personnel.

The effect of culture on the adoption process is examined in [15]. The study compares differences between the United States and Japan and tries to identify attributes which affect the adoption process.
Some of the cultural factors that are influential include individuality, risk avoidance, social status, time perspective, and social competition. Each of these factors affects the adoption rate differently between cultures. The attributes appear in clusters within a culture.

Frequently, technical and monetary factors in the adoption of robots are considered. In [16] the authors look at social psychology and try to derive models that could explain the adoption rate. By applying insights from psychology on how humans perceive technology and robot interaction, it might be possible to design more acceptable interfaces and robots, even for household use. They derive some guidelines that address how technology is perceived and how these perceptions are formed. The factors that affect acceptance include safety, accessibility, practical benefit, fun, social pressure, status gains, expectation of social intelligence, experience, media, and social network. All of these factors should be considered in interface and robot design, which should be mindful of perception.

From a more technical perspective it is interesting to note research that reports on the use of robots in surgery. An early look at robotics in surgery is given by [17]. The authors raise key issues that are still relevant today. They categorize the interaction modes between surgeon and robot along the level of cooperation, from fully autonomous robot operation to fully guided. The level of interaction is determined by the active constraints set on the robot. Another interaction mode mentioned is that of teleoperation and telemonitoring. In their treatment of technical issues they also discuss problems in clinical implementation and acceptance. They see safety as the most critical issue. Beyond safety, other issues mentioned are cost and outcome.

In [18] the authors examine how novel technology affects communication and performance in the operating room. The study sheds new light on how effectively technology enhances performance and how well it is accepted by users. Introduction of technology changes the responsibilities each role assumes. It appears that the scripted communication pattern performs better than either automated or no-rule patterns. The success of communication depends to a great degree on the mental state shared between the nurse and surgeon. One possible explanation is that homogeneous communication allows participants to detect anomalies in speech better. However, while technology might offer performance improvements, it frequently increases the number of steps. In this study the task decomposition shows that the complexity significantly increases and also that the number of modalities involved increases to compensate for limits imposed by new technology.

In [19] a method for the evaluation of robot safety is presented. This work is useful in that it models risks and hazards of a robot in its interaction with the environment. Following the principles laid out in their hazard identification and safety insurance control policy, one obtains a plan not only for avoiding adverse consequences arising from an issue but also for debugging and ensuring quality of service; this is something we might revisit to evaluate our system.

Work in [20, 21] examines recent experience with robots in laparoscopic procedures. In about 10% of cases the robotic procedure had to be converted to a conventional one because it could not be finished with the robot. In one case the robot failed; in others, minor bleeding could not be stopped.
The learning curve for robotic procedures was 10 sessions or more. The main drawbacks seem to be inflexible instruments and a lack of arms. The benefit to the patient is not yet fully evaluated. Besides the long learning curve, some of the pitfalls are an unstable camera, limited motion, two-dimensional imaging, and poor ergonomics.

2.3 Natural Language Processing

One of the early works on comprehensive natural language processing systems can be found in [22]. Here a dialog system is described which is used to direct trains. It is based on available speech recognition software and addresses problems in post-processing of recognized speech, parsing, discourse management, and verbal reasoning. Another interesting study of dialog managers is presented in [23]. The system is called "CommandTalk" and the application area in this case is a front end to a battlefield simulator. Of particular interest is their use of finite state machines (FSMs) to model discourse. Our work is more similar to recent works such as [24, 25]. In both of these cases the environment is a robot assistant in the household. Within these settings an attempt is made to fuse various interface modalities into one decision process. The goal of fusing multiple modalities is to help in disambiguating the state and user intent. As in our case, the dialog manager is based on a finite state machine (FSM). In contrast, we only considered speech even though visual and haptic events are available to the RSN.

An alternative approach to dialog managers can be found in [26]. In this study the dialog manager is based on partially observable Markov decision processes. It shows how such a structure can handle noise and ambiguous cases and how it behaves as conditions degrade. It is motivated by the fact that the intention of the user is only partially observable, so it models the intention of the user rather than the state of the machine. Other interesting dialog managers can be found in [27, 28, 29] and [30].

As part of the high-level dialog management we need to look at the parsing of recognized speech and the modeling of task-related knowledge. Our setting asks for an integration of these elements and reasoning under uncertainty. We find some attempts that use statistical methods to couple various levels of the recognition process in [31, 32]. In the first paper the use of context-free grammars is advocated and a lattice-based parsing method is presented. The second paper describes an entire robot programming framework. It also makes use of lattice processing, but the framework (including low-level recognition) is based on dynamic Bayesian networks (DBNs). A good resource for a solid treatment of statistically based language processing can be found in [33]. It presents methods for the probabilistic handling of parsing, semantic analysis, and discourse.

On the parsing side, one big issue to address is the management of speech repairs. Namely, how should the parser behave if the speaker switches sentences in mid-utterance? In this regard, work in [34] and [35] discusses strategies to detect when such events occur. Frequently the switch is made to better articulate the same contextual meaning. However, we cannot rule out the possibility that the context needs to be switched. By context we understand here a description of the robot's state in the environment. As shown in [36, 37] we can effectively use the context to improve the accuracy and robustness of the discourse manager.
For example, if the robot is moving from the instrument tray towards the surgeon, some sentences will not make sense, and that should be encoded in some way. Very interesting work is presented in [38]. Here the authors present a system for the management of hazards in an aircraft. It uses Bayesian networks to reason about the environment and to sort, schedule, and issue advisories and alerts to the pilot. It is interesting because some aspects of the chaotic nature of an aircraft resemble those of an operating room. Two aspects of this work translate to ours. The first is reasoning about the state of the environment under uncertainty. The other is the management of hazards. The interaction of the robot nurse with the surgeon must be seen primarily as a hazard minimization process. The available literature covering all of these topics is extensive. Our goal here was to combine some of these ideas into a real-time system. Clearly our problem setting alters the importance of various functions, with hazard minimization being at the top.

2.4 Grasp Planning

For an overview of grasp planning we refer the reader to [39] and [40]. While grasp planning overlaps with other fields such as haptics, one can define it as the search for the best way to grasp an object. This problem is complicated by the geometry of the object, the modeling of the grasp contact, the number of contacts, the nature of external forces, and the parameterization of the gripper. The earliest attempts were based on the idea of "form closure", which tries to determine when an object becomes immovable. Form closure is mostly determined by the geometry, but it requires more contacts than are absolutely necessary. If one considers friction as part of the contact model, grasps with fewer contacts can be found. Such grasps are based on force closure.

Work done in [41] and [42] is arguably the basis for many efforts in this area. It starts by considering point contacts with friction. The friction cone is approximated by a pyramid. The overall effect of the contact wrenches (the forces and torques the grasp can apply) is captured by their convex hull, which contains the origin if we have achieved force closure. The distance of the origin to the border of such a convex hull is usually used as the quality metric. Another example of analytic grasp synthesis can be found in [43]. The authors formulate a quality metric based on the Q-distance, which is differentiable and therefore allows simple gradient descent to be used in the search process. An attempt to overcome non-uniformity of the geometry is presented in [44]. Here a grasp quality measure is derived which approximates the grasp wrench space via spherical shapes that account for worst-case disturbances. Another approach to grasp quality functions is shown in [45]; however, the assumption here is that the geometry is smooth and numerically well behaved.

For a more realistic contact model, soft contacts have to be considered. The initial attempt at soft contacts addressed the issue of sliding [46]. Soft contact modeling demands the application of the more elaborate Hertzian analytic model [39, 47]. According to this model, moments and forces at a contact are coupled. The friction cone of the Coulomb model is replaced with a friction limit surface. This appears to be the primary reason why soft contact modeling has not found wider usage. More recently, in [48] an approach was presented which treats limit surfaces in a fashion similar to the friction cones in [41]. The limit surface is approximated by a convex polyhedron.
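To make the convex-hull construction above concrete, the following is a minimal sketch (in Python, using NumPy and SciPy) of a force-closure test with the distance-to-hull quality metric, in the spirit of [41]. It is an illustrative toy rather than the planner developed in this work: the contact list, friction coefficient, and torque scaling are assumed values, and a practical implementation would normalize torques by a characteristic object length and guard against degenerate wrench sets.

    import numpy as np
    from scipy.spatial import ConvexHull

    def friction_cone_edges(normal, mu, num_edges=8):
        """Approximate the Coulomb friction cone at a contact by a pyramid of unit force directions."""
        n = normal / np.linalg.norm(normal)
        # build a tangent basis orthogonal to the contact normal
        a = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        t1 = np.cross(n, a); t1 /= np.linalg.norm(t1)
        t2 = np.cross(n, t1)
        edges = []
        for k in range(num_edges):
            theta = 2.0 * np.pi * k / num_edges
            f = n + mu * (np.cos(theta) * t1 + np.sin(theta) * t2)
            edges.append(f / np.linalg.norm(f))
        return edges

    def grasp_quality(contacts, mu=0.4, torque_scale=1.0):
        """Distance from the wrench-space origin to the boundary of the convex hull
        of the contact wrenches; returns 0.0 if the origin lies outside (no force closure)."""
        wrenches = []
        for p, n in contacts:                      # p: contact point, n: inward surface normal
            for f in friction_cone_edges(n, mu):
                tau = np.cross(p, f) / torque_scale
                wrenches.append(np.hstack([f, tau]))
        # note: a degenerate grasp whose wrenches do not span 6D makes qhull raise an error
        hull = ConvexHull(np.asarray(wrenches))
        # hull.equations rows are [unit_normal | offset] with normal.x + offset <= 0 inside
        offsets = hull.equations[:, -1]
        if np.any(offsets > 0):                    # origin outside the hull: not force closure
            return 0.0
        return float(np.min(-offsets))

    # Example: three fingertip contacts spaced 120 degrees apart around a small disc
    contacts = []
    for ang in (0.0, 2.0 * np.pi / 3.0, 4.0 * np.pi / 3.0):
        p = 0.03 * np.array([np.cos(ang), np.sin(ang), 0.0])
        contacts.append((p, -p / np.linalg.norm(p)))
    print(grasp_quality(contacts))

A two-contact antipodal grasp with hard point contacts would not span the full six-dimensional wrench space and would have to be handled separately, which is one practical reason the soft, area-contact model discussed above is attractive.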
Even if the contacts are understood and a quality metric is formulated, there still remains the issue of gripper parameterization. The problem is one of correspondence, and the question is how we position, orient, and articulate the gripper in order to maximize the grasp quality. One interesting approach has been proposed in [49, 50]. The idea is to find the gripper parameters if one knows the contact points on the target which yield a stable grasp. Both sources settle on an optimization scheme to find the best parameterization. The dilemma here is whether contacts generate a gripper configuration or whether it is the other way around. As a purely analytic method, the need for optimization means that real-time application is limited. In this work we assume that the gripper configuration determines the contacts.

The impact of geometry is considered by the authors in [51]. The shape is assumed to be concave, and convex decomposition is used to divide and conquer the problem. Along similar lines, the use of shape primitives has been proposed [52]. In fact the authors pursue a more heuristic methodology. Besides using shape primitives for geometry representation, the search for grasps occurs over a predetermined set of gripper preshapes. The grasp quality evaluation happens on-line by means of a grasping simulator. Building on the heuristic method of using simple geometric preshapes, the authors in [53] propose a more generic means of object representation. They use hierarchical decomposition trees in terms of quadrics. One disadvantage of the approach is the tree decomposition, which is a clustering problem coupled with the actual fitting of superquadrics. Quadrics were chosen over other decompositions because they encode normals naturally. Similarly, in [54] the authors present work on a prosthetics system to enable human teleoperation of robotic grasping. The system achieves real-time performance by reducing the dimensionality of the gripper DOF space. While this simplifies the search, it yields suboptimal results, which require an on-line validation step to fine-tune the grasp. Instead of looking for the optimal grasp analytically, the aim of the authors was to reduce the search space while still preserving most of the good grasps.

Having considered geometry, contact modeling, grasp quality, and gripper parameterization, the remaining problem is reachability. The search for a good grasp will inevitably yield multiple possible grasps [39]. In [55] the authors examine a scoring method as a way to rank the identified grasps. In particular, this approach looks at how nearby clutter might affect a grasp planning algorithm. The method is based on force closure.

Alignment of 3D objects is a research field in its own right. The field is mainly driven by media content retrieval problems and the registration of medical data [56]. The methods can be roughly split into optimization-based approaches, where object geometry is used directly [57], and methods which use a variety of local and global features, usually invariant to certain transformations. An interesting idea is the use of spherical harmonics [58]. The problem with the majority of these methods is that they try to solve the alignment problem between whole objects; partial matching is evidently more difficult. Shape descriptors have found wide usage in computer vision, where they have led to decent results in classification [59]. An early example of shape-driven grasp planning is presented in [60]. Here we observe the use of antipodal grasps. The underlying basis
The underlying basis 17 is found in force closure and as such this can be considered part of the more general approaches complicated by visual sensing. 2.5 Visual Servoing The problem of visual tracking which is at the core of this paper is quite mature. We have made every effort to review previous work and to make use of it as much as we could. In the following paragraphs we only mention a limited number of sources which played an instrumental role in our effort. Among the integration studies work in [61] is similar to ours as it describes a vision driven grasp planning robot. Further, we only give sparse mention regarding the choice of features used. Two popular feature types that exhibit some invariance are SURF and SIFT [62, 63]. Other invariant features exist but we have not included them [64, 65] or we did not have much success in using them [66, 67, 68]. Finally, we should note that all of the model fitting methods described for depth recovery or point alignment can be embedded inside an optimization routine such as RANSAC [67, 69] to combat outliers. 2.5.1 Depth Recovery Given correspondence, recovering depth is possible even for a single triangle. The sim- plest problem of depth recovery for triangles has been known since 1841 [70]. Usually, the solution to a three point depth recovery yields a fourth order polynomial and hence four possible solutions. In [71] the authors use the same idea. It is derived from pho- togrammetry and is based on the laws of cosines. The end-result is an eight-order polynomial which can be considered as fourth order if constraints are taken into ac- count. For better insight into the theoretical foundation of this result, please see [72]. The ambiguity arising in the three point case is modest because often only two roots are real. Furthermore, the area of the triangle can be used as an additional constraint to obtain a unique result. For the more general cases of four or more points in cor- respondence it is possible to recover depth uniquely. Notable examples can be found in [73, 74]. The work in [74] describes the popular POSIT algorithm and is iterative in nature. More recent work that performs complete depth recovery and alignment is presented in [75]. It is not iterative, requires at least four points and runs in time O(n). 18 2.5.2 Alignment of 3D Points After depth recovery alignment of 3D point clouds is used to recover the pose of the model. A very concise review of four major methods is presented in [76]. Two basic methods we decided to use are [77, 78]. The method in [77] is based on singular value decomposition. The correlation matrix is decomposed into eigenvectors which are then used to recover rotation. An eigenvalue approach is taken by [78]. Here the rotation is obtained by recovering the eigenvector of the largest eigenvalue of a quaternion matrix. In [79] we used an alternative method based on cross products to determine the rotation matrix. Our method requires only two correspondences and does not depend on complex matrix decomposition. In all cases the rotation matrix is determined and then used to recover translation. 2.5.3 Selection of Good Features and Clustering In computer vision the term feature selection appears to be ambiguous. In this paper we have picked the feature type. We do not attempt to construct a feature. Rather, with feature selection we mean attributing to a feature a quality to perform as expected. We want to know if a feature is amenable for correspondence matching. 
Earlier work on feature selection can be found in [80, 81, 82]. In [80] most of the effort is spent on enforcing temporal and spatial coherence. Still, the determination of when a feature is lost is based on how far it wanders from its initial position. Also important for this task is how we measure similarity between features. As shown in [83], the estimation and modeling of feature distributions can have a significant impact on the matching results. Some efforts [84] have bundled this task together with feature extraction in a classifier framework. For correspondence matching, a classifier or quantization framework appears to be a natural setting. The features need to be either rejected or accepted, and if accepted, assigned to one of a limited number of code words. Other interesting routines that look at correspondence are based on nearest-neighbor classification and can be found in [85]. For an extensive review of clustering methods the reader is referred to [86, 87]. Among the various clustering methods, one elegant approach is based on mixture models [88]. For correspondence, however, we need unambiguous cluster assignments. For clear-cut cluster formation, the agglomerative and graph-theoretic clustering methods appear to be a better fit. An optimization idea using graph matching is proposed in [89].

2.5.4 Structure from Motion

Arguably some of the greatest contributions in computer vision from the 1990s involve methods concerning structure from motion. In [67] an entire chapter is dedicated to the various aspects of this problem. A thorough treatment of the subject and a toolbox are presented in [90]. The appeal of this method is its completeness and its treatment of missing points. It is based on matrix factorization. Another useful toolbox is provided in [91]. Besides implementing more classic algorithms like the Tomasi-Kanade method [91], a more recent method from [92] is also available. One issue with any of these methods is how they deal with missing points. The expectation is that a fixed number of points is available for a set of frames. A realistic problem is when a fixed number of points is available for every frame but their composition is not constant; rather, the point sets interlock over the set of frames.

Chapter 3
Overview

The goal of our project is to develop a robot capable of automating scrub nurse responsibilities. A scrub nurse performs a number of tasks, but for our purposes we limited the scope to the task of instrument management and delivery to the primary surgeon. The overall diagram (Figure 3.1) of our proposed system follows similar proposals elsewhere [4]. Some components, such as the robot platform, visual servoing, motion planning, and speech recognition, are necessary requirements. Our first step was always to consider off-the-shelf solutions. In cases where we use existing work we explain how it was integrated and refer the reader to the proper sources for more detail. We frame the role of each interface but then spend most of the time on the haptic interface. For the haptic interface we designed a novel shape-conforming gripper. Our gripper is simple in its mechanical design and uses inflatable membranes as a way to generate shape-conforming grasp surfaces. Such an approach to grasping mechanisms is uncommon in mainstream industry, where the more usual approach is to design specialized grippers for each task. Since we have to manage potentially hundreds of instruments that could cause problems, the grasping mechanism needs to be accommodating. In this regard we present novel work.
As will be shown, such an adaptive and generic gripper allows for the design of locally autonomous responses to ensure that the robot does not crush a human hand or the instrument. The inflatable chamber allows for sensing when the instrument has been removed from the gripper.

Before we go into the details of the essential components, we first provide a description of the problem. In particular, we describe the environment and the instruments used.

Figure 3.1: Main system components

3.1 The Environment: Microsurgery in Otolaryngology

Otolaryngology is the medical field that treats problems of the head, neck, and throat. For example, one common procedure is ear surgery. The surgical procedures are characterized by the delicate nature of the tissue. A typical setup is depicted in Figure 3.2. As can be observed, the surgeon works behind a stereo microscope. Hand-eye coordination is impacted by that fact already. Any change in the field of view breaks hand-eye coordination even more. As a consequence, the surgeon will rely on the scrub nurse technician to hand him/her the tool instead of actively looking for it. A robot scrub nurse like Penelope would not be suitable here, because an essential function of the robot must be to actively seek out the surgeon's hand instead of expecting the surgeon to find the robot. The surgeon will also expect the nurse to take the instrument from his hand when finished. This nurse-surgeon handshake occurs almost entirely at the haptic level (by touch). The number of instruments used in this type of surgery goes into the hundreds. While the surgeon is intimately familiar with his field of expertise, the scrub nurse might not be.

In addition to the surgeon and the scrub nurse, other staff involved in the surgery include an anesthesiologist, a circulating nurse, and any number of additional instructor or student surgeons. During the operation a circulating nurse, who is not sterile, can enter and leave the room as needed. The anesthesiologist is stationary and of little concern. The surgeon is frequently accompanied by a team of students or colleagues. At times the primary surgeon sitting behind the microscope might address either the scrub nurse, the circulating nurse, or one of his colleagues. Due to union and safety regulations the scrub nurse must rotate every 30 minutes. We observed that during one particular procedure the scrub nurse was utilized by the surgeon for 16 minutes in one hour. This number might not be representative of every procedure, but nevertheless it raises the question of scrub nurse downtime.

Figure 3.2: Microsurgery operation

3.2 Instruments

The instruments in microsurgery for otolaryngology are in some ways similar to instruments used by dentists. An example instrument tray is shown in Figure 3.3. They are usually made of metallic materials, but some tools such as pumps could be plastic or rubber based. Most tools are discrete objects, except power tools, which might attach to cords or hoses. For attached instruments the nurse must take care of the instrument and make sure that the hose is not in the way. The working tips of some instruments, such as knives, are so small that, to the untrained eye, it is difficult to distinguish between them. Their shape is mostly planar and they weigh a few ounces at most. Instrument trays will vary from procedure to procedure. They are designed
More streamlined instrument servers could be devised to serve a robot. The exact position of each instrument can be recorded ahead of time for automation purposes. The identification of instruments is amenable to labeling. We have used infrared reflective labels for better visual tracking. Additional magnetic and radio-frequency tags could be used for localization in the 3D space and orientation detection inside the grasping mechanism. Figure 3.3: Instrument panel 3.3 Robot Platform Our robotic platform consists of a PUMA 560 robot controlled by a Linux real-time software platform. The software framework which was chosen is named Robot Operating System (ROS [93]). It models the overall system from Figure 3.1. It is a modular environment whereby each component is serviced by a processing node that runs in a separate thread of execution. The synchronization occurs via message passing. The toughest constraint on the platform is imposed by the motion control node. This node needs to send commands to the PUMA’s control box at constant intervals, otherwise it locks up. 3.4 Motion Planning Our motion planning is based on potential fields followed by non-regular cell decompo- sition of the workspace. We first construct a potential field over the 3D work space. Then a skeleton topology of points furthest from all obstacles is retrieved (white area in 24 Figure 3.4). Such topology represents all free paths in the environment. Using dynamic programming over the skeleton we can find the shortest path to our goal and then con- struct a control law that goes along with the path. In Figure 3.4 we can see how this approach works for the planar case. The robot would travel on a trajectory that follows the red path which is extracted from the connectivity skeleton. Most of the work for motion planning has been described in the following references [94, 95][96, 97][98, 99]. Our contribution was to combine the three concepts above and use them. In contrast Penelope, the commercial product, uses a method based on potential fields alone. Figure 3.4: Motion planning using potential fields 3.5 Human - Robot Interface (HRI) Without doubt designing a robot assistant like the robot scrub nurse presents a domain rich in a variety of interesting research problems. As we realized from the start, our effort needed a focus area. This specialization was a necessary trade off between how complete the final design would become and how much we could further any particular research area and contribute something novel. In light of this trade off we understood that the final design would only be sufficient as a proof-of-concept. Given limited time, man power and funding it was far more important to select a key problem that would advance future studies and designs of assistant robots like RSN. One such key problem is how the robot and the human operator interact. A suc- cessful human robot interface (HRI) is essential for at least two reason. First the robot must execute its functions in a manner that is safe for the human operator. Failure to achieve acceptable system integration levels is arguably the main reason why robots 25 have not been deployed in human workspaces with the same success that they enjoy in more controlled industrial settings. Beyond immediate safety concerns the HRI is im- portant because successful introduction of new technologies depends on the technology being as non intrusive as possible. 
Our analysis of video and audio footage from real surgical procedures showed that instrument hand-off involves a delicate sense of touch. One example in Figure 3.5 shows the handing over of a knife. Figure 3.5: Nurse-surgeon handshake Since human nurses appear to make heavy use of touch sensing at close proximity during hand-off, we concluded that a mechanized replacement would do well to approxi- mate this condition. It is from this perspective that we picked up the problem of haptic communication and grasp planning. Coincidentally to support the haptic interface we needed to examine two additional modalities. These secondary interface modalities in- clude visual servoing and spoken language interpretation. Visual servoing represents a flexible and least intrusive method to sense the grasping process and the environment. On the other hand natural language processing exposes RSN functionality in the most convenient manner possible by forcing the technology to adapt to human communica- tion instead of constraining the human operator. While all three interface modalities have a rich body of previous work their integration is still far from trivial and their functionality for purposes of robot assistants is still not sufficient. This study presents our research on the three interface modalities and how they integrate together to facilitate a safe yet easily deployable robot assistant to automate 26 the responsibilities of a scrub nurse technician. We first go into more detail on the the three modalities and present novel methods to overcome obstacles to their better integration in robot assistants. Then, in the last chapter, we give a critical analysis of our implementation and try to place it within current and future research in this area. Chapter 4 Haptic Interface 4.1 Introduction Having described the responsibility of a scrub nurse and identified key components we have a blueprint of technical requirements a scrub nurse robot must replicate. To model the behavior of the robot we utilize a finite state automata, as it is commonly done [7, 8]. One example of such task modeling is given in Figure 4.1. Near Tray During Motion Near Surgeon Delivering Pause Returning Loaded Grasping Unloaded Releasing Move To Hand Loaded Loading Moving To Tray Unloading Unloaded Initial State Final State Figure 4.1: Behavior model This type of model can act as a context at the planning level during visual servoing, 27 28 dialog management and during grasping. Some of the events which affect the behavioral model include the following: – Near Tray 1. Robot waits for new cycle 2. Robot moves to load instrument from tray – During Motion 1. Robot executes obstacle free motion from tray to surgeon vicinity 2. Robot moves to stand-by location – Near Surgeon 1. Robot executes proximity move and detects placement of instrument 2. Robot executes a fine proximity move to provide instrument 3. Surgeon requests an instrument 4. Surgeon offers instrument back 5. Robot detects, anticipates and locates. We next present our own gripper design which was designed to provide soft touch grasping. We first explain its design and then show how it facilitates human-machine communication at the haptic level. 4.1.1 Key Contributions Our main contribution is the mechanical design of a simple two finger gripper. It uses an inflatable membrane to overcome several challenges. On of these challenges is how to detect haptic events. We introduce here algorithms for haptic signal processing. 
4.2 Gripper Design We started with the simplest possible gripper which is a two finger gripper. However most two finger grippers were designed for industrial settings with fingers being made 29 from hard materials such as plastic or metal. Most grasp planning algorithms take this detail into consideration and derive grasp strategies based on Coulomb’s law of friction or (hard) point contacts. A gripper design using hard contacts can be visualized in analogy as the task of using chopsticks to pick up rigid bodies (Figure 4.2). Figure 4.2: Point contact vs surface contact At any point the two finger hard contact gripper will form two point contacts with the object. Using two point contacts a static force equilibrium can be achieved yielding a stable grasp. However the object is still able to rotate around the axis that connects the two contact points. Compare this to a grasp generated by a human hand (left upper corner of Figure 4.2). Even though two fingers are used, the hand generates a surface contact. The end result is that the grasp stops the object from moving and rotating. During our research we identified this property of the human hand as essential to our design consideration. What appears necessary is a grasping mechanism that generates soft, shape conforming and adaptive surface contacts. Such surface contact is not as easy to deal with because we have to employ a Hertzian contact model but it addresses many of the above mentioned needs. In previous work [79] we showed how the traditional grasp planning approach can be augmented to address this problem. To mimic the human hand and produce soft shape conforming grasps we designed a simple two jaw gripper shown in Figure 4.3. The mechanism has two sections. A back section is used to mount all the electronic and other equipment. On the front section we have the actuators and sensors. The loading plate separates the two sections. At the same time the loading plate also acts as a sterility barrier and a conduit for sensors. For example the central slit is used for an up-close camera sensor. The jaws too are equipped with sensors. 30 Figure 4.3: Gripper assembly (right) and its computer model (left) The lower jaw is immovable and has tactile sensors on the outside to be used during initial proximity movements. The inside of the lower jaw contains a sensor pad which is used to determine the impression an instrument makes and from that the exact orientation. The upper jaw has slits for up-close cameras and other sensors which are to be located behind the loading plate. Here we note a potential advantage a robot has over a human scrub nurse. We designed the system with two cameras in mind. One far- out camera for overall rough localization of the hand and a close-up camera for fine movements. In effect a robot features an eye which is installed inside the palm of the hand, a property which a human nurse does not have. On the inside of the upper jaw we have mounted inflatable membranes. These membranes are what generates a soft, shape conforming grasp. At the same time pressure monitors create a feedback loop which is used to control the membrane pressure. Please notice that as we generate an appropriate grasp we also have means of adjust- ing the grasp and responding to changes. Two changes in particular are very interesting. One is the change due to disturbances during movement. By detecting onset of slippage we can ensure that an instrument is never dropped. Second type of change is related to haptic communication events. 
Namely we can now talk in terms of events such as human subject tugging on the robotic hand because it wants to take the instrument. A very interesting case is detecting if a robot is grasping the wrong thing such as going for the surgeon’s finger instead of the instrument. 31 4.3 Haptic Communication Signaling Having examined the mechanical aspect of our gripper assembly we turn to the signal processing side. Our gripper electronic controller uses an AVR ATmega32 chip. Being of RISC architecture and running at 8MHz provides plenty of computing power right at the gripper. Unfortunately the link between the gripper and our software platform relies an USART-USB bridge which only allows 56K bps. This imbalance in speed and reaction time led us to consider an autonomous reflex behavior. In particular giving the gripper on-site autonomy for certain transient tasks so that they do not have to be micromanaged. The software platform that is further away has to tend to a number of other event sources. Locally at the gripper controller we constructed a simple PD (proportional-derivative) controller which controls the grasping jaw and membrane pressure. The rudimentary state attributes the gripper can assume are: inactive, active, jaw closed, membrane inflated, and loaded/holding. In Figure 4.4 we can observe an overview of states and their interaction. At each state the gripper might utilize a different control law. Figure 4.4: Gripper state transitions Next we present some plots which were obtained during a number of experiments to verify how the simple controller behaves. Again, initially we first tried to use a very simple positional controller but circumstances demanded that we examine the rate of 32 change in membrane pressure as well. We empirically determined all cut-off parameters used by the gripper to decide if it is holding something or not. Making this more generic is part of our future plans. In Figure 4.5 we show the membrane and gripper jaw after activation. Once ac- tivated the gripper tracks the state and our PD controller will affect local actuators such as the jaw position or membrane pressure. In active state the membrane is pres- surized. As you can see from Figure 4.5 we are using about 13 psi at rest. We use a diaphragm micro pump manufactured by Parker-Hargraves that can generate at most 16 psi. That is sufficient for the weight or shape of instruments we need. Please notice a peculiar see-saw pattern in the membrane pressure. The reason for it is a defect in pneumatic plumbing that causes leaking. The leakage would cause a state transition from ’Holding’ to ’Guarding’ and would result in the loss of the instrument. We could have eliminated the leak but instead opted to make our design flexible enough to deal with such eventuality. For that reason you can see a special ’Rescue’ event in the state transition diagram. That is also the reason why our PD controller has to consider the rate of change. Namely at rest the pressure is slowing dropping and that can not be misidentified. Once pressure drops under certain value the controller re-pressurizes. To a degree this behavior can tolerate weakening of the membrane. From the active state the two most important functions are the ability to detect when to release an instrument and when to take an instrument. In Figure 4.6 a human subject is trying to give a knife to the gripper. Around the time of the 36th second, a spike in pressure is visible. 
This comes from the human and is interpreted as a haptic command to close the jaw and grab the instrument. After the jaw closes, we see a slight difference in pressure as now the knife is inside the jaw. In Figure 4.7 we observe the opposite case. Here the gripper is holding something (the same knife again). Again a human attempt to pull it out is detected as a pressure variation and causes the jaw to open. The next two figures demonstrate the difference or the ability to feel. In Figure 4.8 we see the entire process of detecting the signal to take a knife from the human, holding it while the membrane is slowly leaking and then at some point releasing the knife after detecting the tugging motion. However in Figure 4.9 we see what happens when the human tries to place a finger 33 2 4 6 8 10 12 14 16 18 11.5 12 12.5 13 13.5 P re s s u re [p s i] 2 4 6 8 10 12 14 16 18 −1 −0.5 0 0.5 1 Time[s] J a w P o s it io n [O p e n − C lo s e ] Jaw Ent irely Open Pressure Restored Figure 4.5: Active state: jaw open, membrane inflated and control law on 34 35 36 37 38 39 40 13 13.1 13.2 13.3 13.4 13.5 13.6 P re s s u re [p s i] 34 35 36 37 38 39 40 0 20 40 60 80 100 Time[s] J a w P o s it io n [O p e n − C lo s e ] Jaw Open Jaw Closed Signal To Take I nst rum ent Figure 4.6: Detecting when to take the instrument 34 39 40 41 42 43 44 45 13 13.2 13.4 13.6 13.8 14 P re s s u re [p s i] 39 40 41 42 43 44 45 0 20 40 60 80 100 Time[s] J a w P o s it io n [O p e n − C lo s e ] Jaw Closed Jaw Open Tugging At tem pts Figure 4.7: Detecting when to give the instrument 17 18 19 20 21 22 23 24 25 26 27 13 13.1 13.2 13.3 13.4 13.5 P re s s u re [p s i] 17 18 19 20 21 22 23 24 25 26 27 0 20 40 60 80 100 Time[s] J a w P o s it io n [O p e n − C lo s e ] Jaw Open Jaw Closed Jaw Open Signal To Acquire Signal To Release Figure 4.8: Valid instrument: grasping a knife 35 whose profile is unknown to the gripper controller. What we see here is an autonomous response. The gripper tries to close and hold. However it detects that it is not holding the right object so it backs out and tries again. During open house events the subjective impression visitors get is that the gripper appears to know or have intelligence. The objective reality is that the gripper responds based on its current state and analysis of membrane signals. Such behavior pattern is built-in with the use of a soft deformable contact surface. 26 27 28 29 30 31 32 33 34 35 13 13.2 13.4 13.6 13.8 14 14.2 P re s s u re [p s i] 26 27 28 29 30 31 32 33 34 35 0 20 40 60 80 100 Time[s] J a w P o s it io n [O p e n − C lo s e ] Signal To Take Profile Mismatch Trying To Take attempt: I II III IV Figure 4.9: A mistake: grasping a finger Finally in Figure 4.10 we show how the pressure is changing during transport. Here we simulate what happens when the entire robot grabs an instrument at the tray, moves over a distance and then makes an abrupt stop to deliver it to the surgeon. The variation of membrane pressure in this case is due to rocking motion and should not be interpreted as a signal to release or do something else. As you can see the gripper is holding the instrument throughout. 
36 50 51 52 53 54 55 56 57 58 13 13.1 13.2 13.3 13.4 13.5 13.6 P re s s u re [p s i] 50 51 52 53 54 55 56 57 58 0 20 40 60 80 100 Time[s] J a w P o s it io n [O p e n − C lo s e ] Near Tray I n Transport Near Surgeon Open Jaw Closed and Holding Figure 4.10: Transporting instrument: start, move and stop 4.4 Conclusions We presented here our work on the gripper during an attempt to build a robotic re- placement for a scrub nurse technician. Our grasp mechanism design is different and offers a number of advantages over mainstream techniques. At the moment the grasp- ing mechanism is equipped with only the basic sensors and actuators. Next steps will include improving and adding a variety of sensor modalities. Chiefly among them is vision. We would like to fuse them in a control law that will also be more generic and will not have to rely on hard-coded parameters. Our consideration of the mechanical aspect has not been as thorough as it needs to be. The leak in the pneumatic plumbing illustrates the importance of material selection. In the future we would like to examine material selection for the membrane and also for gloving so that a gripper like this one could actually be made sterile. Chapter 5 Grasp Planning 5.1 Introduction Our interest in grasp planning is motivated by the application of robotics in surgical en- vironments. In our case the robot is required to assist the surgeon during microsurgery. For the sake of precision and timing the surgeon needs to keep constant eye contact with the working area. Because of that the instruments are primarily handled by the scrub nurse who delivers them when requested. To automate this instrument delivery process, a delicate grasping tasks needs to be performed. Beyond the usual complica- tions such as obstacles in the environment this particular case also demands that grasp planning be recast as planning a grasp on an instrument already being grasped by the surgeon. In effect the grasp planning that is required is one where the target surface is only partially reachable and graspable. Motivated by this application we looked at the grasp planning problem as whole. In light of our requirements we developed a grasp planning algorithm and also a mechanical actuator. We described the actuator in more detail as part of the haptic interface. With the algorithm we have tried to stay as generic as possible and that explains why we show some results using unnecessarily more complex scenarios and gripper models. The algorithm we present here utilizes shape alignment to predict good grasps. Af- ter establishing correspondence between points on a gripper and a graspable object (the target) our method returns one or more alignment poses. Finding correspondence is made possible by our use of pair-wise shape descriptors. They are invariant to rotation 37 38 and translation and generate a localized and redundant description amenable to deter- mine partial or full alignments. We hypothesize that good alignments (as defined later) should lead to good grasps. The end-result is an algorithm that finds grasps based on on-line evidence as opposed to using off-line precomputed results and that executes the planning process in time, when needed. While this algorithm is helpful in grasp plan- ning it does not represent a complete system. Another problem which we mention here but do not extensively explore is that of characterizing the space of all degrees of free- dom also known as pre-shape selection. Within this space of gripper parameters lie all possible grasps. 
Because the gripper is non-linear and the space high-dimensional brute force enumeration is intractable. Now in the case of a Robot Scrub Nurse the gripper and the instruments are known ahead of time. That means in practice precomputing all good grasps is an option that makes sense. Nevertheless we looked into the problem and describe one possible approach that comes from machine learning community. In this approach we model the act of grasping as a Gaussian process and use a database of proven grasps to train the model so that it can predict which DOF pre-shape goes with any given object. 5.1.1 Key Contributions Our primary contribution of this chapter is the grasping method. The grasping method uses pairwise features to establish geometric alignment. We study how the geometric alignment affects the quality of grasps. To support our method a novel object alignment method is also presented. This method is based entirely on quaternions. Its advantage over mainstream method lies in the lower complexity while maintaining same accuracy and speed. That is a suitable characteristics for embedded environments such as robots. 5.2 Problem Statement In this section we present our problem formulation. We outline conventions and as- sumptions which we use throughout the paper. Given a gripper G(~π) and a target object T we seek to find a set of optimal param- eterizations ~π that yield good grasps. The gripper “G” is modeled as a triangulated mesh (Figure 5.1) and parametrized 39 by ~π whose dimension equals the number of degrees of freedom in addition to the three dimensions for wrist position and to the three for wrist orientation (6D+DOF ). Furthermore “G” is subdivided into “N” fingers with each finger having “M” links. The mesh surface of each link is partitioned into active facets and inactive facets. The active facets are commonly on the inside of the fingers and only they form contacts with the target. While not strictly necessary we also observe that the active facets usually subtend convex shapes [100] also known as grasping convex. Figure 5.1: Gripper sampling The target “T” is also modeled as a triangulated mesh. The mesh is partitioned into occluded and free facets. It is the task of the sensor to paint the facets according to this partitioning. What constitutes a good grasp depends on the application. Frequently a stable grasp is a good grasp but a good grasp might be the one that is sufficiently stable while providing the surgeon the opportunity to also grasp the object in a stable manner. We define an objective function Q(G(~π), T ) that is at maximum for an optimal value of π. Given the grasp quality function our primary goal can be reformulated as ~ˆπ = argmax π Q(G(~π), T ). (5.1) During a grasp, active facets of “G” are in contact with “T”. To limit the complexity we sample a set of “P” contacts {p1, . . . , pP } evenly over the facets. The set of these P samples will be called a constellation. For each of these contacts a set of valid forces within the friction cone can be expressed as an approximation over the “R” edge vectors {d(pi)1, . . . , d(pi)R} of the cone: 40 f(pi) = R∑ j αijdj(pi) (5.2) 1 ≥ R∑ j=1 αij . (5.3) Here f(pi) denotes a contact force at point pi, and αij are non-negative coefficients. Figure 5.2 illustrates this approximation step. pj dj1 dj2 dj3 dj4dj5 Figure 5.2: Friction cone If the sum of all forces acting on a body is zero, that body will not move but it might rotate. We therefore consider both forces and torques. 
The contact wrench is defined as w(pi) = [f(pi), pi × f(pi)] T . (5.4) The overall wrench acting on “T” is then wT = P∑ i=1 R∑ j=1 αijwij . (5.5) where wij = [dj(pi), pi × dj(pi)] T . With the above derivation which follows [43] the target is in force closure if 41 0 ∈ interior(convexHull [w11, . . . , wPR]). (5.6) One common quality measure is the minimal distance of the origin to the surface of the convex hull. Q = min ~w∈W ||~w||. (5.7) We will make use of it and in addition consider only painted facets on the target as reachable. Finally in the end a ranking metric can be used to order and consolidate alignment and grasp metrics. 5.3 Proposed Method A robotic assistant in microsurgery must be able to grasp objects held by a surgeon. All other issues aside we need a grasping method that will run in real-time and is capable of planning grasps over partially covered objects. Our proposed method is inspired by the use of pairwise features in computer vision. We observe that the search for a good grasp necessitates an efficient identification of correspondence (which finger goes where). The same problem is also present in computer vision and object alignment. Another inspiration has been the work done with the GraspIT simulator. By studying it one realizes that their use of preshapes and heuristics to obtain gripper approach vectors is an attempt to simplify the problem of how to align the gripper to the target to obtain a stable grasp. This is where we contribute novel work. For a given DOF parameterization of the gripper we find optimal geometric alignment between the gripper and the target. By maximizing the contact surface between the two we should obtain a subset of poses leading to a good grasp. In order to match 3D object of different parameterization we utilize pairwise feature vectors. These features are invariant to translation and rotation and therefore simplify the pose estimation problem. Please notice that unlike in vision we do not need more invariance and so designing the features is easier. Given an “N” finger gripper our method finds partial and full grasp alignments and works even if certain parts of the target object are occluded (i.e., surgeons hands). 42 Direct geometry alignment is an expensive proposition. Partial matching makes it even more difficult. Our method lifts the representation of T and G into an invariant space. The only requirement is that the facets we match be normalized to have approximately equal size. Such unit facet will ensure that we test the target face multiple times if there is room for the finger to move. The end result of alignment is to obtain the orientation ~R and the position ~P of the gripper wrist. Since multiple results are possible, we rank the results by means of an alignment quality metric. The quality metric we choose considers average distance within the estimated pose and the number of votes or correspondences reporting it: QA = V oteCount 1 +DistanceV ariance . (5.8) Figure 5.3: Design of features We now describe the construction of our feature vectors. As illustrated by Figure 5.3, we consider two unit facets sampled over the active surface of the gripper. Each facet will have a position pi and a normal vi. From this information we compute three quantities. The distance dij between two facet positions: dij = ||~pi − ~pj ||. (5.9) The cosine similarity φij between the normals ~vi and ~vj : φij = ~vi · ~vj . 
(5.10) As the third quantity we compute the cosine similarity between the normals and the line between the points pi and pj : βij = ~vi · (~pi − ~pj) ||~vi||||~pi − ~pj || . (5.11) 43 We now construct our feature vectors as: eij(π) = (dij , φij , βij + βji) T . (5.12) Note we could have kept βij and βji separate which would give us a directional feature. However when they become equal we loose the directionality and have to deal with ambiguity, so we assume the ambiguity is there by design. To find all possible combinations of “N” contacts of “G” over the geometry of “T” having “V” vertices including the partial matches we consider both “G” and “T” as point sets (centered on unit facets). Then the size of the search space is given by NC = N∑ i=1 ( V i ) . (5.13) if we sample the gripper active surface once per finger (i.e., tip of the finger). Our method obtains V (V−1)2 pairs for a target object “T” with “V” unit facets. For a gripper “G” sampled at “P” unit facets across all fingers we obtain P (P−1)2 pairs. One way to do pairwise matching is all against all. A more efficient way is to build a volume hierarchy of the target using Axis Aligned Boxes (AAB). For a well balanced binary tree the size of the search space becomes: C = ( ln V (V − 1) 2 ) × P (P − 1) 2 . (5.14) 5.3.1 Pose Recovery Once pairwise correspondences are found, we estimate orientation first and then posi- tion. When we say we have correspondence it also means we have two unit vectors ~A and ~B that are related by a rotation. ~B = Rq(θ) ~A. (5.15) A naive approach for orientation recovery would be to pick the three vectors a cor- respondence gives us (two normals and a difference) and invert the rotation matrix. We can obtain two possible rotations this way because we do not have point correspondence. However this requires matrix inversion and provides no means of dealing with singular cases (i.e., where two or more vectors are co-linear). Our approach is more efficient in 44 terms of computation and stability. We find six rotations from a single correspondence of which three should agree. To explain our approach we try to visualize rotations in Figure 5.4. Any rotation can be expressed using unit quaternions or versors. A versor encodes an axis of rotation and an angle. For example in figure 5.4 one axis of rotation goes through the south and north pole ~N . A set of rotations is depicted as a thick arc on the equator. One would think that to recover the rotation we could simply take the cross product of ~A and ~B. The problem is that an entire set of of versors can rotate A into B. Any quaternion lying on the great circle that travels half way between ~A and ~B will do the job. The versor having axis ~N rotates our points using the smallest angle. Versor ~C is the other extreme and rotates by 180 degrees. A pairwise correspondence gives us three vectors. For each correspondence we can register a great circle on a spherical map and then pick pairwise intersections with the most votes. The problem is that spherical to cartesian mapping is not uniform. In addition we would need two coordinates for the axis and one for the angle requiring a 3D accumulator and it is well known that Hough transform suffers from artifacts when the bins are too large. As an alternative we go a step further and actually find intersections between two great circles. Accounting for directional ambiguity pairwise correspondence gives us six vector pairings. Using these vector pairings we can verify versor agreements in angle magnitude. 
In case of singular cases which are unconstrained we revert to using the axis with the least angle (cross product). First we observe that the set of all versors rotating A into B can be expressed as: ~n = cos(t) ~N + sin(t)~C. (5.16) The intersection of two such great circles is given by: ~mij = ( ~Ni × ~AiBi)× ( ~Nj × ~AjBj). (5.17) In Figure 5.4 this axis is shown as the intersection between the red and black equator line and going through intersection points X ′ and X ′′ . To test if two circles really intersect at that point we must also check that the angle amounts are in agreement. For a given intersection point we calculate two angles one for each A,B pair: 45 A B N=AxB D=NxC C=A+B Figure 5.4: Pose recovery θi = ∠( ~Ai − ( ~Ai · ~mij) ~mij), ( ~Bi − ( ~Bi ~·mij) ~mij) cos(θi) = [~mij × (~mij × ~Ai)] · [~mij × (~mij × ~Bi)] |(~mij × ~Ai)||(~mij × ~Bi)| . (5.18) Please notice that two great circle intersect at two antipodal locations and both must be considered as valid. This also impacts how we determine if two angles are in agreement since the same rotation results if angles and axis are negated. We can eliminate axis ambiguity by choosing only upper hemisphere and flipping the angle appropriately. After that the sign of the angle only depends on the angle between the found axis ~mij and the cross axis ~Ni. The position of wrist is obtained in a second pass from the midpoint of the cor- respondence (btw constellation points). Given an identified rotation ~q = (θi, ~mij)Twe rotate the midpoint forward in gripper frame and then subtract it from its position in the target frame: ~P = ~PT−Rq(θ) ~PG. (5.19) In the end we could theoretically end up with more than C possible rotations. If that happens, we most likely do not have a good alignment. If we find correspondences for x contacts, then 3[x(x− 1)/2] rotations should agree on the same pose. 46 5.3.2 Practical Consideration In Figure 6.6 we present the pseudo code for our algorithm that summarizes our method. Tes s e l a t e T in to un i t f a c e s PT←ComputePairs (T) BVT←ComputeBoundingHierarchy (PT) Te s s e l a t e G ac t i v e su r f a c e in to un i t f a c e s foreach pi in (DOFSpace) PG←ComputePairs (G( p i ) ) foreach gPair in (PG) bPairs← gPair i n t e r s e c t BVT foreach bPair in ( bPairs ) q←EstimatePose ( bPair , gPair ) p← RecoverPos i t ion (q , bPair , gPair ) VoteFor (q , p ) in Q RankPose (Q) PrunePenetrat ion (Q) ComputeGraspQuality (Q) Figure 5.5: Pairwise feature alignment planner The structure of selected pairwise feature is where a lot of the application specific engineering occurs. For example directionality can be desirable. Another desirable property is that the facets we try to align be of approximately same shape and size (compatibility). Measures such as facet area or facet aspect ratio come to mind. For this paper we preprocess both surfaces by tessellating them into approximately equal triangles which we call unit facets. The next practical issue is the distance metric used to detect intersection of features, intersection of orientations and intersection of positions. For the recovery of position given an orientation we use the euclidean distance. The intersection of features is a bit more tricky. The feature elements could have different domains such as distances and angles. We chose to use a weighted distance metric and picked the weighting parameters by manual introspection. For orientation distance measure we opted for angle measurements. 
For example rotation (θi, ~mij)T is close to (−θi,− ~mij)T and so is (π, ~mij)T and (−π, ~mij)T . We present the algorithm as a nested loop to emphasize that it is local and sequential in nature and amenable to parallelism. In practice however we converted the loops into three passes. Pass one would accumulate correspondences, pass two would accumulate 47 rotation agreements and pass three would recover position given rotation. Post-processing involves actual validation of alignment against the grasping kine- matics and dynamics. Besides the actual grasping simulation we also perform inter- penetration testing. We picked a very simple feature to capture shape characteristics and it will not guard against all eventualities. Figure 5.6 shows an alignment that is impossible for rigid gripper and rigid body. Figure 5.6: Case for penetration test 5.4 Results The experimental validation of our method was performed in octave using OpenRAVE as the back-end server. OpenRAVE shares many of the features with GraspIT but appears to have larger audience. Besides integrating with the ROS (Robot Operating System) framework it allows scripting in Matlab/Octave, Python and C++. We performed our experiments in Octave to take advantage of computing facilities. In hindsight, after having implemented most of the numeric routines to support side-by-side comparisons, we should have used Python as it allows better structuring of code while still maintaining the benefits of a duck-typed high level interpreter language. To evaluate our grasp planning method we looked at several characteristics. We ex- amined how our approach to object alignment compares to already established methods. We also examined how surface sampling density affects the accuracy of the alignment process. Finally in the main experiment we looked at how geometric alignment is related to grasp quality. For the main experiment we selected two grippers and several objects and performed 48 grasping tasks over the entire DOF space. For grippers we selected the model of our own gripper (Figure 4.3) and that of the Barrett hand (Figure 5.1). Our own gripper represent what we would use in practice, and that is a relatively simple 1-DOF actuator. The Barret hand is a more generic and more complicated 4-DOF actuator but still a modest model in comparison with the human hand, which requires 20 or more degrees of freedom. As for objects (Figure 5.7) we selected some that are rather generic but complex with high polygon count and others which actually show up in day-to-day operation of a Scrub Nurse robot. Here again we notice that surgical instruments for most part are simple almost two dimensional shapes. The fact that we do not need a Barret hand and most instruments in Otolaryngology are flat simply means that our practical setup for Scrub Nurse is simplified. However using this initial simplicity to narrow down the problem would be ill-advised as a case of premature optimization. We mention this as a justification for our attempt to stay generic throughout this study. Please note that the issue of sampling arises in two contexts. In the main experiment we obtain the results by sampling evenly the DOF parameter space of the gripper. This top most process reflects the search over all gripper parametrization. Another context where sampling is mentioned is the representation of gripper and target as point clouds. In the main experiment the gripper and target are sampled randomly and uniformly. 
For the gripper we made sure that some samples came from each of the finger links. At each DOF sample we pick the highest alignment geometrically speaking and then evaluate what grasp quality we get. The grasp quality evaluation is based on equation (5.7). Configurations that clearly penetrate the object were disqualified. For the rest we would perform a fine tuning step during which we slightly perturb the wrist position and DOF values (within the sampling step). During the fine-tuning the best grasp quality is retained. The majority of our computation occurred on the client side but collision checks and link transformation states were queried from the simulator. We manually performed the target facet painting where it was feasible. For example the flask has a region which is unreachable at the top of the neck. 49 Figure 5.7: Selected targets of varying polygon count 5.4.1 Pose Alignment Validation The grasping method we present here relies heavily on geometric alignment. In that regard we have developed an alternative to existing methods. Our goal was a sim- pler, faster and possibly more accurate method than what is already available. The motivation came from our limited computing resources. To ascertain how our method compares with existing solutions we considered a number of proven alternatives. In Figure 5.8 you can review the results that were collected. For comparison we selected two widely used 3D-3D alignment methods by Arun and Horn [77, 78]. We first exam- ined how these methods performed globally when provided with a point cloud. Then we embedded them together with our method (PXT for pose intersect) in a RANSAC process (stochastic best-wins sampling). The RANSAC algorithm returned the best of ten hypotheses. We also had access to the ICP (iterative closest point) algorithm but its accuracy was worse than the others,it would frequently not identify the right point correspondences even for slightly perturbed clouds and its running time was significantly higher. It was therefore not included. All methods were subjected to increasing amount of Gaussian noise. The method labeled as ”hypVoid” is there for reference. It shows the error we obtain if the hypothesis is drawn from random translation and rotation. From it we see that at around 0.5 variance the results become useless. As we can see there is no systematic advantage of one method over the other. Both global methods perform almost identically. As expected, they yield better results, but our experiment assumes perfect correspondence and only level of noise varies. Compar- ative analysis of the runtime also shows that no method significantly outperforms the 50 others. One important point to make is that while we implemented our method entirely in Octave the other two methods make heavy use of optimized code for eigenvalue and singular value decomposition. We were tempted to use this as a sign that our method could be faster in addition to having lower complexity. However, after implementing all three methods in C++ we found that the timing did not change much. One reason is that in RANSAC setting both Arun and Horn method work with the minimal number of points and any complexity arising from use of EIG or SVD algorithms appears minimal. 
0 0.2 0.4 0.6 0.8 1 0 0.5 1 1.5 2 2.5 3 3.5 Noise Level (Variance) Co m bi ne d L2 E rro r ( R & T) hypVoid hypHorn87 hypArun87 hypRansacArun hypRansacHorn hypRansacPXT Figure 5.8: Alignment error 5.4.2 Selection Strategy of Pairwise Features for Grasp Planning For the alignment of 3D objects direct methods require knowledge of point correspon- dences. Other methods like ICP (Iterated Closest Point) can work without this informa- tion but since they solve two problems simultaneously there is a drawback. We belong to the first group. To help establish correspondence we make use of pairwise features but there is a problem. For a gripper with N points we obtain N(N − 1)/2 features. Similar number of features compute from the target and matching them becomes a problem quadratic in N . Also many of the features are redundant. Naturally, the ques- tion of selecting salient or good features imposes itself. To that end we formulated three 51 different feature selection strategies and compared them against the full all against all feature matching. In figure 5.9 we measured the alignment error against the size of selected features. We picked one of the targets and created a point cloud by sampling all facets into equal size. It was then rotated and translated by a random amount to represent create an alignment problem void of outliers. The original point cloud was always sampled uniformly with increasing sample size while the rotated cloud was sampled using one of the presented strategies. The features obtained this way on the two point clouds were then used to recover correspondence and then the pose. The feature selection strategies include: Completely, Randomly, Most Distant Feature and Most Orthogonal Feature. The complete selection strategy is to use all features. The random strategy samples randomly over the entire set. The most distant feature selection picks features that have the largest radius to the closest neighbor. Finally most orthogonal pair is an attempt to select features that have most orthogonal position and normal. As you can see the results show that selecting only a subset of all pairwise features is definitely justified. However our results show that neither of our heuristic methods significantly outperforms the random case. 0 0.05 0.1 0.15 0.2 0 100 200 300 400 500 E rr o r Randomly Most Distant Feature Most Orthogonal Pair Number of samples Completely Figure 5.9: Feature selection strategies 52 5.4.3 Grasp Planning Results We have already discussed algorithmic complexity and here we present two results that we feel support the premise of this paper. First in Figure 5.10 we show a projection of the gripper and target constellation in feature space. We have plotted side by side constellations of a number of grasp configurations (case 2-case 6) and contrast that with the constellation of the target (case 1). We used Multi Dimensional Scaling (MDS) to visualize the data as it is commonly done in machine learning. The intent of this experiment is to demonstrate that the gripper constellation has a degree of separability from the target constellation and that this degree depends to some extent on the parameterization. Simply put we have proposed here a method that quickly finds gripper to object alignment but we still have the problem of sampling the DOF space and searching it. As Figure 5.10 demonstrates there is separability, the search process can thus be optimized. If we can’t reduce complexity then at least we can ensure that the least amount of time is spent at each point. 
−1.5 −1 −0.5 0 0.5 1 1.5 −1 −0.5 0 0.5 1 1.5 2 Case 1 Case 2 Case 3 Case 4 Case5 Case 6 Figure 5.10: MDS Projection of gripper and target in feature space The second result follows in Figure 5.11. It establishes by empirical means the relationship between gripper-target alignment and the resulting grasp quality. The reasoning here is that if there exists proportionality 53 Figure 5.11: Quality of alignment vs quality of grasp between the two, we might be able to cast part of one problem in terms of the other as we have attempted to do here. While this transitivity might not have any benefits, it does provide one extra tool in pursuing the grasping problem. As can be seen from this figure that relationship is not as clear cut as we initially assumed. Instead of clear proportional relation we get a more measured behavior. It appears that a good grasp need not be optimally aligned but at the same time if we have a good alignment our chances of obtaining a good grasp are better. The results seem to suggest that a good alignment could indeed be used a heuristic in the search for a good grasp. Lastly, Figures 5.12-5.14 present a qualitative analysis of our work. They show the top grasps found by the method described here on the three sample targets. As we can see each grasp indeed agrees with intuitive expectations of a good grasp. Figure 5.12: An example of alignment grasps 54 Figure 5.13: An example of alignment grasps Figure 5.14: An example of alignment grasps 55 5.5 Conclusion We have presented here a method that addresses a sub-problem of the grasp planning problem. The sub-problem arises from the high dimensional search space in which a grasp planner must operate to find the approach direction and position of the gripper wrist relative to a target. Our initial assumption was that good grasps require good alignment. While we have not tried to reduce the search space spanned by the degrees of freedom (DOF) we have explored the separability of a gripper and a target in the feature space. The idea is that absent any means of reducing the DOF space we could construct a classifier that might quickly eliminate certain regions. This research was performed within the specific context of a human robot interface (HRI) needed in medical robotics. We believe that the majority of grippers used to- day including the Barrett Hand utilized here are rigid tools and as such more suitable for industrial tasks. Even if soft rubber fingers are attached the active surface is used passively and is blind. Robots capable of human interaction should feature soft, de- formable active surfaces. For one the softness prevents damage to the human subject (hand). Additionally the soft surface provides for embedding of sensor matrix needed for the increased dexterity requirements. It also generates shape conforming surface grasps. In this light even “overgrasps” such as Figure 5.6 can be considered as mildly valid. Future work will involve direct field studies in the operating room during which we intend to record how the surgeon interacts with an assistant at the haptic level (since his eyes are on the microscope). Besides collecting evidence for a touch based com- munication protocol this field study will also examine how the instrument is obscured (partial coverage) during the interaction. The results here and from such field studies will allow us to complete a novel shape conforming gripper for HRI being developed in our lab. 
Chapter 6 Visual tracking 6.1 Introduction In robotic applications vision is not the only way to sense the environment. Some of the other modalities include laser, radio-frequency (RF) tags, tactile sensing or sonar. Each modality performs differently in terms of accuracy, range or level of intrusion. A visual tracking system offers high versatility and accuracy. Furthermore, it is not intrusive to humans. For best adaptation rates a novel technology needs to introduce minimum on interference. For example one of the biggest obstacles of enabling robots like Da Vinci [1] is that they stand between the surgeon and the patient. In this position the robot undermines the surgeon’s hand-eye coordination, arguably their most important skill. The visual system we have in mind would track the hands of the surgeon and that information would be used for accurate and safe instrument delivery by the robotic nurse. This chapter is a study of how well existing methods perform and how they can be reliably integrated. We hope that these findings will be helpful in advancing future efforts to implement vision in assistive robotic settings. We examine both on-line and off-line tasks that are needed. The on-line tasks are signal processing operations that occur during robot use. Their purpose is to track an object. We assume that we have a suitably placed and calibrated camera, a feature detector that extracts essential statistics from each camera frame and a 3D model of the hands. Operating over the extracted features the first step is to identify correspondences 56 57 between image points and points on our model. After that we perform depth recovery to obtain 3D points in the camera frame. The final step is to align these points with the model frame. The recovered pose consisting of rotation and translation is then available to the robot motion planning to calculate a trajectory for instrument delivery. To achieve robust on-line tracking we must select good features to track. This se- lection process takes several steps and is done off-line. How we accomplish these steps frequently impacts the quality of the on-line process. A naive approach would be to manually collect parameters such as the intrinsic camera parameters and measurements of the model and then to assume that the same feature can be found from one frame to the next by using their proximity. We go a step further and examine how these tasks could be automated. We pose the task of feature selection as a clustering problem. We assume the number of clusters to be equal to the number of points being tracked. Each cluster is constrained to contain no more than one feature from every frame. Our goal is to retrieve clusters containing features that are in correspondence to the same point on the 3D model. The features are clustered across frames using their proximity in feature space. We call this task feature selection even though in computer vision the term is sometimes used to describe feature extraction. In our case we are given a feature detector such as SIFT [63] or SURF [62]. Our clustering in effect provides a tracking for each point simultaneously. Based on the estimate of the within-cluster feature variation we choose the most reliable points to track. We next investigate structure from motion methods and how they recover 3D shape from 2D observations with known correspon- dence. These methods make it possible to extract model description driven by available observations. In the best case we could skip manual measurements on the real object. 
A positive side effect of the projective reconstruction is that it also returns intrinsic camera parameters. After outlining and testing the essential tasks above we then proceed to evaluate the overall performance for a number of different tags that could be attached to the back side of the surgeon’s hand. The results can then inform our ultimate selection of markers used for the operating room. 58 6.1.1 Key Contributions The main result of this chapter is a vision system with a novel clustering method to select good features. The selection process is also supported by specially designed marker tags. Having demonstrated good tracking performance, the features obtained by our main method are then used to reconstruct 3D models in the training phase. The method to accomplish this task was specially designed for cases of missing correspondences. That is to say, in our case a matrix of 2D point correspondences will have missing values that need to be recovered in tandem with the 3D reconstruction. 6.2 Problem Statement In the field is Otolaryngology a common procedure is the operation of the inner ear. The surgery is performed under a microscope. The robot would replace the scrub nurse technician. Its primary function would be to manage instruments laid out on the table and deliver them from and to the surgeon. The instrument delivery is initiated by the surgeon but driven by the scrub nurse. The scrub nurse needs to find the stretched out hand and place or take an instrument. That is the reason why a simple 2D vision system is not sufficient [4]. We need a vision system that can find and track the pose of the surgeon’s hands. The operating room is a chaotic environment and occlusions are possible, especially coming from the microscope stand. We will assume that sufficient processing power is available for a high fidelity stationary camera. Other possibilities would be to mount the camera on the robot’s hand, use multiple cameras or a moving camera. The human hand presents a challenge for tracking. It is a complex quasi-rigid body consisting of multiple joints that can go in and out of occlusion. Parts of the hand are actively used by the surgeon and others could be engineered to hold special tags. The entire glove could be marked with patterns but during operation blood will cover up parts of it. For simplicity we decided to attach special geometric markers on the flat back side of the hand and consider that for tracking. Using flat segments could be extended to the back-side of fingers and even the palm or the wrist. Ultimately, this tiling approach can be generalized into a full deformable body treatment. Besides simplifying the approach, another advantage we might have is the sterility. The surgeon and the scrub nurse are 59 sterile. The other staff or objects might not be. In that regard an operating room is a controlled environment. Only sterile objects might come in-between the surgeon and the instruments. That should cut down on the sources of occlusion. While not the primary goal, the vision system should be extensible to allow for other functions such as obstacle detection and instrument identification. 6.3 Proposed Method In order to handle instruments a robot assistant must reliably identify and localize the instruments. The task of identification can be addressed by special RF tags, magnetic strips or coloring. We can precompute ahead of time instrument location on the tray. 
This leaves, as the only big problem, the localization of the instrument when it is held by the human subject. The robot must be able to estimate the orientation and position of the instrument so that an appropriate approach vector can be computed. Both of these tasks occur when the gripper is in close proximity to the instrument. Because of that we can utilize a variety of sensors such as tactile, magnetic, or pressure sensors. Before the proximity sensors can be used there still remains the issue of moving the robot closer. This problem requires the robot to localize the human hand and the tray from afar. Some sensor modalities commonly used in industrial settings such as lasers are not suitable because of human presence. For that reason the most ideal sensing modality would be vision. Our vision system follows well established patterns from available literature. In Figure 6.1 we present the main vision pipeline. We engineer the environment to the extent that we attach special tags imprinted with geometric patterns to the back of surgeon’s hand. Of course the assumption is that in the first step we need to localize the hands which need to receive the instrument. Given an image captured from a camera we extract suitable features. Initially we considered projective invariant features. One such invariant is the cross ratio [67, 101, 102, 103, 104]. We have spent considerable time pursuing this idea without much success. In the end we reverted to more proven features like SIFT and SURF. Any other feature could be substituted easily. A good feature would be highly discriminative and as robust or invariant to changes in viewing 60 conditions as possible. In the next step we use the discriminative property of features to find correspondence between a 3D point on a target model and the observed image evidence. Using the so obtained correspondence and geometric information from the model we can then recover 3D depth of each registered point. The methods for depth recovery have been studied extensively for the past 20 years and some of them have been known in photogrammetry for the past century. In our system we use three points of known correspondence to recover depths. After depth is recovered the last step is 3D alignment of point clouds. This step returns the final deliverable which is the translation and rotation of the target model. Feature Extraction Registration Figure 6.1: Computer vision pipeline The main pipeline might have some issues with robustness but given the right con- ditions it is accurate and all methods that are needed are available and have a rich body of literature to justify their use. Where we encountered the real problem is in target modeling. The problem is that we need to identify high-quality or salient features of the model. Frequently the image operations that produce features like SIFT or SURF are analytically intractable which means we can not predict what part of the model will carry features that are robust and can be tracked. So acquiring the right model with the best feature became the real issue to resolve. In Figure 6.2 we illustrate our existing answer to that problem. We film the target, in our case surgeon’s hands, and collect a stack of video frames. From each frame we extract any number of features of some type. Then a minimization loop is entered. In this loop we try to cluster features across different views that might correspond to the same point on the model. 
Having established correspondences between 2D points over a number of views, we can use projective reconstruction methods to estimate the position and orientation of each view.

Figure 6.2: Computer vision model acquisition (blocks: images of the same target from different views, feature extraction, clustering of corresponding features, projective reconstruction, reprojection error, error minimization, confidence)

There are two issues with the loop. Some features are not very robust and might disappear. In addition, some features might lack discriminative power and get assigned to the wrong point. Both of these issues lead to errors in the projective reconstruction. Hence the need to iteratively eliminate certain features and reconstruct the point correspondences. Once the reprojection error has dropped sufficiently, we are left with a number of surviving feature clusters. Each cluster represents a point on the 3D target model, and each feature in the cluster represents a different view. These surviving clusters are finally selected as our target model. Later, in the actual vision pipeline, we use nearest-neighbor search to match a new feature with a feature from the target model. After that we query the cluster id to identify which 3D point it represents. That 2D-3D correspondence is then what is needed for the final stage of the vision pipeline, as explained above.

6.3.1 Assumptions

As explained in the problem statement, we assume a stationary camera of sufficient fidelity. The camera model is the ubiquitous pin-hole camera:

p_{img} = \Pi(cam, P_T).    (6.1)

It projects 3D points P_T from the camera frame onto the image plane. The projection mapping \Pi is non-linear in general. The linear parameters are the focal lengths and image center f_x, f_y, c_x, c_y. Additional parameters are the radial and tangential distortions k_1, k_2, k_3, p_1, p_2. We assume the hand is represented by one or more planar patches that yield a sufficient number of features and are embedded on a rigid body. The 3D model is represented by a cloud of points whose frame is aligned with the wrist of the human hand:

P_{model} = \{P_1, \ldots, P_O, \ldots, P_N\}.    (6.2)

If both hands are allowed in the image, then they will use different tag patterns. The patterns on the tags are singular in appearance, distinct from anything else in the scene.

6.3.2 Features

We decided to use two popular features called SIFT and SURF. Both are represented by an image position, scale, orientation, and a 128-dimensional feature descriptor used in our distance computations:

F_{x,y,\sigma} = \{x, y, \sigma, dir, \vec{f}\}.    (6.3)

These features are obtained by computing statistics such as histograms in the vicinity of a key point. We chose them because they appear to satisfy most of the requirements for a good feature. A feature should ideally latch onto a characteristic of the hand and change little under different viewing conditions. The lighting conditions in an operating room are quite controlled, and the projective effects are the most prominent variation. A feature should also be local, to avoid drifting when parts of it are occluded. Finally, a feature should be unique and distinguishable, to enable easy establishment of correspondence. Please note a very important observation about features. Since we want to track singular patterns, we expect our features to be singular. If a feature arises from a repeating pattern, it will show up multiple times in the frame, and its radius to the closest neighbor will therefore approach zero. This observation about a feature's minimum radius within a frame is exploited later on.
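To make the assumed camera model concrete, the following is a small numpy sketch of the projection \Pi with the parameters named above (f_x, f_y, c_x, c_y and the distortion coefficients k_1, k_2, k_3, p_1, p_2). The distortion formulas are the standard Brown-Conrady ones used by, for example, OpenCV; the dissertation does not spell them out, so that choice is an assumption of this sketch.

import numpy as np

def project(P_T, fx, fy, cx, cy, k1=0.0, k2=0.0, k3=0.0, p1=0.0, p2=0.0):
    """Project a 3D point P_T (camera frame) onto the image plane, Eq. (6.1)."""
    X, Y, Z = P_T
    x, y = X / Z, Y / Z                      # normalized image coordinates
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return np.array([fx * x_d + cx, fy * y_d + cy])   # p_img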
Namely, we use this radius in testing for overlap between two features or between a feature and a cluster.

6.3.3 Depth Recovery

Assuming we have 2D-3D point correspondence, the next step is depth recovery. Without correspondence this stage suffers from combinatorial issues. By shifting the responsibility for correspondence to the features, we make this step much easier at the expense of the feature. For depth recovery we primarily use the three-point method for its simplicity [71]. The three-point method is based on the law of cosines. As illustrated in Figure 6.3, given three lines x\vec{a}, y\vec{b}, z\vec{c} emanating from the camera origin O and going through the 3D points A, B, C with side lengths d_1, d_2, d_3, the actual depth of the 3D triangle is recovered by solving three equations in three unknowns:

Figure 6.3: Recovering depth from three points

d_1^2 = (x\vec{a} - y\vec{b}) \cdot (x\vec{a} - y\vec{b}).    (6.4)
d_2^2 = (x\vec{a} - z\vec{c}) \cdot (x\vec{a} - z\vec{c}).    (6.5)
d_3^2 = (y\vec{b} - z\vec{c}) \cdot (y\vec{b} - z\vec{c}).    (6.6)

There are multiple solutions to this system. We pick the solution for which the covariance of the triangle sides is at its maximum:

\max \mathrm{Cov}([A - B, A - C], [x\vec{a} - y\vec{b}, x\vec{a} - z\vec{c}]).    (6.7)

6.3.4 Pose Alignment

Once we have recovered depth in the camera frame, we proceed to align the points to our model frame. The alignment process yields a rotation and a translation that explain the pose of the camera or the object when the image was taken. The pose alignment problem recovers the rotation R and translation T which transform a cloud of original points (model frame) into a cloud of transformed points (camera frame) as follows:

P_T = R P_O + T = [R, T] P_O.    (6.8)

In [79] we introduced a method that recovers the rotation (from which the translation follows) given only two correspondences i, j. Recall that all possible rotations from A_i to B_i are of the form

\vec{n} = \cos(t) \vec{N} + \sin(t) \vec{AB}.    (6.9)

The intersection of two great circles over the unit sphere of rotations is then given by

\vec{m}_{ij} = (\vec{N}_i \times \vec{A_i B_i}) \times (\vec{N}_j \times \vec{A_j B_j}).    (6.10)

If the angles for both correspondences i, j are close or equal, we have recovered the pose by intersection:

\theta_i = \angle(\vec{A}_i - (\vec{A}_i \cdot \vec{m}_{ij}) \vec{m}_{ij},\ \vec{B}_i - (\vec{B}_i \cdot \vec{m}_{ij}) \vec{m}_{ij}).    (6.11)

6.3.5 Selection of Good Features

Before we can proceed with tracking, we need a means of establishing correspondence between image frames. Frequently in the literature this step is overlooked, and the features are commonly tracked as long as they are spatially close to each other. We pose the problem of finding good features to track as a clustering problem, whereby the number of clusters is assumed equal to the number of points tracked. Each cluster is assumed to contain no more than one feature from each frame. The resulting clusters provide us with a means of establishing correspondences. The advantage of this approach is twofold. First, it does not assume that every feature is necessarily present in every frame. Secondly, by analyzing the clusters one can obtain better insight into how invariant and how discriminative the features used truly are.

Methods which simply use spatial proximity typically rely on the nearest neighbor. To illustrate potential problems with such methods, we present two examples in Figure 6.4. The lower example shows what happens to an observation (right frame) as we try to identify the closest neighbors in the model (left frame) using nearest neighbors in feature space. Clearly we must be able to reject an observation and not try to assign all of them.
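For illustration, the system (6.4)-(6.6) can also be solved numerically. The sketch below hands the three residuals to a generic root finder instead of the dedicated three-point algorithm cited in [71], so it should be read as an assumption-laden stand-in rather than the method used in the system. The rays a, b, c are unit vectors through the observed image points, and d1, d2, d3 are the known side lengths of the model triangle.

import numpy as np
from scipy.optimize import fsolve

def recover_depths(a, b, c, d1, d2, d3, guess=(1.0, 1.0, 1.0)):
    """Solve Eqs. (6.4)-(6.6) for the depths x, y, z along the rays a, b, c."""
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))

    def residuals(depths):
        x, y, z = depths
        e1 = np.dot(x * a - y * b, x * a - y * b) - d1 ** 2
        e2 = np.dot(x * a - z * c, x * a - z * c) - d2 ** 2
        e3 = np.dot(y * b - z * c, y * b - z * c) - d3 ** 2
        return [e1, e2, e3]

    return fsolve(residuals, guess)

Because the system admits multiple solutions, in practice one would restart from several initial guesses and keep the candidate favored by the covariance criterion (6.7).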
The upper example shows what happens if we try to infer a distribution over the model features by means of a radius of rejection. In this case the radius of rejection is given by the distance to the closest neighbor. Granted, we obtain fewer correspondences, but the quality of the result is much improved.

Figure 6.4: Feature correspondence (with and without radius rejection)

The goal in our approach is to cluster all features across a stack of frames if they correspond to the same point on the model. Our clustering scheme (6.6) starts by computing the minimum radius for each feature within a frame, which is given by the distance to the nearest neighbor. This radius can be considered a crude estimate of the underlying cluster given only the current frame. It will be almost zero for any feature that is repeated and greater than zero for distinguishable features. We repeatedly try to grow clusters until we have assigned all points to a cluster; this is accomplished by iterating over all frames. Each cluster is grown by adding the closest feature that also overlaps the cluster. The overlap is determined by considering the values and radii of the features respectively. A feature overlaps the cluster if it overlaps any of the cluster points already added. We observe that for singular features there can be at most one feature per frame, so we restrict each cluster to include only one feature from any given frame. Finally, we note that as a cluster grows, it might find that the closest point already belongs to a different cluster. In that case the point is moved to the new cluster and the old cluster is reprocessed in the next cycle.

Our clustering scheme is illustrated in Figure 6.5. As stated, we are trying to connect features across multiple frames by testing for spherical overlap. The cluster grows by adding points using the minimum distance. This process gives rise to minimum spanning trees. On the right side we see that any new point is added because it was closest to an older member of the cluster. We record for each cluster member the minimum distance that was used when deciding whether or not to include it. This distance is later used to decide if a point should be moved to another cluster. If a point is moved to another cluster, any points that were added because of it will also end up in the new cluster and therefore maintain the minimum-spanning property.

Figure 6.5: Clustering process (across frames and within cluster).

6.3.6 Projective Reconstruction

Which point on the model is tracked depends, to a great extent, on the best features selected by the clustering stage. This fact might preclude us from acquiring the model description ahead of time. Instead, the selected features will dictate what is relevant. This situation presents an opportunity to automate an error-prone and manual task. At issue is the prospect of measurement error for each and every point that is recorded by hand. Projective reconstruction methods offer a solution here. The reconstruction methods also have the benefit of providing us with the camera calibration parameters.

Clustering scheme 6.6 (pseudocode):

points.clear()
clusters.clear()
# not all features appear in every frame
for feature in frames:
    feature.computeMinRadius(ownframe)
    feature.setCluster(null)
    points ← points ∪ feature
done ← False
while (done ≠ True):
    # seed cluster with one point or restart first open
    cluster ← seedNewOrOpen()
    if cluster is Empty:
        done ← True
        cluster.close()
    while done ≠ True and cluster.isGrowing():
        next ← cluster.findClosest()
        if hasOverlap(cluster, next) = False: skip
        prev ← next.getCluster()
        dNew ← cluster.distanceTo(next)
        dOld ← prev.distanceTo(next)
        if (dNew
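The pseudocode above relies on two helpers: the within-frame minimum radius of a feature and the overlap test between features. Below is a minimal runnable sketch of those helpers; the function names, the use of plain numpy arrays, and the sphere-intersection reading of "overlap" (distance no greater than the sum of the two radii) are assumptions for illustration, not the dissertation's own code.

import numpy as np

def min_radius(frame_descriptors, index):
    """Distance from feature `index` to its nearest neighbor within the same frame.

    Repeated (non-singular) patterns yield near-zero radii and are thereby
    effectively rejected, as argued in the text.
    """
    D = np.asarray(frame_descriptors, dtype=float)
    d = np.linalg.norm(D - D[index], axis=1)
    d[index] = np.inf                      # ignore the feature itself
    return float(d.min())

def has_overlap(f1, r1, f2, r2):
    """Spherical overlap test: the descriptor-space balls of two features intersect."""
    return np.linalg.norm(np.asarray(f1, dtype=float) - np.asarray(f2, dtype=float)) <= r1 + r2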