Cooperation Proposal between IUB Robotics and BioART at Imperial College




Stefan Markov
s.markov@iu-bremen.de
January 16, 2006

Abstract

The project aims to establish cooperation between the International University Bremen Robotics group (IUB Robotics) and the Biologically Inspired Autonomous Robots Team (BioART) at Imperial College in the fields of 3D world modeling and robot action planning, recognition, and imitation. Current work at IUB Robotics focuses on the problem of autonomously creating a 3D representation of an unstructured indoor environment with a mobile robot using the reproductive perception paradigm [4, 3]. The architecture being developed at IUB can extend the system designed by BioART for robot action perception and imitation [6, 9] by alleviating the constraints it imposes on the environment. At the same time, the 3D world modeling framework can be improved significantly by coupling it with the attentive mechanism that is part of the architecture developed at BioART [18]. Such an exchange of knowledge will benefit both research groups considerably, resulting in superior performance of their currently developed systems.

1 Introduction

Imitation is a fundamental and efficient way of acquiring knowledge about the world, extensively employed by humans and animals. Learning through imitation has also been applied in robotics, to teach robots to perform tasks by demonstration instead of lengthy and, most of the time, difficult programming [15, 17, 22, 8]. At the heart of the ability to imitate lies a mechanism that matches perceived external behaviors with equivalent internal ones [6]. This can be achieved by attributing mental states to people or other robots. The simulation theory is one of the most widely researched theories of how we attribute mental states [13, 12].
According to it, people use their own mental processes and resources as manipulable models of other people's minds, taken off-line and run in simulation with states derived from taking the perspective of another person. The Biologically Inspired Autonomous Robots Team (BioART) at Imperial College has applied exactly this theory to action recognition and imitation through the use of internal inverse and forward models [8, 7]. Furthermore, they have enhanced their architecture with an attentive mechanism, which processes the visual scene as perceived by a camera and distributes the limited computational resources to the behaviors that rightly require them [18]. One fundamental problem must be solved for the imitation and recognition architecture to succeed: if the robot is to be embedded in its environment and attribute mental states to another robot or a human, it must be capable of taking the perspective of

the observed being, which means that it needs to create a three-dimensional representation of its environment. This problem is widely researched in different fields such as computer graphics [20, 21], pattern matching [1, 2], extraction of 3D objects from a visual scene [10, 16], and many others. All systems developed so far impose significant restrictions on the environment or assume the existence of helping markers or beacons in order to tackle the problem in meaningful time. Current work at the International University Bremen Robotics group (IUB Robotics) aims to develop an architecture for creating an online 3D model of the robot's unstructured environment using the reproductive perception paradigm [3, 4], thus alleviating the constraints imposed on the environment, such as the presence of high-confidence reference points, domain-specific cues like sets of fixed object features, and the absence of occlusions. This framework can then be employed in the action recognition and imitation system to provide the basis for perspective taking and, later, even for action planning and execution. Conversely, the 3D modeling system can be coupled with the attentive mechanism designed at BioART to cut computational time significantly. This is the goal of this cooperation project, and its separate components are explained in detail in the following sections. The rest of the proposal is organized as follows: section 2 provides a detailed overview of the recognition and imitation architecture developed at BioART, section 3 describes the work done at IUB Robotics in the field of 3D world modeling, and section 4 describes the cooperation project, which will improve the performance of both frameworks.
2 Biologically Inspired Architecture for Action Recognition and Imitation

The neuroscientific discovery of the so-called mirror neurons in area F5 of the macaque monkey premotor cortex, which are active both when performing and when observing the same action [11], has led to the proposition that the motor system of animals and humans is actively involved in the process of action recognition [19]. BioART has used this discovery as the inspiration for a new approach to the problem of action imitation. They model this process through coupled inverse and forward control models. Inverse models, also called behaviors, take as input the current system state and a goal state and output the system dynamics that will reach the goal state. Conversely, forward models receive as input the current state and the dynamics acting on the system and return the predicted next state of the system [14].

Figure 1 shows the main part of the action recognition and imitation architecture. The fundamental building block is an inverse model coupled with a corresponding forward model. Each inverse model has a different goal that defines its semantics, for example "grasp object", "move towards object", or "close gripper". An execution of a behavior proceeds as follows: first, the current state of the system is fed to the inverse model implementing the behavior; the inverse model then outputs motor commands that will reach the goal, and these are sent both to the motors and to the forward model, which predicts the next state; the prediction is fed back to the inverse model to adjust certain parameters (e.g. the speed of movement). The same architecture can be used for action recognition as well. In this case all behaviors run in parallel, but the resulting motor commands are inhibited from reaching the motor system.
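As an illustration only, the execution cycle of a single coupled pair can be sketched as follows. The one-dimensional "plant", the proportional behavior, and all numbers are invented stand-ins, not BioART's implementation:

```python
# Toy sketch of one inverse/forward pair in execution mode.

def inverse_model(state, goal, gain=0.5):
    """Behavior: output a motor command expected to drive the state to the goal."""
    return gain * (goal - state)

def forward_model(state, command):
    """Predict the next state from the current state and the issued command."""
    return state + command

def execute_behavior(state, goal, steps=20, tol=1e-3):
    for _ in range(steps):
        command = inverse_model(state, goal)
        predicted = forward_model(state, command)
        # The prediction is fed back to the behavior; here it simply becomes
        # the next state, standing in for both the plant and the feedback loop.
        state = predicted
        if abs(goal - state) < tol:
            break
    return state
```

With a gain of 0.5 the error halves at each step, so `execute_behavior(0.0, 1.0)` converges to within the tolerance of the goal in about ten iterations.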
The prediction of the forward models is then compared to the actual next state of the demonstrator, and the difference is used to reward or punish the corresponding inverse model, yielding its confidence. At the end of the simulation, the most confident model is the one chosen by the architecture to represent the demonstrated action. It is important to note that when the robot executes an action, the current state fed to the inverse models is its own proprioceptive state, while when trying to recognize an

Figure 1: Architecture for action recognition and imitation [6]

action (in order to imitate it afterwards, for example), the state of the demonstrator is fed to the system, i.e. the robot takes the perspective of the demonstrator. This is achieved by using forward and inverse visual models. Forward visual models take as input a visual scene together with object or feature descriptions and output state information about those objects (for example, position and orientation). In other words, they are a model of the visual process. Inverse visual models, on the other hand, receive as input visual descriptions of objects and features together with their states and output a visual scene corresponding to those descriptions. Once these models are available, the problem of taking the perspective of the demonstrator can be solved as follows: first, the forward visual models receive camera input and return the objects present in the scene and their states, i.e. a 3D representation of the observed world; then a viewpoint transformation can be computed from the 3D representation, thus acquiring the perspective of the observed being, which is realized as a new visual scene using inverse visual models [14]. Finally, we should note that constructing forward visual models is a very difficult and heavily researched problem. To make the problem tractable and allow for fast processing, BioART has simplified the environment in which the robot operates by placing color markers on the objects of interest and setting initial parameters (such as the distance from a reference object). The framework designed at BioART is already quite computationally intensive, hence an attention mechanism has been developed that distributes the computational resources only to those behaviors that rightly deserve them.
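Schematically, perspective taking composes the two kinds of visual model around a viewpoint transformation. The sketch below uses a one-dimensional scene and invented function names and encodings; it only illustrates the data flow, not BioART's representations:

```python
# Perspective taking = inverse_visual(viewpoint_transform(forward_visual(scene))).

def forward_visual_model(scene):
    """Visual scene -> object states (here: positions on a line)."""
    return {name: pos for name, pos in scene}

def viewpoint_transform(states, observer_pos):
    """Re-express object positions relative to the demonstrator's viewpoint."""
    return {name: pos - observer_pos for name, pos in states.items()}

def inverse_visual_model(states):
    """Object states -> a visual scene corresponding to those states."""
    return sorted(states.items(), key=lambda kv: kv[1])

def take_perspective(scene, demonstrator_pos):
    states = forward_visual_model(scene)
    return inverse_visual_model(viewpoint_transform(states, demonstrator_pos))

# A ball 3 units from us is 1 unit from a demonstrator standing at position 2.
scene = [("ball", 3.0), ("cup", 1.0)]
demo_view = take_perspective(scene, demonstrator_pos=2.0)
```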
The idea behind the attentive mechanism is to combine the saliency of top-down (or goal-directed) elements, based on multiple hypotheses about the demonstrated action, with the saliency of bottom-up (or stimulus-driven) components [18]. In this way, features in the visual scene such as color, objects, and motion influence which inverse models will be simulated at each iteration (for example, there is no need to simulate the inverse model "pick orange" if there is no orange in the scene), and at the same time the behaviors can influence the weights of the different low-level features so as to serve them most appropriately (for example, increasing the weight of motion features when the behavior expects motion).

To sum up, BioART has developed a biologically inspired architecture for action recognition and imitation and has coupled it with an attentive mechanism to cut computation time significantly while preserving the performance of the system. They have solved the crucial problem of taking the perspective of the demonstrator by adding color markers to important objects in the scene. In the next section we will see a much more general approach to the problem of forward visual models, one that does not put such constraints on the environment.

3 3D World Modeling Using Reproductive Perception

Current work at IUB Robotics aims to develop an algorithm that learns a three-dimensional representation of an unstructured indoor environment with an autonomous mobile robot, using the reproductive perception paradigm and an evolutionary algorithm. In the traditional understanding of perception, input comes from the sensors, is then processed (a step called sensing), and is finally perceived, i.e. meaning is assigned to the initial sensory input. By contrast, in reproductive perception the perception part tries to generate data and match it to the input arriving from the sensors in the sensing part. If the data is matched, the system has developed an internal model of the perceived environment. Figure 2 illustrates this idea [3].

Figure 2: The Reproductive Perception Paradigm

The architecture being developed at IUB Robotics is based on the above-described paradigm and uses the Virtual Reality Modeling Language (VRML) to encode a metric 3D representation of the environment. Inspired by previous successful work on real-time learning of 3D eye-hand coordination with a robot arm [3], the evolutionary algorithm tries to generate VRML code (the candidate models) that reproduces the vast amounts of sensor data. This corresponds to the perception part depicted in figure 2. A special metric, which represents the fitness of each model, is then used to compare the similarity of the sensor input with a candidate model.
The complete learning architecture is shown in figure 3. The global 3D model, i.e. the model of the environment, is represented in VRML. The current sensor data is gathered in a 3D mesh, where it is matched against data generated from a small population of candidate VRML representations of the local neighborhood. This population is seeded with representation data previously gathered in the global model. The candidate representations are adapted via a fitness function for a few iterations, and a refined representation of the local neighborhood is stored back into the global model while the robot moves on. The fitness function is crucial to the performance of the learning algorithm. Unlike other

Figure 3: The proposed modeling architecture

approaches, which use the Hausdorff distance or detect correspondences between invariant features and apply transformation metrics to them, we employ the similarity function described in [3], which has several significant advantages. First of all, it can be computed very efficiently. Second, it is not restricted to particular primitives such as polygons. Furthermore, it operates on the level of raw data points, hence it needs no preprocessing stages for feature extraction, which cost additional computation time and are extremely vulnerable to noise and occlusions. As pointed out in the introduction, this architecture should be able to cope with many of the current constraints of similar systems. The main advantages of the modeling system are that a volumetric representation of objects is used, i.e. we directly learn 3D VRML code, and that the fitness function needs no expensive preprocessing.

4 Merging Ideas from Both Research Groups

The architectures presented in this proposal share, to a large degree, similar philosophies. Both actively use previous knowledge to perceive new sensory input. In the case of the action recognition and imitation system, this knowledge consists of already known inverse models, while for the 3D modeling framework it consists of recently learned models of the environment and of objects possibly present in the scene. Furthermore, each architecture can be augmented with components from the other. First of all, the modeling architecture can be directly coupled with the action recognition and imitation system to allow for greater embeddedness of the robot in real-life scenarios. As discussed in section 2, BioART has simplified the problem of forward visual models by placing color markers on important objects and keeping a reference object in the scene. This limits the applicability of the system, as real-life environments do not exhibit such properties.
On the other hand, the system being developed at IUB Robotics aims to solve exactly these problems, hence it can serve as a forward visual model. Moreover, by using VRML for the metric 3D representation of the environment, the problem of taking the perspective of the observed being becomes trivial once the environment has been learned: it is just a matter of a standard viewpoint transformation within the 3D model. Lastly, the modeling architecture can be used for robot navigation within more complex environments, further improving the performance of the action recognition and imitation framework.
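Concretely, such a viewpoint transformation amounts to re-expressing every point of the learned model in the demonstrator's camera frame, p_cam = R^T (p_world - t) for a camera at pose (R, t). A minimal sketch with an assumed pose and illustrative numbers:

```python
import math

def to_demonstrator_frame(p_world, R, t):
    """p_cam = R^T (p_world - t), for a camera at pose (R, t) in the world."""
    d = [p - ti for p, ti in zip(p_world, t)]
    # Multiply by R transposed: (R^T d)_j = sum_i R[i][j] * d[i].
    return tuple(sum(R[i][j] * d[i] for i in range(3)) for j in range(3))

# Demonstrator standing at (1, 0, 0), rotated 90 degrees about the z axis,
# so its viewing x axis points along the world y axis.
c, s = math.cos(math.pi / 2), math.sin(math.pi / 2)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
p_cam = to_demonstrator_frame((1.0, 1.0, 0.0), R, (1.0, 0.0, 0.0))
```

The world point (1, 1, 0) lies one unit along the demonstrator's own x axis, so it maps to approximately (1, 0, 0) in the demonstrator's frame.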

While aiming to provide internal models of unstructured environments, the modeling framework is undoubtedly computationally intensive. Computational costs and resource usage can be cut down significantly through the use of an attention mechanism like the one in the system developed by BioART. The modeling system can be hinted by low-level salient visual features about which regions of the scene contain objects, or about their sizes. Furthermore, the motion feature map can be used to direct the evolution of the population by creating a bias towards transformations that follow the detected movements. This can be achieved by shaping the initial population of models that the algorithm evolves. Although the introduction of such an attentive mechanism calls for some initial preprocessing, which will slow the modeling framework, its benefits should be significantly greater, since experience has shown that evolutionary algorithms perform better when the quality of the initial population is good [5].

5 Conclusion

This proposal outlined the work done at IUB Robotics and BioART in the fields of 3D world modeling and action perception and imitation, and presented a cooperation project which, through the combination of components from both architectures, will lead to their superior performance. More specifically, the work done at IUB Robotics can be directly embedded in the architecture of BioART to provide more general visual perception and the execution of more complex tasks. At the same time, the attentive mechanism of BioART can be coupled with the modeling framework to cut down computation time significantly. Such an exchange of knowledge will not only benefit both research groups by improving their currently developed systems, but could also be the beginning of a continuing cooperation between the two teams.

References

[1] N. Amenta, M. Bern, and M. Kamvysselis. A new Voronoi-based surface reconstruction algorithm.
In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 415-421. ACM Press, 1998.

[2] C. L. Bajaj, F. Bernardini, and G. Xu. Automatic reconstruction of surfaces and scalar fields from 3D scans. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 109-118. ACM Press, 1995.

[3] A. Birk. Learning geometric concepts with an evolutionary algorithm. In Proc. of the Fifth Annual Conference on Evolutionary Programming. The MIT Press, Cambridge, 1996.

[4] A. Birk. Learning of an anticipatory world-model and the quest for general versus reinforced knowledge. In First International Conference on Computing Anticipatory Systems. AIP Press, 1998.

[5] A. Birk and W. J. Paul. Schemas and genetic programming. In Ritter, Cruse, and Dean, editors, Prerational Intelligence, volume 2. Kluwer, 2000.

[6] Y. Demiris and G. Hayes. Imitation as a dual route process featuring predictive and learning components: a biologically plausible computational model. In Imitation in Animals and Artifacts, chapter 12, pages 327-361. MIT Press, 2002.

[7] Y. Demiris and M. Johnson. Distributed, predictive perception of actions: a biologically inspired robotics architecture for imitation and learning. Connection Science, 15(4):231-243, 2003.

[8] Y. Demiris and M. Johnson. Hierarchies of coupled inverse and forward models for abstraction in robot action planning, recognition and imitation. In Proc. AISB 2005 Third International Symposium on Imitation in Animals and Artifacts, 2005.

[9] Y. Demiris and B. Khadhouri. Hierarchical attentive multiple models for execution and recognition of actions. Robotics and Autonomous Systems, to appear, 2005.

[10] C. Dorai, G. Wang, A. K. Jain, and C. Mercer. Registration and integration of multiple object views for 3D model construction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):83-89, Jan. 1998.

[11] V. Gallese, L. Fadiga, L. Fogassi, and G. Rizzolatti. Action recognition in the premotor cortex. Brain, 119:593-609, 1996.

[12] V. Gallese and A. Goldman. Mirror neurons and the simulation theory of mind-reading. Trends in Cognitive Sciences, 2(12):493-501, 1998.

[13] R. Gordon. Simulation vs. theory-theory. In R. A. Wilson and F. Keil, editors, The MIT Encyclopedia of the Cognitive Sciences, pages 765-766. MIT Press, 1999.

[14] M. Johnson and Y. Demiris. Perspective taking through simulation. In Proc. TAROS, pages 119-126, 2005.

[15] M. Kaiser and R. Dillmann. Building elementary robot skills from human demonstration. In Proc. of the IEEE International Conference on Robotics and Automation, 1996.

[16] B. Kamgar-Parsi and B. Kamgar-Parsi. Algorithms for matching 3D line sets. IEEE Trans. Pattern Anal. Mach. Intell., 26(5):582-593, 2004.

[17] S. B. Kang and K. Ikeuchi. Toward automatic robot instruction from perception - mapping human grasps to manipulator grasps. IEEE Transactions on Robotics and Automation, 13(1):81-95, 1997.

[18] B. Khadhouri and Y. Demiris. Compound effects of top-down and bottom-up influences on visual attention during action recognition. 2005. http://www.iis.ee.ic.ac.uk/ y.demiris/publications.html

[19] A. N. Meltzoff and J. Decety.
What imitation tells us about social cognition: a rapprochement between developmental psychology and cognitive neuroscience. Phil. Trans. of the Royal Society of London, 358:491-500, 2003.

[20] S. W. Wang and A. E. Kaufman. Volume-sampled 3D modeling. IEEE Computer Graphics and Applications, 14(5):26-32, Sept. 1994.

[21] A. Watt. 3D Computer Graphics, third edition. Addison-Wesley, 2000.

[22] M. Yeasin and S. Chaudhuri. Automatic robot programming by visual demonstration of task execution. In Proc. International Conference on Advanced Robotics (ICAR), pages 913-918, 1997.