
AUGMENTED REALITY METHODS AND ALGORITHMS FOR HEARING AUGMENTATION

by

Julie Carmigniani

A Thesis Submitted to the Faculty of
The College of Engineering and Computer Science
In Partial Fulfillment of the Requirements for the Degree of
Master of Science

Florida Atlantic University
Boca Raton, Florida
December 2011


ACKNOWLEDGEMENTS

I would like to begin by offering my sincerest gratitude to my advisor, Dr. Borko Furht. His persistence, encouragement, and advice proved vital in the steps leading up to this point in my education as well as my career. Additionally, Dr. Oge Marques, who taught me in more than one class, proved himself a remarkable professor, and for that I am very appreciative. I would also like to express my deepest appreciation to my parents, Vincent and Marie-Christine Carmigniani, whose constant support, love, and trust propelled me to this point in my life. Finally, I would like to thank my friend, Tegan Story-Llewellyn, for her friendship, support, and advice that have meant more to me than I could ever say.

ABSTRACT

Author: Julie Carmigniani
Title: Augmented Reality Methods and Algorithms for Hearing Augmentation
Institution: Florida Atlantic University
Thesis Advisor: Dr. Borko Furht
Degree: Master of Science
Year: 2011

While new technologies are often used to facilitate regular people's lives, their potential for helping disabled people is often overlooked. Augmented reality, one of the newest state-of-the-art technologies, offers users the opportunity to add virtual information to their real-world surroundings in real time. It also has the potential to augment not only the sense of sight, but other senses such as hearing. Augmented reality could therefore be used to complement a user's missing sense. In this thesis, we study augmented reality technologies, systems and applications, and suggest the future of AR applications. We explain how to integrate augmented reality into iOS applications and propose an augmented reality application for hearing augmentation using an iPad2. We believe mobile devices are the best platform for augmented reality as they are widespread and their computational power is rapidly growing to be able to handle true AR applications.

AUGMENTED REALITY METHODS AND ALGORITHMS FOR HEARING AUGMENTATION

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: INTRODUCTION
    Background
        Introduction to Augmented Reality
        History and Background of Augmented Reality
    Motivation
    Thesis Contribution
    Thesis Organization

CHAPTER 2: AUGMENTED REALITY TECHNOLOGY, SYSTEMS AND APPLICATIONS
    Augmented Reality Technologies
        Computer Vision Methods in Augmented Reality
        AR Devices
            Displays
            Input Devices
            Tracking
            Computer
        AR Interfaces
            Tangible AR Interfaces
            Collaborative AR Interfaces
            Hybrid AR Interfaces
            Multimodal AR Interfaces
        AR Tool Kits
    Augmented Reality Systems
        AR General Systems
        AR Mobile Systems
            Socially Acceptable Systems
            Personal and Private Systems
            Tracking Technology
        Mobile AR Systems Design in this Thesis
    Augmented Reality Applications
        Advertising and Commercial AR Applications
        Entertainment and Education AR Applications
        Medical AR Applications
        Mobile (iPhone) AR Applications
    Augmented Reality's Future Direction

CHAPTER 3: USING THE IOS PLATFORM TO BUILD AN AUGMENTED REALITY APPLICATION
    Motivation
    First Phase Design
        Picture View Controller
    Final Phase Design
        AV Foundation Framework Implementation

CHAPTER 4: DESIGN OF AN AUGMENTED REALITY HEARING APPLICATION
    Motivation
    Design
    OpenEars Speech Recognition System
        ARPA Language Model
        Lmtool Language Modeler
        Pocketsphinx Library
    OpenCV Face Detection Algorithm
        OpenCV Library Brief Introduction
        Face Detection Algorithm Viola-Jones Method
    Prototype Results and Screenshots

CHAPTER 5: SUMMARY AND FUTURE RESEARCH DIRECTION
    Summary
    Future Work

REFERENCES

LIST OF TABLES

Table 2.1: Comparison of different techniques for different types of displays
Table 2.2: Comparison of common tracking technologies
Table 2.3: Abbreviations
Table 2.4: AR System Comparison
Table 2.5: Systems' preferred types of technologies
Table 4.1: OpenCV Portability Guide for Release 1.0 [BrKa08]
Table 5.1: iOS API/Library used

LIST OF FIGURES

Figure 1.1: Milgram's Reality-Virtuality Continuum [MiKi94]
Figure 1.2: Ivan Sutherland's Head-Mounted Display [IS09]
Figure 2.1: Point constraints for the camera pose problem [AbMa08]
Figure 2.2: HMD Display [ReSc03]
Figure 2.3: Handheld displays [WaSc06]
Figure 2.4: Example of SAR [SaMo]
Figure 2.5: MINI Advertisement
Figure 2.6: Right picture: a virtual motorcycle prototype (on the right) next to a physical motorcycle prototype (on the left) in the real environment; left picture: virtual prototype in a virtual environment [SaMo]
Figure 2.7: Virtual factory prototype [SaMo]
Figure 2.8: Virtual office prototype [SaMo]
Figure 2.9: Virtual testing of a lighting system [SaMo]
Figure 2.10: User trying on virtual shoes in front of the Magic Mirror [SaMo]
Figure 2.11: Cisco's AR commercial where a client is trying on clothes in front of a "magic" screen
Figure 2.12: Augmented reality view of Dashuifa [HuLi09]
Figure 2.13: Visitor with guidance system [MiMe08]
Figure 2.14: Mobile phone-enabled guidance in a museum [BrBr07]
Figure 2.15: ARCC [CoKe04]
Figure 2.16: Bichlmeier et al. system for viewing through the skin [BiWi07]
Figure 2.17: Sequential screenshots of knot tying task [AkRe06]
Figure 2.18: Wikitude Drive
Figure 2.19: Firefighter 360 app
Figure 2.20: Le Bar Guide app
Figure 2.21: From top to bottom and left to right: DARPA's contact lens project [Wi08], MIT's Sixth Sense [MiMa09], Contactum's AR solutions [Co06]
Figure 2.22: From top to bottom and left to right: Examples of futuristic augmented reality and Babak Parviz's contact lens [Pa09]
Figure 3.1: itranslatar first phase design picture translation general system
Figure 3.2: itranslatar - Screenshots
Figure 3.3: Image operation set up snapshot
Figure 3.4: Image processing for OCR snapshot
Figure 3.5: itranslatar final phase design video feed translation general system
Figure 3.6: AV Foundation framework video frame analysis set up snapshot
Figure 4.1: ihear General System Design
Figure 4.2: Speech detected snapshot
Figure 4.3: Speech detected and frame analyzed for face snapshot
Figure 4.4: Frame processing snapshot
Figure 4.5: ARPA file format
Figure 4.6: General Speech Recognition System
Figure 4.7: Face detection method
Figure 4.8: Creating an OpenCV image object from an Objective-C image object
Figure 4.9: OpenCV Basic Structure
Figure 4.10: OpenCV Detector Algorithm Architecture
Figure 4.11: Haar features from the cascade files that ship with OpenCV Haarcascade Frontalface Default [Ha10]
Figure 4.12: Haar Features
Figure 4.13: Rectangle with coordinates A(x1, y1), B(x2, y2), C(x3, y3) and D(x4, y4)
Figure 4.14: AdaBoost classifier cascade as a chain of filters [He07]
Figure 4.15: ihear ready to listen with "This is my application" speech detected
Figure 4.16: ihear recognized "This is working well"
Figure 4.17: ihear Screenshots: Demonstration of face detection
Figure 4.18: ihear Screenshots: Demonstration of translation
Figure 4.19: ihear Screenshots: Demonstration of translation

1. INTRODUCTION

1.1. BACKGROUND

INTRODUCTION TO AUGMENTED REALITY

Augmented Reality, also commonly referred to as AR, is defined as a real-time direct or indirect view of a physical real-world environment that has been enhanced or augmented by adding virtual computer-generated information to it [CaFu11]. AR is interactive, registered in 3D, and combines real and virtual objects. Milgram's Reality-Virtuality Continuum is defined by Paul Milgram and Fumio Kishino as a continuum that spans between the real environment and the virtual environment, and comprises Augmented Reality (AR) and Augmented Virtuality (AV) in between, where AR is closer to the real world while AV is closer to a pure virtual environment [CaFu10]. Milgram's Reality-Virtuality Continuum can be seen in Figure 1.1.

Figure 1.1: Milgram's Reality-Virtuality Continuum [MiKi94]

Augmented Reality aims at simplifying the user's life by bringing virtual information not only to his immediate surroundings, but also to any indirect view of the real-world environment, such as live video streams of recorded and live football games

where different colored lines help the viewer better understand where players are located with regard to important positions on the field. AR enhances the user's perception of and interaction with the real world. While Virtual Reality (VR) technology, or Virtual Environment as it is referred to by Milgram, completely immerses users in a synthetic environment without any contact with the real world, AR technology offers the user a dual world with both real and synthetic content, where virtual objects are added in such a way as to augment the user's perception in real time and provide a better understanding of the surroundings. Virtual objects added to the real environment show information that the user cannot directly detect with his/her senses. The information passed on by the virtual objects can help the user in performing daily work tasks, such as guiding workers through the electrical wiring of an aircraft by displaying pertinent digital information, or guiding maintenance workers by displaying which units might need repair or are due for maintenance. The information can also have a simple entertainment purpose, as with Wikitude and many mobile augmented reality applications. AR applications cover a wide range of classes such as medical visualization, entertainment, advertising, maintenance and repair, annotation, robot path planning, etc. Augmented Reality is considered by many [CaFu11, AzBa01] not to be restricted to a particular type of display technology such as the head-mounted display (HMD), nor is it considered to be limited to the sense of sight. AR can potentially be applied to all senses, augmenting smell, touch and hearing as well. To this effect, AR can also be used to augment or replace a user's missing sense by sensory substitution, such as augmenting hearing for deaf users through visual cues, as will be discussed later on in this

research work. Similarly, many [CaFu11, AzBa01] also consider as AR applications those that require removing real objects from the environment, commonly referred to as mediated or diminished reality, in addition to those adding virtual objects. In fact, removing objects from the real environment corresponds to covering the objects with virtual information that matches the background in order to give the user the impression that the objects are not there.

HISTORY AND BACKGROUND OF AUGMENTED REALITY

The first appearance of Augmented Reality (AR) dates back to the 1950s, when Morton Heilig, a cinematographer, thought of cinema as an activity that would have the ability to draw the viewer into the onscreen activity by taking in all the senses in an effective manner. In 1962, Heilig built a prototype of his vision, named Sensorama, which he had described in 1955 in The Cinema of the Future and which predated digital computing [CaFu11]. Next, Ivan Sutherland invented the head-mounted display in 1966, as seen in Figure 1.2. In 1968, Sutherland was the first to create an augmented reality system using an optical see-through head-mounted display [IS09]. In 1975, Myron Krueger created the Videoplace, a room that allowed users to interact with virtual objects for the first time. Later, Tom Caudell and David Mizell from Boeing coined the term Augmented Reality while helping workers assemble wires and cables for an aircraft [CaFu11]. They were also the first to start discussing the advantages of augmented reality versus virtual reality, such as requiring less power since fewer pixels are needed [IS09]. In the same year, L. B. Rosenberg developed one of the first functioning AR systems, called Virtual Fixtures, and demonstrated its benefit on human performance, while Steven Feiner, Blair MacIntyre and Dorée Seligmann presented the first major paper on an AR system prototype named

KARMA. The reality-virtuality continuum seen in Figure 1.1 was not defined until 1994, when Paul Milgram and Fumio Kishino described it as a continuum that spans from the real environment to the virtual environment. Augmented reality and augmented virtuality are located somewhere in between, with AR being closer to the real environment and AV being closer to the virtual environment. In 1997, Ronald Azuma wrote the first survey of AR, providing a widely acknowledged definition of AR by identifying it as combining real and virtual environments while being both registered in 3D and interactive in real time [IS09]. The first outdoor mobile AR game, ARQuake, was developed three years later by Bruce Thomas and demonstrated during the International Symposium on Wearable Computers. In 2005, the Horizon Report predicted that AR technologies would emerge more fully within the next 4-5 years; and, as if to confirm that prediction, camera systems that can analyze physical environments in real time and relate positions between objects and environments were developed the same year. This type of camera system has become the basis for integrating virtual objects with reality in AR systems. In the following years, more and more AR applications were developed, especially mobile applications, such as the Wikitude AR Travel Guide launched in 2008, but also medical applications. Nowadays, with the new advances in technology, an increasing number of AR systems and applications have emerged, notably MIT's Sixth Sense prototype, and it is expected that more and more mobile AR applications will surface with the frequent appearance of newer and more advanced portable devices, such as the latest Android phones, iPhones and iPads, which are increasingly capable of handling the computational needs required by augmented reality applications.

Figure 1.2: Ivan Sutherland's Head-Mounted Display [IS09]

1.2. MOTIVATION

While new technologies are often used to facilitate regular people's lives, their potential for helping disabled people is often overlooked. Augmented reality currently offers users the opportunity to add virtual information to their real-world surroundings in real time. It also has the potential to augment not only the sense of sight, but also other senses such as hearing. In this sense, augmented reality could be used to complement a user's missing sense. For instance, augmented reality can help deaf people perceive what an interlocutor might be telling them, without the user needing to know how to read lips and without the interlocutor needing to know sign language, by outputting the person's speech next to their head in an intuitive, easy-to-interact-with, and easily understandable way. In addition, such a system can also be used as a convenient translator by travelers.

Using an iPad2 with a camera as the interaction device, this thesis introduces a system that performs speech recognition and optional language translation, and displays the resulting string in an easy and natural way by detecting a face and its corresponding position in the frames and outputting the resulting string next to the detected face in a cartoon-like bubble. Using this system, dubbed ihear, consists of having the user simply angle the device towards the person's face; once both text and face are detected, the final string is output on the screen without requiring any additional steps from the user. This system presents one of the aspects of augmented reality that does not only involve augmenting the sense of sight, but also involves hearing augmentation. In addition, the following conditions and limitations are assumed:

- The user does not know sign language or how to read lips,
- The environment is quiet and free of background noise,
- The system will be used for one-on-one conversation,
- Speech recognition needs to happen on the device so as not to depend on network availability.

1.3. THESIS CONTRIBUTION

The contributions of this research are as follows:

- A detailed description of augmented reality technologies, systems, and applications, notably mobile applications, and the future of augmented reality technologies.

- The design and implementation of a picture translator, dubbed itranslatar, using the iOS platform, Tesseract OCR and Google Translate, as a way of introducing the iOS platform and the integration of foreign libraries and APIs within the iOS platform.
- The design of an augmented reality video frame translating application using the original itranslatar design as a starting point. Although the itranslatar augmented reality design was not implemented, a detailed explanation of the design and of the steps to implementing the iOS AV Foundation framework used in the main focus of this research is provided along with this work.
- The design of an augmented reality hearing application, dubbed ihear, using the iOS platform, the OpenEars open source library as the means for speech recognition, OpenCV as the means for image/frame processing, and Google Translate as the means for language-to-language translation. A detailed explanation of the OpenEars speech recognition system and the OpenCV face detection algorithm is provided along with this work.
- The implementation of the augmented reality hearing application design on the iPad2.

1.4. THESIS ORGANIZATION

Chapter 1 of this thesis provides an introduction and historical background to the state-of-the-art technology that is augmented reality and the objective of this research using augmented reality. Chapter 2 provides a more in-depth explanation of current augmented reality systems, tools and applications, along with a better understanding of mobile application challenges and the future direction augmented reality is taking. In

Chapter 3, the design of an augmented reality application for picture/video frame translation is presented, along with some of the critical algorithms used during the first phase implementation of the design. Chapter 4 introduces the design of an augmented reality application for hearing augmentation, in conjunction with the critical algorithms used for speech recognition and face detection during the implementation of the design. Results of the prototype developed for this research are also discussed in Chapter 4. Chapter 5 concludes this body of work while presenting new directions for potential future research and further implementation of the two proposed augmented reality application designs.

2. AUGMENTED REALITY TECHNOLOGY, SYSTEMS AND APPLICATIONS

2.1. AUGMENTED REALITY TECHNOLOGIES

Augmented reality, being a state-of-the-art technology, requires state-of-the-art algorithms and devices to be feasible. The main tools of augmented reality are computer vision, devices capable of handling heavy computations, and intuitive interfaces that provide the user with a natural and easy-to-use way to interact with the AR system. These tools are described in this section.

COMPUTER VISION METHODS IN AUGMENTED REALITY

Computer vision renders 3D virtual objects from the same viewpoint from which the images of the real scene are being taken by the tracking cameras. Augmented reality image registration uses different methods of computer vision, mostly related to video tracking. These methods are usually divided into two stages: tracking and reconstructing/recognizing. During tracking, fiducial markers, optical images, or interest points are detected in the camera images using various image processing methods such as feature detection or edge detection. Most computer vision tracking techniques can be divided into two classes: feature-based tracking and model-based tracking. Feature-based methods consist of discovering the connection between 2D image features and their 3D world frame coordinates. Model-based methods make use of a model of the tracked object's features, such as CAD models or 2D templates of the item using

distinguishable features. Once a connection is made between the 2D image and the 3D world frame, the camera pose can be calculated by projecting the 3D coordinates of the features into the observed 2D image coordinates and by minimizing the distance to their corresponding 2D features. Camera pose estimation most often uses point features to determine the appropriate constraints. The reconstructing/recognizing stage makes use of the data obtained during the first stage to reconstruct the real world coordinate system [CaFu11].

Assuming a calibrated camera and a perspective projection model, if a point has coordinates $(x, y, z)^T$ in the coordinate frame of the camera, its projection onto the image plane is $(x/z, y/z, 1)^T$. In point constraints, there are two principal coordinate systems, illustrated in Figure 2.1: the world coordinate system $W$ and the 2D image coordinate system. Let $p_i = (x_i, y_i, z_i)^T$, where $i = 1, \ldots, n$, with $n \geq 3$, be a set of 3D non-collinear reference points in the world coordinate frame and $q_i = (x_i', y_i', z_i')^T$ be the corresponding camera-space coordinates; $p_i$ and $q_i$ are related by the following transformation:

$$q_i = R p_i + T \qquad (1)$$

where

$$R = \begin{pmatrix} r_1^T \\ r_2^T \\ r_3^T \end{pmatrix} \quad \text{and} \quad T = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} \qquad (2)$$

are a rotation matrix and a translation vector, respectively.

Let the image point $h_i = (u_i, v_i, 1)^T$ be the projection of $p_i$ on the normalized image plane. The collinearity equation establishing the relationship between $h_i$ and $p_i$ using the camera pinhole model is given by:

$$h_i = \frac{1}{r_3^T p_i + t_z} \left( R p_i + T \right) \qquad (3)$$

The image space error gives a relationship between the 3D reference points, their corresponding 2D extracted image points, and the camera pose parameters, and corresponds to the point constraints [AbMa08]. The image space error is given as follows:

$$E_i^p = \sqrt{ \left( \hat{u}_i - \frac{r_1^T p_i + t_x}{r_3^T p_i + t_z} \right)^2 + \left( \hat{v}_i - \frac{r_2^T p_i + t_y}{r_3^T p_i + t_z} \right)^2 } \qquad (4)$$

where $\hat{m}_i = (\hat{u}_i, \hat{v}_i, 1)^T$ are the observed image points.

Figure 2.1: Point constraints for the camera pose problem [AbMa08]
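In practice, the pose (R, T) is found by minimizing the image space error (4) over all point correspondences, a nonlinear least-squares problem. The following sketch is illustrative rather than part of the thesis implementation: it uses OpenCV's solvePnP to recover a camera pose from four hypothetical 2D-3D correspondences, and the object points, image points and intrinsic matrix are assumed placeholder values that a real system would obtain from marker or feature detection and from camera calibration.

```python
import numpy as np
import cv2

# Hypothetical 3D reference points in the world frame (e.g. corners of a known
# planar marker, in meters)
object_points = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.1, 0.1, 0.0],
    [0.0, 0.1, 0.0],
], dtype=np.float32)

# Corresponding observed 2D image points in pixels (assumed detections)
image_points = np.array([
    [320.0, 240.0],
    [420.0, 238.0],
    [424.0, 342.0],
    [318.0, 344.0],
], dtype=np.float32)

# Intrinsic camera matrix from a prior calibration (fx, fy, cx, cy are assumed)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]], dtype=np.float32)
dist_coeffs = np.zeros(5)  # assume negligible lens distortion

# Solve for the camera pose (R, T) that minimizes the reprojection error
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist_coeffs)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix

# Reproject the 3D points with the estimated pose to check the residual error
reprojected, _ = cv2.projectPoints(object_points, rvec, tvec, K, dist_coeffs)
error = np.linalg.norm(reprojected.reshape(-1, 2) - image_points, axis=1).mean()
print("R =\n", R, "\nT =", tvec.ravel(), "\nmean reprojection error (px):", error)
```

In a full AR pipeline, the recovered R and T would then be used to place the virtual camera so that rendered 3D content stays registered to the live image.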

In augmented reality, most computer vision methods assume the presence of fiducial markers or objects with known 3D geometry in the environment, and make use of those data. Static systems with known initial positions have the 3D scene structure calculated beforehand, while dynamic systems make use of the Simultaneous Localization And Mapping (SLAM) technique for mapping the relative positions of fiducial markers or 3D models. In the rare but growing cases when no assumptions about the 3D geometry of the scene can be made, the Structure from Motion (SfM) method is used. SfM is divided into two parts: feature point tracking and camera parameter estimation. Augmented reality tracking methods depend greatly on the type of environment and the type of AR system built. Knowing whether the environment will be indoor, outdoor or a combination of both dictates the lighting control capability of the system beforehand, which is an important factor in computer vision. Likewise, whether the AR system is meant to be static or dynamic, and whether the environment it will be used in is known or unknown, will also dictate the type of computer vision methods to use. Although visual tracking now has the ability to recognize and track a lot of things, it mostly relies on other techniques such as GPS and accelerometers. For example, for a computer to detect and recognize something as familiar and obvious as a car is quite a challenge. The surface of most cars is both shiny and smooth, and most of the feature points come from reflections and are thus not relevant for pose estimation, and sometimes not even for recognition. The few stable features that one can hope to recognize, such as window corners or wheels, are extremely difficult to match due to reflections and transparent parts. While this example is a bit extreme, it shows the difficulties and

challenges faced by computer vision with most objects having an irregular shape: food, flowers, and most artifacts. A recent approach to advancing visual tracking has been to study how the human brain recognizes objects, also referred to as the Human Vision System (HVS), as it is possible for humans to recognize an infinite number of objects and persons in fractions of a second without making any effort. If the way the human brain recognizes things can be modeled, computer vision will be able to handle the challenges it is currently facing and keep moving forward [CaFu11].

AR DEVICES

Augmented reality's main concerns when selecting appropriate devices are displays, input devices, tracking, and the computer.

DISPLAYS

There are three major types of displays used in Augmented Reality: head-mounted displays (HMD), handheld displays and spatial displays. An HMD can be worn directly on the head or as part of a helmet and places the images of both the real and virtual environment over the user's view. HMDs come as video-see-through or optical-see-through, and each can use either a monocular or binocular display optic. Video-see-through requires the user to wear two cameras on their head as well as the processing of both camera streams to provide the real part of the augmented scene and the virtual objects with unmatched resolution. Optical-see-through, on the other hand, employs half-silver mirror technology to allow the view of the physical world to pass through the lens and to graphically overlay information to be

25 reflected in the users eyes. While video-see-through systems are more demanding than optical-see-through systems due to the processing, video-see-through systems allow much more control over the result since the augmented view is already composed by the computer. This can help in synchronizing the virtual image with the scene before displaying to the user and thus reduces the effect of jittering of the virtual objects. Figure 2.2: HMD Display [ReSc03] Handheld displays consist of small computing devices with a display that the user can hold in their hands. They use video-see-through techniques to overlay graphics onto the real environment and can often be directly use as the main platform. This research considers three distinct classes of commercially available handheld displays that are being used for augmented reality: smart-phones, PDAs and Tablet PCs. With most of the recent advances, smart-phones such as iphones and Tablets such as ipads are becoming the preferred platforms for developing augmented reality applications as they do not require additional hardware to use, they present a combination of powerful CPU, camera, GPS, and solid state compasses, and they are extremely portable and widespread. 14

Figure 2.3: Handheld displays [WaSc06]

Spatial augmented reality, also known as SAR, makes use of video projectors, optical elements, holograms, radio frequency tags, and other tracking technologies to display graphical information directly onto physical objects without requiring the user to wear or carry the display. Spatial displays separate most of the technology from the user and integrate it into the environment, thus offering the ability to naturally scale up to groups of users and to support collaboration between users. This also increases the interest in such augmented reality systems in universities, labs, museums, and the art community. There are three different approaches to SAR which mainly differ in the way they augment the environment: video-see-through, optical-see-through and direct augmentation. Video-see-through displays in SAR are screen-based and are a common technique used if the system does not have to be mobile, as they are cost efficient since only off-the-shelf hardware components and standard PC equipment are required. Spatial optical-see-through displays generate images that are aligned with the physical environment and often comprise spatial optical combiners, such as planar or curved mirror beam splitters, transparent screens, or optical holograms. Neither screen-based video-see-through spatial systems nor spatial optical-see-through systems support mobile applications, due to their spatially aligned optics and display technology. Lastly, projector-based spatial displays apply front-projection to seamlessly project images directly onto

27 physical object s surfaces [BiRa07]. Table 2.1 shows a comparison of different types of display techniques for augmented reality. Figure 2.4: Example of SAR [SaMo] INPUT DEVICES There are many types of input devices for augmented reality systems. Developers can make use of gloves [ReSc03], wireless wristbands [FeMu05], Wii wireless remotes, etc. but the author of this thesis believes that smart-phones and tablets such as iphones and ipads are currently the most useful type of input devices as they can also be used as the main platform for the augmented reality system. It is important, however, to point out that the input devices chosen also greatly depend on the type of application the system is being developed for as well as the display chosen. For instance, if an application requires the user to be free handed to use it, the input device will need to be one that still allows the user to perform the necessary tasks without requiring extra unnatural gestures or to be held by the user. Examples of such input devices include gaze interaction in [LeLe10] or the wireless wristband used in [FeMu05]. Similarly, if a system makes use of a handheld display such as an iphone, the developer can utilize the touch screen input device. 16

Table 2.1: Comparison of different techniques for different types of displays

HMD, video-see-through
  Advantages: complete visualization control, possible synchronization of the virtual and real environment
  Disadvantages: requires the user to wear cameras on his/her head, requires processing of the cameras' video streams, unnatural perception of the real environment

HMD, optical-see-through
  Advantages: employs a half-silver mirror technology, more natural perception of the real environment
  Disadvantages: time lag, jittering of the virtual image

Handheld, video-see-through (smart-phone)
  Advantages: portable, widespread, powerful CPU, camera, accelerometer, GPS, and solid state compass
  Disadvantages: small display

Handheld, video-see-through (PDA)
  Advantages: portable, powerful CPU, camera, accelerometer, GPS, and solid state compass
  Disadvantages: becoming less widespread, small display

Handheld, video-see-through (Tablet PC)
  Advantages: more powerful
  Disadvantages: more expensive and heavy

Spatial, video-see-through
  Advantages: cost efficient, can be adopted using off-the-shelf hardware components and standard PC equipment
  Disadvantages: does not support mobile systems

Spatial, optical-see-through
  Advantages: more natural perception of the real environment
  Disadvantages: does not support mobile systems

Spatial, direct augmentation
  Advantages: displays directly onto physical objects' surfaces
  Disadvantages: not user dependent: everybody sees the same thing (in some cases this disadvantage can also be considered to be an advantage)

TRACKING

Tracking devices consist of digital cameras and/or other optical sensors, GPS, accelerometers, solid state compasses, wireless sensors, etc. Each of these technologies has a different level of accuracy, and the choice depends greatly on the type of system being developed. Table 2.2 shows a comparison of common tracking techniques adapted from [PaSi08] and [DiHo07], where range corresponds to the size of the region that can be tracked within, setup is the amount of time needed for instrumentation and calibration, precision represents the granularity of a single output position, time is the duration for which useful tracking data is returned before it drifts too much, and environment corresponds to where the tracker can be used: indoors or outdoors.

COMPUTER

Augmented reality systems require powerful CPUs and considerable amounts of RAM to process camera images. Most mobile computing systems employ a laptop in a backpack configuration, but with the recent advances in technology, we can hope to see this backpack configuration replaced by a lighter and more sophisticated-looking system. Stationary systems can use a traditional workstation with a powerful graphics card.

Table 2.2: Comparison of common tracking technologies

Technology and environment (indoors/outdoors):
- Optical: marker-based: in/out
- Optical: marker less: in/out
- Optical: outside-in: in
- Optical: inside-out: in/out
- GPS: out
- WiFi: in/out
- Accelerometer: in/out
- Magnetic: in/out
- Ultrasound: in
- Inertial: in/out
- Hybrid: in/out
- UWB: in
- RFID: active: when needed, 500, in/out
- RFID: passive: when needed, 500, in/out

AR INTERFACES

One of the most important aspects of augmented reality is to create appropriate techniques for intuitive interaction between the user and the virtual components of augmented reality applications. This research has identified four main ways of interacting with AR applications: tangible AR interfaces, collaborative AR interfaces, hybrid AR interfaces, and the emerging multimodal AR interfaces.

TANGIBLE AR INTERFACES

Tangible interfaces support direct interaction with the real world by exploiting the use of real, physical objects and tools, such as the touch screen display of smart phones or the use of sense tables as with TaPuMa [MiKu08]. TaPuMa is a table-top tangible interface,

31 also called sense table, which uses physical objects to interact with digital projected maps using real-life objects the user carries as queries to find locations or information on the map. The advantage of this type of application is found in using objects as keywords thus eliminating language barriers in conventional graphical interfaces that are often mistranslated. While objects can be interpreted in different ways based on culture and age, they still can be considered more meaningful to users. Other examples of tangible objects include the use of wristbands such as in [CoKe04] and other various physical objects to interact with augmented reality systems. Tangible interfaces are currently the most common form of augmented reality interfaces COLLABORATIVE AR INTERFACES Collaborative interfaces include the use of multiple displays to support remote and co-located activities. Co-located sharing uses 3D interfaces to improve physical collaborative workspace while remote sharing effortlessly integrates multiple devices with multiple locations to enhance teleconferencing. Such interfaces could also be integrated with medical applications for performing diagnostics, surgery, or even maintenance routine HYBRID AR INTERFACES Hybrid interfaces combine an assortment of different, but complementary interfaces as well as the possibility to interact through a wide range of interaction devices. They also provide a flexible platform for unplanned, everyday interaction where it is not known in advance which type of display or device will be used. An example of such a system is Sandor. et al. augmented reality system implemented to support end 20

32 users in assigning physical interaction devices to operation as well as virtual objects on which to perform those procedures, and in reconfiguring the mappings between devices, objects and operations as the users interact with the system [SaOl05] MULTIMODAL AR INTERFACES Multimodal interfaces combine real objects input with naturally occurring forms of language and behaviors such as speech, touch, natural hand gestures, or even gaze. These types of interfaces have more recently emerged and examples of it include MIT s sixth sense [MiMa09] wearable gesture interface, dubbed WUW, as it supplies the user with information projected onto surfaces, walls, and physical objects through natural hand gestures, arm movements, and/or interaction with the object itself. This type of interaction is now being largely developed and is sure to become one of the preferred types of interaction for future augmented reality application as it offers a relatively robust, efficient, expressive and highly mobile form of human-computer interaction that represents the user s preferred interaction style. In addition, they have the capability to support users ability to flexibly combine modalities and choose from interaction methods such as speech or gesture, or to switch from one input mode to another depending on the task or setting. This freedom to choose the mode of interaction is crucial to wider acceptance of pervasive systems in public places [CaFu11] AR TOOL KITS Augmented reality being such a hot topic but not such an easy thing to accomplish, AR tool kits are a necessity and a convenience for developers that might not have the necessary background in image processing but wish to develop an augmented 21

reality system or application. One of the most well-known computer vision libraries for creating augmented reality applications is ARToolKit, developed in 1999 by Hirokazu Kato from the Nara Institute of Science and Technology and released by the University of Washington HIT Lab. It uses video tracking capabilities to calculate in real time the real camera position and orientation relative to physical markers. Once the real camera position is known, a virtual camera can be placed at the exact same position and 3D computer graphics models can be drawn to overlay the markers. The extended version of ARToolKit, ARToolKitPlus, added many features over ARToolKit, notably class-based APIs, but it is no longer being developed and already has a successor: Studierstube Tracker [CaFu11]. Studierstube Tracker's concepts are very similar to ARToolKitPlus; however, its code base is completely different and it is not open source, thus not available for download. It is able to support mobile phones with Studierstube ES, as well as PCs, making its memory requirements very low: 100KB, which corresponds to 5 to 10% of the memory requirements of ARToolKitPlus; and its processing very fast: about twice as fast as ARToolKitPlus on mobile phones and about 1ms per frame on a PC [ScFu00]. Studierstube Tracker is highly modular and developers can extend it in any way by creating new features for it. When first introducing Studierstube, the designers had in mind a user interface that uses collaborative augmented reality to bridge multiple user interface dimensions: multiple users, contexts, and locales as well as applications, 3D windows, hosts, display platforms, and operating systems [ScFu00].
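As an illustration of the marker-tracking loop that ARToolKit popularized (detect a fiducial in the frame, estimate the camera pose relative to it, then register virtual content to it), the sketch below uses OpenCV's ArUco module, which plays an analogous role. It is not ARToolKit itself and is not part of the thesis implementation; the camera matrix, distortion coefficients and 5 cm marker size are placeholder values that a real system would obtain from calibration, and the exact ArUco entry points vary somewhat across OpenCV versions (the opencv-contrib build is required).

```python
import numpy as np
import cv2

# Assumed intrinsics from a prior calibration and an assumed 5 cm square marker
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
dist = np.zeros(5)
marker_length = 0.05  # meters

# Dictionary of predefined 4x4 fiducial patterns
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Step 1: detect fiducial markers in the camera image
    corners, ids, _rejected = cv2.aruco.detectMarkers(gray, dictionary)

    if ids is not None:
        cv2.aruco.drawDetectedMarkers(frame, corners, ids)
        # Step 2: estimate the camera pose relative to each detected marker
        poses = cv2.aruco.estimatePoseSingleMarkers(corners, marker_length, K, dist)
        rvecs, tvecs = poses[0], poses[1]
        for marker_id, rvec, tvec in zip(ids.ravel(), rvecs, tvecs):
            # Step 3: a renderer would use (rvec, tvec) to place the virtual
            # camera so that 3D content appears attached to the marker
            print("marker", int(marker_id), "t =", tvec.ravel())

    cv2.imshow("marker tracking sketch", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```

A real AR application would replace the print statement with a rendering step that draws a 3D model over the marker using the recovered pose.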

2.2. AUGMENTED REALITY SYSTEMS

AR GENERAL SYSTEMS

This research divides augmented reality systems into five categories: fixed indoor systems, fixed outdoor systems, mobile indoor systems, mobile outdoor systems, and mobile indoor and outdoor systems. We define a mobile system as one that does not constrain the user to one setting and thus allows for movement through the use of a wireless system. Fixed systems, on the other hand, cannot be moved while in use, and the user must use these systems wherever they are set up, without the flexibility to move unless the whole system setup is relocated. Choosing what type of system to build is the first thing the developer should do, as it will dictate the tracking system, display device and possibly the interface to select. For instance, fixed systems will not need to make use of GPS tracking, while outdoor systems will most likely need to. This research conducted a study using 25 randomly selected recent papers that were classified according to their systems, and determined what tracking techniques, display type and interface were used for each. Tables 2.4 and 2.5 show the results of the study and Table 2.3 shows the meaning of the abbreviations used in Tables 2.4 and 2.5. The papers used for the study were all published between 2002 and 2010, with a majority published after 2005. The results from this study can be used to show the most popular types of systems developed so far, since the papers were more or less randomly chosen from recent papers. However, it should be taken into account that mobile systems have been on the rise lately, and while mobile indoor and outdoor systems represent only 12% of the papers studied,

35 due to the recent advances in technology, developers are looking more and more into this type of system as they have the most chance of making it into the market. Note that in the mobile indoor and outdoor systems, one of the system studied (Costanza s eye-q [CoIn06]) does not use any tracking techniques, while others use multiple type of tracking techniques. This is due to the fact that this system was developed as a personal, subtle notification display. It was also noted that there was only one paper studied that discussed fixed outdoor systems because this type of system is not popular due to its inflexibility. Although these results cannot be used as a general rule when building an augmented reality system, they can serve as a pointer of what type of tracking techniques, display, or interface is more popular for each system type. Developers should also keep in mind that these choices depend also on the type of applications, although as can be seen, the application does not necessarily guide the type of system. From this study, we see that optical tracking is mostly preferred in fixed systems, while a hybrid approach is most often preferred for mobile systems. HMDs are often the preferred type of display choice; however, we predict that they will need to become more fashionably acceptable for systems using them to reach the market. When it comes to interfaces, the most popular choice is tangible interfaces, but we predict that multimodal interfaces will become more famous with developers within the next years as we believe that they also have a better chance to reach the public industry. 24

Table 2.3: Abbreviations

Adv: Advertising
Ent. and Ed.: Entertainment and Education
Med.: Medical
Nav. and Info.: Navigational and Informational
Gen.: General
IO: inside out
OI: outside in
MB: marker-based
ML: marker less
EC: electronic compass

Table 2.4: AR System Comparison

Table 2.5: Systems' preferred types of technologies

39 AR MOBILE SYSTEMS Augmented reality mobile systems include mobile phone applications as well as wireless systems such as MIT s sixth sense [MiMa09] and eye-q [CoIn06]. Users interact with mobile AR systems through the use of wearable mobile interfaces in a natural and socially acceptable way. Using mobile phones for developing augmented reality applications present both advantages and drawbacks. Most mobile devices of interest are equipped with cameras, and nowadays, these cameras are getting better and better with auto focusing lenses and the possibility to capture high level images. In addition, most cell phones provide accelerometers, magnetometers and GPS from which AR can benefit. However, in spite of rapid advances in mobile phones, their computing platform for real-time imaging applications is still rather limited if done using the cell phone s platform. As a result, many applications send data to a remote computer that does the computation and send the results back to the mobile device, but this approach is not well adapted to AR due to limited bandwidth. Nevertheless, considering the rapid development in mobile devices computing power, it can be considered feasible to develop real-time augmented reality application that locally process information in the near future. In fact, the main focus of this research makes use of an ipad2 to create a hearing augmented reality application that processes both frames and speech-listening locally. While the augmented reality processing is not quite up-to-speed, the author was able to optimize the processing of the algorithms to only occur when necessary. Furthermore, rumors about upcoming iphone devices claim that they will be equipped with the current ipad2 dual-core processor and there is good reason to believe that the 28

40 ipad3 will be equipped with an even more powerful processor. Likewise, upcoming Android mobile devices are claimed to be equipped with dual-core processors. This thesis defines a successful augmented reality mobile system as an application that enables the user to focus on the application or system rather than on the computing device, interact with the device in a natural and socially acceptable way, and provide the user with information that can be shared if necessary. This suggests the need for lightweight, wearable or mobile devices that are fashionably acceptable, private and possess robust tracking technology SOCIALLY ACCEPTABLE SYSTEMS Mobile phones and PDA s reminders, messages, calls, etc. have been judged as distractions and are mostly considered not to be socially acceptable as they not only disrupt the person whose phone or PDA is receiving an interruption of more or less importance, but also the other persons present in the same room, whether they are having a conversation with the disruptor (as this is how the person whose device is disturbing the room will be seen as) or the disruptor is in a public place, such as a bus. As a result, many research groups such as [FeMu05], [CoIn06] have decided that interaction with augmented reality systems implemented on mobile devices needs to be subtle, discrete and unobtrusive so as to not disrupt the user if s/he is under a high load of work and the disruption is not of high priority level. A system that is subtle, discrete and unobtrusive thus becomes socially acceptable. The main issue with social acceptance comes from the level of disruption portable devices create in public places and during conversations and meetings. In [CoIn06], the authors studied the peripheral vision and adapt their mobile device s information cue, eye-q which is a pair of glasses, so that it does not occlude the 29

41 foveal field of view, which is the main focus of the human field of view. The cues become less and less visible depending on the level of concentration and work-load of the user, making it naturally adaptive to the user s cognitive workload and stress. Moreover, since the cues are only visible to the user and will only disrupt the user depending on his level of concentration, these cues can be considered to be socially acceptable. This research also considers multimodal interfaces to be crucial to wider acceptance of pervasive systems in public places as they offer the user the freedom to choose from a range of interaction modes giving the user the opportunity to select the most appropriate and socially acceptable mean of communication with their devices. Socially acceptable systems also need to consider a natural way of interaction between the user and the device. If the interface and way of interaction between the user and the device is unnatural, it will appear awkward of use in public places. For instance, in [FeMu05], the authors have created an augmented reality system that uses a wireless wristband including an RFID reader, 3-axis accelerometer and RF communication facilities, a cell phone and a wireless earpiece to allow the user to interact with services related to objects using the RFID tags through implicit touch-based gestures. Once an object is detected in the user s hand, the user can interact with information about this object using natural slight wrist gestures while previous commercial interfaces that supported hands-free and eyes-free operations required speech recognition, which not only suffers from medium performance and noisy conditions, but is also not considered to be socially acceptable. Mobile augmented reality systems that wish to step from the laboratories to the industry will also be facing fashion issues as the users are unlikely to want to wear HMDs 30

42 or other visible devices that might be deemed as geeky. As a result, it is up to the developers of AR mobile systems to take into account fashion trends as they might be a big obstacle to overcome. Groups such as MIT Media Lab, constantly try to reduce the amount of undesirable visible devices or arrange them in different design choice. WUW s first development stage integrated the camera and projector into a hat and the second development integrated it into a pendant. They were also researching a way to replace the need to wear colored marker caps on finger tips [MiMa09]. Although these are not quite up to the current fashion standard, they are a good step in the direction of fashionably acceptable and since fashion changes often, they might even be capable of bridging the gap between geeky and fashionably acceptable PERSONAL AND PRIVATE SYSTEMS Augmented reality mobile systems need to be personal, meaning that the displayed information should only be viewed by others if the user allows it. MIT s SixthSense technology [MiMa09] although very advanced, does not offer such privacy to its user due to the use of direct augmentation technique without using any viewing device for protecting the information. Anyone can see the same thing as the user at any time. This poses a dilemma as not needing to wear or carry any extra viewing device for the WUW is an advantage for fashionably acceptable devices; however, it is a problem when it comes to privacy. Systems, such as Costanza et. al. eye-q [CoIn06] or Babak Parviz s contact lens [Pa09], offer such privacy to the user with information that can only be viewed by the user. These systems can also be considered socially acceptable as they are discrete and subtle as well as fashionably correct. However, these systems do not offer the ability of sharing information if the user desires to. A successful AR mobile 31

system should provide the user with private information that can be shared when the user wishes to. In addition, AR mobile systems need to be careful not to violate other users' and non-users' privacy in new ways. Indeed, information that is available and not considered private on social networks, for instance, can be considered private in everyday life. As a result, technologies such as WUW [MiMa09], which make use of online available information about other persons to display it for the user, might face privacy issues due to the way the information is being disclosed.

TRACKING TECHNOLOGY

It is well known that for AR to be able to trick the human senses into believing that computer-generated information coexists with the real environment, very accurate position and orientation tracking is required. As was seen in the AR systems section, the most common type of tracking for mobile systems combines a few complementary tracking techniques so that the advantages of one compensate for the disadvantages of the others, which creates hybrid tracking. Outdoor systems mostly make use of GPS and inertial tracking techniques, using accelerometers, gyroscopes, electronic compasses and/or other rotation sensors, along with some computer vision tracking techniques. GPS systems, although lacking in precision, provide an easy tracking solution for outdoor systems and allow for better estimating of the user's position and orientation once coupled with some inertial sensors. In this way, the user's point of interest is narrowed down, which allows for easier visual tracking with fewer options. Indoor systems, where GPS cannot be used, unite visual tracking with inertial techniques only. Visual tracking achieves the best results

with low frequency motion, but is highly likely to fail with rapid camera movement, such as the movements that occur with HMDs. On the other hand, inertial tracking sensors perform best with high frequency motion, while slow movements do not provide good results due to noise and bias drift. The complementary nature of these systems leads to combining them in most hybrid systems. Other systems, such as [LeLe10] and [ArPe09], rely on computer vision for tracking, but most are indoor systems in which the environment can be somewhat controlled. When it comes to visual outdoor tracking, a couple of factors, such as lighting, make tracking extremely difficult. Moreover, some objects present tracking difficulties. One of the most advanced visual tracking mobile systems is Google Goggles; however, this application can only track objects of regular form, such as barcodes and books, or places, thanks to its use of GPS and the accelerometer, which help the application recognize where the user is standing and the user's orientation to narrow down the options. Google Goggles cannot recognize things of irregular shape such as leaves, flowers or food.

MOBILE AR SYSTEMS DESIGN IN THIS THESIS

The mobile applications developed within the scope of this research use the iPhone and iPad2 as the interaction and platform devices. These devices have been proven to be fashionably acceptable and easy to interact with, and it is up to the developer to keep the natural way of interaction present in iOS devices in order to keep the interface simple. While it is simple to avoid audio cues in most of these applications, in the case of the hearing augmented reality application, the system was designed for hearing impaired people to interact with one other person in a natural and easy-to-use fashion, and not for use in public places.

2.3. AUGMENTED REALITY APPLICATIONS

While there are many possibilities for using augmented reality in an innovative and original way, this research has identified four types of applications that are most often the main focus of AR research: advertising and commercial, entertainment and education, medical, and mobile applications of often trivial use, for which we chose to focus on iPhone and iPad devices. In the following sections, this research studies why augmented reality could bring a better solution to some areas, a cheaper one to others, or simply create a new type of service. The challenges augmented reality faces in moving from the laboratories to industry are also discussed.

ADVERTISING AND COMMERCIAL AR APPLICATIONS

Marketers have started to use a form of augmented reality to promote their new products online. Most techniques use markers such as QR codes that users present in front of their webcam, either using specific software or simply on the advertising company's website. An instance of such use was seen in December 2008, when MINI, the famous car company, ran an augmented reality advertisement in several German automotive magazines. The reader simply had to go to the MINI website, show the ad in front of their webcam, and a 3D MINI appeared on their screen as if on top of the magazine, as seen in Figure 2.5. Beyond Reality [BR] decided to use a marker-less technique when they released a 12-page advertisement magazine, with each page being recognized by their special software downloadable from their website. This AR advertisement magazine was published as the starting point to their Augmented Reality Games. They see that with such a system, they could add a paid option in the software that would allow the user to access additional content, such as seeing a movie trailer and

then being able to click on a link to view the full movie, thus turning a magazine into a movie ticket [BR].

Figure 2.5: MINI Advertisement

Augmented reality also offers a solution to the expensive problem of building prototypes. Indeed, industrial companies are faced with the costly need to manufacture a product before commercialization to figure out whether any changes should be made and to see if the product meets expectations. When it is decided that changes should be made, which is more often the case than not, a new prototype has to be manufactured and additional time and money are wasted. Creating a virtual prototype and introducing it into the real world would save companies considerable amounts of time and money, as prototypes could be changed more quickly and created at lower cost since they would no longer involve materials. A group at the Institute of Industrial Technologies and Automation (ITIA) of the National Council of Research (CNR) of Italy [SaMo] in Milan works on augmented reality and virtual reality systems as a tool for supporting virtual prototyping. The ITIA-CNR is involved in research for industrial contexts and applications using VR, AR, real-time 3D, etc. as a support for product testing, development, and evaluation. Some examples of applied research projects where the

above technologies have been applied include motorcycle prototyping (Figure 2.6), the layout of a factory and an office (Figures 2.7 to 2.9), and the virtual trial of shoes using the Magic Mirror system discussed next.

Figure 2.6: Right picture: a virtual motorcycle prototype (on the right) next to a physical motorcycle prototype (on the left) in the real environment; left picture: virtual prototype in a virtual environment [SaMo]

Figure 2.7: Virtual factory prototype [SaMo]

Figure 2.8: Virtual office prototype [SaMo]

Figure 2.9: Virtual testing of a lighting system [SaMo]

Shoes are the accessories that follow fashion trends the most and are renewed annually, especially for those who live in fashion capitals, such as Milan, New York, and Paris. For these people, it is often more important to wear trendy shoes than comfortable shoes. With the Magic Mirror system, the ITIA-CNR of Milan [SaMo] has created an augmented reality system which, combined with high-tech footwear technology for measurement, enables the user to virtually try on shoes prior to buying/ordering them. The user is able to see his/her reflection in the Magic Mirror with a virtual model of the pair of shoes s/he would like to try on (Figure 2.10). The advantage of such a system over

going to the store to physically try on shoes is that once the user selects shoes for trial, s/he has the ability to change a few details, such as color, heel, and even stitching, before any manufacturing needs to be done. Using the Magic Mirror system only requires the user to put on special socks with spherical, infrared-reflective markers painted on them, which serve as the tracking system for the Magic Mirror. The Magic Mirror itself is an LCD screen that processes the information from the electronic catalogue and the input data from the client to check whether the chosen model is approved, and that detects and mirrors the movements of the customer.

Figure 2.10: User trying on virtual shoes in front of the Magic Mirror [SaMo]

To build such a system, the ITIA-CNR in Milan created their own library, dubbed the GIOVE library. The GIOVE library has been developed, and is continuously under development, not only for this specific project but also as a software library the ITIA-CNR can use whenever needed for any application; since it was built from scratch by the ITIA-CNR, any type of functionality can be added to it. The system first has to be calibrated using a larger version of the ARToolKit markers onto which

one of the small tracking spheres, similar to the ones placed on the socks, is placed in the middle. The marker is then laid on the orange pad that is used to help the system recognize the region of interest where the virtual objects and elements will have to be inserted and tracked. A couple of infrared lights track the IR light reflected from the markers, and since the system knows the geometric placement of the markers in advance, the virtual position can be accurately reconstructed. The socks are made of the same orange color, which is the color specified for tracking because it is an unusual color for pants; of course, the system could be configured to track any color. Another challenge the ITIA group had to face with such a system was the lens distortion caused by the camera, which distorted the real image while the virtual image remained perfect. This type of detail is not always perceivable in every system, but in the case of the Magic Mirror it was too obvious to be ignored. The group therefore had the choice of either compensating the real image so that the distortion is no longer noticeable, or distorting the virtual image so that it no longer appears so perfect. The ITIA team chose to compensate the real image, using MATLAB to calculate the degrees of distortion to compensate for. A use of augmented reality for advertising and commercial applications similar to the Magic Mirror lies in fully replacing the need to try on anything in stores, saving clients a considerable amount of time, which could be used for trying on more clothing and thus increasing the chances of a sale. Augmented reality has not fully reached the industrial market in advertising and commercial applications, mostly due to the minor and major improvements still needed for systems such as the Magic Mirror or Cisco's retail fitting room (Figure 2.11). Indeed, for the

product to be viable in the market, it needs to provide the user with a flawless representation of the original product or prototype: the user should have the complete feeling and impression of looking at a physical prototype. In the case of the Magic Mirror system, this would mean flawless tracking, so that when users look at the magic mirror they have the feeling that they are wearing the shoes and can really see and feel what they would look and feel like. Major improvements and changes to the system would be needed to add a feeling of comfort as well, but if the augmented reality application is to be viable in the market, it cannot take anything away from the real shopping experience.

Figure 2.11: Cisco's AR commercial where a client is trying on clothes in front of a "magic" screen

2.3.2. ENTERTAINMENT AND EDUCATION AR APPLICATIONS

Entertainment and educational augmented reality applications include cultural apps with sightseeing and museum guidance, gaming apps with traditional games using augmented reality interfaces, as well as some smart-phone apps that make use of AR for an entertainment and/or educational purpose.

Cultural augmented reality applications include systems that use augmented reality for virtually reconstructing ancient ruins, such as in [HuLi09] (Figure 2.12), or for virtually educating users about a site's history, such as in [MaSc04], as well as systems that exploit augmented reality for museum guidance, such as in [BrBr07] and [MiMe08]. In [BrBr07] and [MiMe08], both AR systems are mobile, but [BrBr07] also uses mobile phones as an interface while [MiMe08] uses a magic lens configuration. In [BrBr07], the authors identified the benefits of using augmented reality as an interface for their cultural applications as: efficient communication with the user through multimedia presentations, a natural and intuitive technique, and low maintenance and acquisition costs for the museum operators' presentation technology when smart-phones are used as the interface. Using a smart-phone or another hand-held display seems a more intuitive and natural technique than looking up a randomly assigned number of the object in a small printed guide, especially when most of the population already owns a smart-phone. In addition, users can relate better to multimedia presentations brought to them and will more willingly listen, watch and/or read relevant information that is brought to them right away, without any additional effort, by pointing their phone in the direction of the object rather than having to look it up in a guide (Figures 2.13 and 2.14).

Figure 2.12: Augmented reality view of Dashuifa [HuLi09]

Figure 2.13: Visitor with guidance system [MiMe08]

Figure 2.14: Mobile phone-enabled guidance in a museum [BrBr07]

Augmented reality can also be used for learning purposes in the educational field. In fact, augmented reality has recently emerged in the field of education to support many educational applications in various domains, such as history, mathematics, etc. An instance of such an application is the Magic Book developed by [Bi01], which is a book whose pages incorporate simple AR technology to make the reading more captivating and the learning a little more entertaining. Another example is the GEIST project developed by [MaSc04], which is a mobile outdoor augmented reality system for assisting users in learning history through a story-telling game in which the user can release ghosts from the

past to tell them stories of their time, thus teaching the user in a fun and more memorable way than a regular classroom teaching experience.

Augmented reality gaming applications present many advantages over physical board games with, for example, the ability to introduce animations and other multimedia presentations. The ability to introduce animations can not only add excitement to a game, but can also serve a learning purpose with, for example, indications to help players learn a new game, or serve as a referee when users are making an invalid move. An instance of such an application is ARCC, a Chinese checkers game that uses three fiducial markers and cameras fixed to the roof to track the markers [CoKe04]. Two of the fiducial markers are used for positioning the checkerboard and the third is used for manipulating the game pieces. Having only one tool for manipulating the pieces allows the developers to adapt their setup to different types of games, as all they have to change is the GUI (Graphical User Interface), which represents the board game, and the game's logic (Figure 2.15).

Figure 2.15: ARCC [CoKe04]

Beyond Reality [BR], the first to introduce a marker-less magazine, presents two board games: PIT Strategy and Augmented Reality Memory. In PIT Strategy, the player

is the pit boss in a NASCAR race and must act according to the given weather conditions, forecasts, and road conditions. In Augmented Reality Memory, the player turns a card and sees a 3D object, then turns a second card and sees another 3D object. If they match, a celebration animation appears; otherwise, the player can keep looking for matches. These two games are still under development and further information can be found at [BR].

Augmented reality entertainment and educational applications have not fully reached their potential to enter the industrial market. Once again, this is mostly due to technological components, such as tracking systems, not being advanced enough yet. For example, the few museum guidance systems developed can only be used in the specific museum or for the specific exhibition they were developed for and cannot be reused. This is caused by both systems relying mostly on the museum or exhibition organization to help the application recognize the artifacts, as opposed to detecting the artifacts solely using computer vision. So why hasn't computer vision been used to recognize the objects instead of relying mostly on knowing the user's position in the museum? As was seen in section 2.1.1, some objects have irregular forms, and although it might seem easy for us to recognize them, it is very hard for a computer to detect what these objects are; most artifacts do not possess regular shapes or easily recognizable features. Paintings do not present such a big problem for systems such as Google Goggles due to their regular shape, but other objects, such as objects of modern art with irregular geometry, make it very difficult to track a defined feature [CaFu11].

2.3.3. MEDICAL AR APPLICATIONS

Most augmented reality medical applications deal with image-guided and robot-assisted surgery. As a result, significant research has been done to incorporate augmented reality with medical imaging and instruments, incorporating the physician's intuitive abilities. Major breakthroughs have been provided by the use of diverse types of medical imaging and instruments, such as video images recorded by an endoscopic camera device and presented on a monitor viewing the operating room site and the inside of the patient. However, these breakthroughs have also been proven to limit the surgeon's natural, intuitive and direct 3D view of the human body, as surgeons now have to deal with visual cues from an additional environment provided on the monitor [BiWi07]. Augmented reality can be applied so that the surgical team can see the imaging data in real time while the procedure is progressing. [BiWi07] introduced an augmented reality system for viewing through the real skin onto virtual anatomy, using polygonal surface models to allow for real-time visualization. The authors also integrated the use of navigated surgical tools to augment the physician's view inside the human body during surgery (Figure 2.16).

Figure 2.16: Bichlmeier et al. system for viewing through the skin [BiWi07]

Teleoperated robot-assisted surgery provides surgeons with additional advantages over minimally invasive surgery, with improved precision, dexterity, and visualization [CaFu11]; however, implementing direct haptic feedback has been limited by sensing and control technology and is thus restricting the surgeon's natural skills. The lack of haptic feedback has been proven to affect the performance of several surgical operations [BeOk04]. In [AkRe06], the authors propose a method of sensory substitution that provides an intuitive form of haptic feedback to the user. The force applied by the surgeon is graphically represented and overlaid on a streaming video using a system of circles that discretely change color across three pre-determined ranges (Low Force Zone: green, Ideal Force Zone: yellow, and Excessive Force Zone: red) according to the amount of force detected by strain gauges (Figure 2.17).

Surgical operations are not the only area that depends upon seeing medical imaging data on the patient in real time: the need to improve medical diagnosis also relies on it. In this research field, the ICAR-CNR group of Naples, Italy, is working on an interactive augmented reality system for checking patients' hands and wrists for arthritis by overlaying 3D MR imaging data in real time directly on top of the patient's hand. Since arthritis signs are strongly associated with pain intensity and thus require direct manipulation of the hand and wrist region to be diagnosed, the system needs to support physicians by allowing them to perform morphological and functional analyses at the same time [GaMi10].

Augmented reality could also be used to manage patients' medical history. Imagine if all a doctor had to do to check a patient's medical history was to put on a head-mounted

display or open an app on an iPad and look over the patient to see virtual labels showing the patient's past injuries and illnesses.

The possibilities for using augmented reality in the medical field to provide better solutions than those that already exist are infinite. In [LuKl05], the authors use augmented reality to provide a low-cost and smaller-sized solution to the post-stroke hand rehabilitation problem that has the potential of being used in clinics and even directly at home. In [JuBo04], the authors use augmented reality to help patients fight the phobia of cockroaches, and thus show a new possibility for AR to be used as a treatment for psychological disorders as well.

Figure 2.17: Sequential screenshots of knot tying task [AkRe06]

Additionally, augmented reality can be used to assist the impaired, for example supporting the visually impaired through augmented reality audio navigation or supporting the hearing impaired through the use of visual cues. In [MaSo10], the authors developed a multimodal feedback strategy for augmented navigation of the visually impaired. The

feedback device consisted of a Wiimote, which provided audio and haptic feedback to operate as a guiding tool and warn users when they were getting close to walls and other obstacles. In this research, an augmented reality hearing application was designed to help the hearing impaired through conversations by overlaying the interlocutor's speech in an easy-to-understand and intuitive manner.

Unfortunately, on top of facing a few issues of technological maturity, such as display and tracking issues, medical applications also face privacy concerns. Display challenges arise mostly from the fact that the preferred type of display for AR medical applications is the head-mounted display, as it not only allows physicians to use both hands, but also makes it easier to track the doctor's current view in order to augment the right surfaces. However, HMDs are rather challenging to implement for medical applications: accurately placing and applying depth perception to 3D models, and not projecting the augmented image over the surgeon's tools, are difficult issues to overcome. Another possible display that could be used is the spatial display, which allows the whole surgical team to see the same thing at the same time; however, it then becomes very hard to track where the surgeon is looking and what the desired place for augmenting is. Privacy and security concerns also arise in the medical field when discussing the treatment of patients' confidential medical history, which needs to be handled in a secure yet simpler manner than it currently is.

Other types of issues that medical applications of augmented reality will most likely have to face are the problems that arise with retraining the medical staff to use new tools. Most augmented reality applications aim at simplifying the use of AR tools

such that they correspond to what the physician is used to. For example, in [AkRe06] the feedback system developed by the authors did not require the surgeons to truly learn how to use it, as the application was easily integrated onto the da Vinci Surgical System that most surgeons know how to use. Even with such a system, the surgeons still have to get used to this type of haptic feedback interface, although the training can be considered short and rather inexpensive in comparison to the gain. However, there are other augmented reality systems that would require complete retraining of the medical staff to interact with the application. For instance, applications that require users to interact with a 3D input device, as opposed to 2D input devices such as a computer mouse, will present some training issues, as they might be too costly for the medical field to judge them viable to adopt.

2.3.4. MOBILE (IPHONE) AR APPLICATIONS

Many augmented reality mobile applications for the iPhone already exist; however, very little development has been done for the iPad due to the lack of a camera in the first generation of the device. Moreover, most iPhone applications have either an entertainment and/or educational purpose or a navigational and/or informative purpose, such as orienting the user. Examples of such applications include WikitudeDrive, a GPS-like application that allows the user to keep his/her eyes on the road while glancing at the GPS (Figure 2.18); Firefighter 360, which has an entertainment purpose and lets the user fight a virtual fire like a real firefighter (Figure 2.19); and Le Bar Guide, which has a navigational function to guide the user to the nearest bar that serves Stella Artois beer (Figure 2.20). Websites such as Mashable, the Social Media Guide [Ma09] and

iPhoneness [IN10] have compiled lists of the best augmented reality applications for the iPhone.

Figure 2.18: Wikitude Drive

Figure 2.19: Firefighter 360 app

Figure 2.20: Le Bar Guide app

Because adding augmented reality to mobile applications is relatively new, there are currently not many libraries, kits, or code samples available for iPhone developers who want to add augmented reality to their applications. Studierstube Tracker and Studierstube ES support the iPhone platform; however, they are not open source [HAR10]. This research found two sources to help iPhone developers use AR in their mobile applications. [SS10] is a slideshow presentation that shows the viewer how to develop augmented reality applications on the iPhone, with code for retrieving the GPS position, using the compass and accelerometer, and getting the image from the camera. iPhone ARKit [SC09] is a small set of classes that can offer developers augmented reality in any iPhone application by overlaying information, usually geographic, over the camera view. The iPhone ARKit's APIs were modeled after iOS Map Kit, which is a framework for providing an interface to embed maps directly into windows and views and for supporting map annotation, adding overlays, and performing reverse-geocoding lookups to determine placemark information for a given map coordinate.

Augmented reality mobile applications are among the only augmented reality applications found among the general public. However, even these applications face a few challenging issues of their own. There are, for instance, problems due to GPS systems not being accurate enough for applications that require very precise placement of virtual tags. There are also issues with working with limited hardware capability when image processing power is required. In the iPhone category, there were challenges in accessing the video APIs, as Apple would not open its APIs enough for developers to get in there and work with the video. However, with the release of iOS 4 and the upcoming iOS 5, augmented reality sees more doors opening for

augmented reality applications on mobile phones: developers now have the ability to access the camera image APIs, enhanced image tracking, gyroscopic motion sensing, and a faster processor as well as a high-resolution display.

Similar to mobile systems, mobile applications also face social acceptance issues, as the applications need to be subtle and discreet. Mobile applications should not make random noises at inopportune times for the user. No matter how accustomed to mobile phones our society has become, it is still considered rude and obnoxious when someone is on the phone in a public place: when a person's phone rings in public, that person's first reflex is to search for the phone to turn off the sound and then check who is calling or what the reminder is. Of course, social acceptance has the advantage of changing through the generations, much like fashion.

2.4. AUGMENTED REALITY'S FUTURE DIRECTION

Augmented reality is still in its infancy and, as such, its future possible applications are infinite. Advanced research in AR includes the use of head-mounted displays and virtual retinal displays for visualization purposes, as well as the construction of controlled environments containing any number of sensors and actuators. The MIT Media Lab project Sixth Sense [MiMa09] is the best example of AR research. It suggests a world where people can interact with information directly without requiring the use of any intermediate device. Other current research includes Babak Parviz's AR contact lens [Pa09] as well as DARPA's contact lens project [Wi08], and multiple MIT Media Lab research applications such as My-Shopping Guide [MeMa07] and TaPuMa [MiKu08]. Parviz's contact lens opens the door to an environment where information can only be viewed by the user, which can also be accomplished by using glasses instead of

contact lenses; however, in both cases, the advantage over using a cell phone, for instance, is that no one but the user can see the projected information, thus keeping it very personal and private. Cisco, on the other hand, has imagined a world where AR could be used to replace traditional fitting rooms by virtually trying on clothes, thus saving time and providing the ability to try on more clothes, which would increase the chances of a sale.

Figure 2.21: From top to bottom and left to right: DARPA's contact lens project [Wi08], MIT's Sixth Sense [MiMa09], Contactum's AR solutions [Co06]

This research also foresaw that augmented reality brings the possibility not only of enhancing our current senses, but also of complementing missing ones. In this thesis, we have designed a hearing augmentation AR application in which hearing-impaired users can see visual cues of what is being said to them in a natural and intuitive way.

We believe that new mobile devices, such as the iPhone, Android-based devices, and the iPad, are not yet well used in augmented reality. Indeed, most of the current applications involve gaming, entertainment and education, and while many already consider these to be "amazing apps" [Ma09], the author of this research believes that they have not yet reached their potential in terms of possibilities, opportunities and usefulness.

Figure 2.22: From top to bottom and left to right: Examples of futuristic augmented reality and Babak Parviz's contact lens [Pa09]

Even the future is not free of challenges for augmented reality. We see social acceptance issues, privacy concerns, and ethical concerns arising with the future of augmented reality applications in industry.

Social acceptance issues arise mostly from mobile devices, with the need for the devices to be subtle, discreet and unobtrusive, as well as fashionably acceptable, as was discussed in section 2.2.2, but also from systems that will require retraining of personnel and staff in order to be utilized. This research has shown that this might be the case with some

medical applications, and that the health system might decide against the use of augmented reality if it considers the retraining too costly to be viable. A means of easy integration of such systems will need to be developed to avoid such issues.

Privacy concerns arise not only with medical applications, but also with technologies that have the ability to detect and recognize people. For instance, the video presentation of MIT's WUW technology [MiMa09] shows an application that is capable of recognizing people and displaying information about them for the user to see. Although this information could be found online by anyone on websites such as social networks, it will raise problems, as many people will not appreciate being spotted that way, even if they do not mind having this information available online for anyone to see. A solution for applications similar to WUW's people-recognition feature would be to create a social network for the users of this technology, where they can decide whether or not they want to be recognized and what information about them they allow to be displayed. Non-users of this technology should not be recognized by the system unless they allow it by joining the social network.

When it comes to ethical concerns, the apprehension mostly comes from the fact that people tend to get carried away by technology, chasing things they see in Hollywood movies. We do not know where to put down limits for the use of technology, and we keep researching new types of applications and techniques as we see the potential grow. However, with augmented reality, it will be very important for developers to remember that augmented reality aims at simplifying the user's life by enhancing and augmenting the user's senses, not interfering with them. For instance, in the comments following Babak Parviz's contact lens article [Pa09], there were suggestions

from readers such as "tapping the optic nerve" or "plugging in to the optic nerves and touch and smell receptors", with claims that these would eventually be "more marketable approaches" and a "much more elegant solution". Although the commentators do realize that, as of today, research simply does not know enough about the human nervous system to do such things, the fact that thoughts and ideas have started to emerge in this direction raises ethical questions: is having technology come into direct interaction with our human senses something we want? As was mentioned earlier, augmented reality is about augmenting the real environment with virtual information; it is about enhancing/augmenting people's skills and senses, not replacing them with others [CaFu11].

3. USING THE IOS PLATFORM TO BUILD AN AUGMENTED REALITY APPLICATION

3.1. MOTIVATION

This thesis studies the use of the iOS platform for augmented reality by designing an augmented reality translation application dubbed iTranslatAR. The application's first phase design, which does not involve augmented reality but nevertheless brings knowledge about the capabilities of the iOS platform, was implemented as part of this research and is discussed in the next sections. In addition, techniques for using the AV Foundation framework to show and extract video information are discussed in section 3.4 as part of the final design of the augmented reality translation application, and constitute part of this thesis' focus on implementing augmented reality on the iOS platform.

3.2. FIRST PHASE DESIGN

The design of a translation application using pictures relies both on an OCR (Optical Character Recognition) engine and a translation tool. The author decided to use the Tesseract OCR engine and the Google Translate API as standard integration tools for OCR and translation, respectively. Both of these systems are open source and their integration into the iOS platform is discussed in section 3.3. In addition, the usage of both Tesseract OCR and Google Translate is relevant not only to the first phase design of iTranslatAR, but also to the final phase design that involves augmented reality.

iTranslatAR is composed of three views in what is referred to as a tab controller in iOS: a view for translating text through pictures, one for translating text through manual entry, and one for selecting the target language. In the first phase of iTranslatAR, this design offers the user the option to either take a picture or manually enter the text for translation. When the user decides to manually enter the text, a query is sent directly to the Google Translate API and the results are output without needing any further steps; since this is fairly simple and not in the scope of this thesis, this option will not be discussed further. The PictureViewController handles the picture translation and is the most complex view controller; it is discussed in more detail in section 3.2.1. A general system diagram of the first phase design picture translation can be seen in Figure 3.1.

When the user decides to do a picture translation, that is, to obtain a translation of the text present in an image, the user gets the option to either take a new picture or select one from the photo libraries. The diagram in Figure 3.1 starts when the user has either selected an image or taken a new one. The user first gets the option to resize the image and arrange it properly in the given textbox, thus giving the best possible image to the OCR. After the user is done resizing the image, it is automatically saved into the user's main photo library as a convenience. Once the user is satisfied with the image, the image is processed through the OCR to find and parse the text present in the image. Tesseract OCR offers the ability to input not only binary images, but also color images, although it should be noted that binary images produce better and faster results. Once the OCR is done processing and the text is returned in the form of a string, this string is passed in as the query parameter to the Google Translate API, to first detect the language and then translate it into the desired language as selected by the user, before

returning the final string in a new view controller. Figure 3.2 shows the steps a user would take to obtain a translation from a color picture.

Figure 3.1: iTranslatAR first phase design picture translation general system

Figure 3.2: iTranslatAR screenshots: (a) language selection; (b) image resizing and cropping; (c) input image; (d) result of OCR + translation (from French to English, in this case).
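As an illustration of the translation step in this pipeline, the sketch below sends the OCR output to the Google Translate REST service and parses the JSON response. The endpoint, parameter names, and the use of NSJSONSerialization (available from iOS 5; the thesis mentions a third-party JSON library instead) are assumptions made so the example is self-contained, and YOUR_API_KEY is a placeholder; this is not the thesis' own code, which is shown only as screenshots later.

#import <Foundation/Foundation.h>

// Minimal sketch of a Google Translate v2 query for an already-extracted string.
// Synchronous for clarity; in the app this would run off the main thread.
- (NSString *)translateText:(NSString *)text toLanguage:(NSString *)target {
    NSString *escaped = [text stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
    // Omitting the `source` parameter lets the service auto-detect the input language.
    NSString *urlString = [NSString stringWithFormat:
        @"https://www.googleapis.com/language/translate/v2?key=YOUR_API_KEY&q=%@&target=%@",
        escaped, target];

    NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:urlString]];
    if (!data) {
        return nil;                                    // network error: caller decides what to show
    }

    // The v2 response nests the result under data.translations[0].translatedText.
    NSDictionary *json = [NSJSONSerialization JSONObjectWithData:data options:0 error:NULL];
    NSArray *translations = [[json objectForKey:@"data"] objectForKey:@"translations"];
    return [[translations objectAtIndex:0] objectForKey:@"translatedText"];
}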

3.2.1. PICTURE VIEW CONTROLLER

The PictureViewController in the iTranslatAR first phase design is the most complex view. It is responsible for reading, resizing and saving the input image before performing OCR, for handling the Tesseract OCR API, and for interacting with the TranslationViewController, which is responsible for detecting the source language and translating it into the appropriate target language. Source language refers to the input string's language and target language refers to the desired translation language, or the output string's language.

Like many libraries often needed for implementing augmented reality on the iOS platform, Tesseract OCR is a foreign library that requires creating the appropriate static libraries in order to use it correctly in iPhone/iPad applications. This can be done using a shell script run through the terminal. Once the static libraries are created, the new functionality can often be used right away. In the case of Tesseract OCR, access to the library functions requires using C++ code in addition to the regular Objective-C code used for iOS development. This can be achieved by renaming the implementation file, also known as the .m file in Xcode, to a .mm file, which implies mixing C++ code with Objective-C code.

The PictureViewController is composed of a UIImageView for displaying the picture chosen by the user, a TessBaseAPI ivar for executing the OCR, an NSString for storing the result of the OCR, a ZoomableImage ivar for resizing and cropping the image, and a TranslationViewController for showing the translation of the text in the selected image. The PictureViewController also contains a few private methods to handle image modification and OCR on a background thread. Other methods simply involve

handling translation or user actions and will not be discussed in detail, as they are outside the scope of introducing the iOS platform methods and techniques for a possible implementation of augmented reality.

Image modifications, such as allowing the user to resize and crop the image, are handled by performImageOperationsBeforeProcessingMediaWithInfo, as seen in Figure 3.3. This method performs scaling, cropping and resizing of the image in the background, based on the user's prior changes. Once this method is done, it calls a new method for performing OCR on the new image in the background.

Figure 3.3: Image operation set up snapshot

OCR is performed on a background thread by readAndProcessImage, seen in Figure 3.4. As seen there, TesseractRect is the Tesseract function called to handle the OCR;

it returns a char array containing the text found in the image, which is then converted into an NSString and stored for translation. Once the OCR is done, the method exits the background thread to perform translation.

Figure 3.4: Image processing for OCR snapshot

Translation is performed using the Google Translate API and a JSON library to extract the results. In the non-augmented-reality design of iTranslatAR, this is not performed in the background, as the results are shown on a different view controller, as seen in Figure 3.2 (d). However, when adapting iTranslatAR to augmented reality, this step will need to be implemented in the background so as not to affect the user's experience. Furthermore, an additional function of the Tesseract API will be introduced to detect the position of the text in the image, which is necessary to overlay the translation in place. These changes are discussed in the next section.
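Since Figures 3.3 and 3.4 are screenshots and are not reproduced in this text, the following is a minimal sketch of the same idea: running TesseractRect off the main thread and handing the recognized string back for translation. The use of Grand Central Dispatch, the instance variables (tess, imageData, width, height) and the translateAndDisplay: helper are assumptions made for illustration, not the thesis' exact code; the file would be compiled as Objective-C++ (.mm) so the C++ Tesseract call is available.

// Sketch: OCR on a background queue, then translation on the main thread.
// Assumes `tess` is an initialized tesseract::TessBaseAPI* and `imageData`, `width`,
// `height` describe the RGBA bitmap produced by the resize/crop step (hypothetical ivars).
- (void)readAndProcessImage {
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        // TesseractRect recognizes the given rectangle and returns a newly allocated C string.
        char *recognized = tess->TesseractRect(imageData,
                                               4,            // bytes per pixel (RGBA)
                                               4 * width,    // bytes per line
                                               0, 0, width, height);
        NSString *ocrText = [NSString stringWithUTF8String:recognized];
        delete [] recognized;                                // the caller owns the returned buffer

        dispatch_async(dispatch_get_main_queue(), ^{
            [self translateAndDisplay:ocrText];              // hypothetical helper: Google Translate + UI
        });
    });
}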

3.3. FINAL PHASE DESIGN

The final phase design of iTranslatAR refers to the transformation of iTranslatAR into an augmented reality translation application inspired by Word Lens [QV11], which mainly involves (near) real-time translation of the camera video feed as opposed to a single image frame. One of the main steps in such a transformation is to adapt the PictureViewController into a CameraViewController that shows the device's video feed in real time, as opposed to requiring the user to take or select a picture. This transformation involves the implementation of the AV Foundation framework of the iOS platform, which is discussed in section 3.4. While the implementation of the AV Foundation framework is discussed, the final phase design of iTranslatAR was not implemented.

After the CameraViewController is implemented, the frames need to be analyzed in a fashion similar to that of the PictureViewController (Figure 3.1); however, since the desired result is (near) real-time translation, these steps not only need to be performed in the background, they need to be performed in an optimized manner such that heavy operations such as OCR occur only when necessary, that is, when the frames have changed significantly enough to indicate new text to be detected, and the translation of the text occurs only once each time new text is detected. A general system diagram of the video feed text translation can be seen in Figure 3.5.

The main user interface for the CameraViewController presents the video feed to the user. At this point, it is important to note that the final design of iTranslatAR is still composed of three views, as in the first design, and that the views for translation of manually entered text and for language selection were not changed in the final design; they are not discussed further, as they are fairly simple views that are not covered by this thesis. In

the background, the system takes in the first frame, extracts it into an image, analyzes the image's general features (such as dominant color), performs OCR on the image, and translates the resulting text from the OCR into the desired target language. The image's general features are stored as a means to identify whether the next frame is different enough from the original frame OCR was performed on. If the image is considered different enough, based on a first threshold determined through testing, OCR and translation are performed on the new frame and the new frame's features are stored in place of the original frame's features. In addition, when OCR is performed on a frame, both the text and the position of the detected text in a rectangular bounding box are conveniently returned by the Tesseract OCR API engine through TesseractToBoxText, which provides the rectangular bounds of the detected text. This information will be used to overlay the translation of the detected text over the original text in the video feed. If the image was not considered different enough, a second threshold (also determined through testing) is checked to determine whether the text position has moved enough to recalculate it using TesseractToBoxText again. This technique allows optimization of the code by performing OCR, text position detection, and translation only when necessary. However, it is important to note that when/if the design is implemented, this technique is subject to change based on user perception, to provide the best possible user experience. A sketch of this gating idea is given below.
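The following is a minimal sketch of the two-threshold gating described above; the feature used (mean luminance), the threshold constants, the lastFrameFeature ivar, and all helper methods are illustrative assumptions, since the final phase design was not implemented.

// Sketch of the two-threshold gating. Everything here (feature choice, constants,
// helper names) is illustrative; the thesis leaves the thresholds to testing.
static const float kSceneChangeThreshold = 0.15f;   // assumed value
static const float kTextMotionThreshold  = 20.0f;   // assumed value, in pixels

- (void)processFrameImage:(UIImage *)frame {
    float feature = [self meanLuminanceOfImage:frame];            // hypothetical feature extractor

    if (fabsf(feature - lastFrameFeature) > kSceneChangeThreshold) {
        // Scene changed: run the heavy path (OCR + TesseractToBoxText + translation).
        lastFrameFeature = feature;
        [self runOCRAndTranslateImage:frame];
    } else if ([self estimatedTextDisplacementInImage:frame] > kTextMotionThreshold) {
        // Same text, but it moved: only recompute the bounding box for the overlay.
        [self updateTextBoundingBoxForImage:frame];
    }
    // Otherwise reuse the cached translation and overlay position for this frame.
}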

Figure 3.5: iTranslatAR final phase design video feed translation general system

3.4. AV FOUNDATION FRAMEWORK IMPLEMENTATION

AV Foundation is the framework necessary to access the camera API and extract video frames for processing. It was originally introduced in iOS 2.2 for playing audio

content, was later improved in iOS 3.0 to support and manage audio recording and information, and finally allowed for media editing, video capture, etc. in iOS 4.0.

To access video frames, AV Foundation requires the use of an AVCaptureSession to capture frames and an AVCaptureVideoPreviewLayer to show the camera output. The AVCaptureSession is composed of one or more input devices and one or more output devices. The input devices include the back and front cameras (if applicable) and audio. AVCaptureVideoDataOutput is the iOS class used for processing the uncompressed frames from the video capture that are necessary to implement augmented reality. It requires setting up a sample buffer and a queue to run the capture on and invoke the callbacks. It is also possible to alter the session frame sizes to either optimize performance or enhance viewing. When a new video sample buffer is captured, the sample buffer delegate method captureOutput:didOutputSampleBuffer:fromConnection: is called, and the frame can be processed there. If the queue is blocked when new frames are captured, those frames are automatically dropped after a certain amount of time; however, in order to manage potential memory usage increases, it is better to optimize the code to process frames only when necessary, as suggested in sections 3.3 and 4.2. The code used in the next chapter to implement the AV Foundation framework to extract and process camera frames is shown in Figure 3.6.

Figure 3.6: AV Foundation framework video frame analysis setup snapshot
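Figure 3.6 is a screenshot and is not reproduced in this text; the sketch below shows the standard iOS 4 pattern the section describes (capture session, video data output, sample buffer delegate). The preset, queue name, and the processPixelBufferIfNeeded: helper are assumptions, and the view controller is assumed to adopt AVCaptureVideoDataOutputSampleBufferDelegate; this is an approximation, not the author's exact code.

#import <AVFoundation/AVFoundation.h>
#import <CoreMedia/CoreMedia.h>

// Minimal capture pipeline: camera input, uncompressed frame output, live preview layer.
- (void)setupCaptureSession {
    AVCaptureSession *session = [[AVCaptureSession alloc] init];
    session.sessionPreset = AVCaptureSessionPresetMedium;        // frame size traded off for speed

    AVCaptureDevice *camera = [AVCaptureDevice defaultDeviceWithMediaType:AVMediaTypeVideo];
    AVCaptureDeviceInput *input = [AVCaptureDeviceInput deviceInputWithDevice:camera error:nil];
    [session addInput:input];

    AVCaptureVideoDataOutput *output = [[AVCaptureVideoDataOutput alloc] init];
    output.alwaysDiscardsLateVideoFrames = YES;                   // drop frames if the queue is busy
    [output setSampleBufferDelegate:self
                              queue:dispatch_queue_create("videoQueue", NULL)];
    [session addOutput:output];

    // The preview layer shows the live camera feed behind the augmented overlays.
    AVCaptureVideoPreviewLayer *preview = [AVCaptureVideoPreviewLayer layerWithSession:session];
    preview.frame = self.view.bounds;
    [self.view.layer addSublayer:preview];

    [session startRunning];
}

// Called for every captured frame on the capture queue.
- (void)captureOutput:(AVCaptureOutput *)captureOutput
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection {
    // Extract the pixel buffer and hand it to the OCR/translation pipeline only when needed.
    CVImageBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);
    [self processPixelBufferIfNeeded:pixelBuffer];                // hypothetical helper
}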

4. DESIGN OF AN AUGMENTED REALITY HEARING APPLICATION

4.1. MOTIVATION

This research aims at showing that augmented reality is not only fit to augment the sense of sight, but has the potential to be applied to all senses. In addition, the author of this thesis believes that state-of-the-art technologies, such as augmented reality, often fail to see how they can help impaired people in their everyday lives and instead sometimes focus too much on improving non-impaired people's lives.

4.2. DESIGN

This thesis researched the design of a hearing augmented reality application for helping hearing-impaired people converse with hearing people in cases where the user does not know how to read lips and the interlocutor and/or user does not know sign language. The detected speech is output next to the interlocutor's head in a cartoon-like bubble, so as to be simple and natural to use. The device chosen for this application is a second-generation iPad, as it is equipped with a camera, is rather mobile, and is believed by the author to be a great platform for augmented reality.

The design of such an application relies on a speech recognition system as well as a face detector. Since speech recognition is not one of the supported APIs of iOS 4, and using a backend server such as Dragon NaturallySpeaking [NC] would be too expensive for this type of project and might introduce additional network latency

that is not welcome in augmented reality, the author of this research decided to use the open source OpenEars library, which offers the ability to introduce a five-thousand-word dictionary, with speech detection and recognition of those five thousand words happening on the iPad itself. The face detection is performed using the OpenCV face detection algorithm. It is executed in the background on the frame to be processed when speech has been recognized; while this frame is being analyzed for face detection, the video feed continues normally. Once the frame has been analyzed, the next frame from the video feed goes through face detection, and the frames that occurred in between are not analyzed. This method allows enough time for the algorithm to run without slowing down the video feed and without making the app run out of memory. Indeed, the algorithm runs slower than the video feed, but not slowly enough for the user's position to have changed much in the allotted time, making the delay barely noticeable. A general system diagram of the application can be seen in Figure 4.1.

The main user interface presents the video feed to the user. In the background, the system continually listens for speech, and if speech is detected, it searches frames every so often for a face, as described previously. While a frame is being analyzed, the video feed continues uninterrupted, and once the analysis is done, the next frame in the video feed is analyzed in turn. If a face is found, its position is returned and the corresponding text from the speech recognition algorithm is output next to the detected face. When speech is detected, a Boolean (speechDetected) allowing the next frame in the video to be processed is set to YES (the equivalent of TRUE in Objective-C); otherwise, the video frames keep showing the camera feed uninterrupted (Figures 4.2 and 4.4). When speech is detected, a timer is started for one minute (60 seconds), at the end of which the

speechDetected Boolean is set back to NO. While a frame is being processed, a Boolean (frameAnalyzed) that interrupts analysis of the next frame is set to YES, and once the previous frame is done being analyzed for face detection (the results are returned from the face detection method), the Boolean is set back to NO (the equivalent of FALSE in Objective-C), thus allowing the next frame to be analyzed for face detection (Figure 4.3). This technique allows optimization of the code by performing face detection only when necessary, without affecting the user's experience.

In addition to the blocks seen in Figure 4.1, the prototype of the application implemented for this thesis added the ability to use Google Translate to translate the recognized speech. As this feature is an addition meant to make the system useful for the traveler as well as the hearing-impaired, the implementation of the Google Translate API is not discussed further.

Figure 4.1: iHear General System Design

Figure 4.2: Speech detected snapshot

Figure 4.3: Speech detected and frame analyzed for face snapshot

Figure 4.4: Frame processing snapshot
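The snapshots in Figures 4.2 through 4.4 are screenshots and are not reproduced in this text; the sketch below captures the gating logic they illustrate. The callback name, property names, and helper methods are illustrative assumptions (OpenEars-style delegate naming), not the prototype's exact code.

// Sketch of the speechDetected / frameAnalyzed gating described in section 4.2.
- (void)pocketsphinxDidDetectSpeech {               // assumed OpenEars-style callback
    self.speechDetected = YES;                      // allow frames to be analyzed
    [NSTimer scheduledTimerWithTimeInterval:60.0    // one-minute window per detection
                                     target:self
                                   selector:@selector(speechTimedOut:)
                                   userInfo:nil
                                    repeats:NO];
}

- (void)speechTimedOut:(NSTimer *)timer {
    self.speechDetected = NO;                       // stop analyzing frames until new speech
}

- (void)handleVideoFrame:(UIImage *)frame {
    if (!self.speechDetected || self.frameAnalyzed) {
        return;                                     // show the camera feed uninterrupted
    }
    self.frameAnalyzed = YES;                       // skip the intermediate frames
    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        CGRect face = [self detectFaceInImage:frame];        // OpenCV-backed detector (section 4.4)
        dispatch_async(dispatch_get_main_queue(), ^{
            [self showSpeechBubbleNextToFace:face];          // overlay the recognized text
            self.frameAnalyzed = NO;                         // ready for the next frame
        });
    });
}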

The next sections, 4.3 and 4.4, discuss in more detail the speech recognition system and the face detection algorithm used, respectively.

4.3. OPENEARS SPEECH RECOGNITION SYSTEM

OpenEars is an open source iOS library for implementing English-language speech recognition on the iPhone and iPad. It uses the CMU PocketSphinx, CMU Flite, and MITLM (MIT Language Modeling) libraries, but within the scope of this research the author is only interested in CMU PocketSphinx, discussed in section 4.3.3. OpenEars has many interesting capabilities, including the ability to:

- Listen continuously for speech on a background thread, while suspending or resuming speech processing on demand,
- Dispatch information to any part of your app about the results of speech recognition and speech, or about changes in the state of the audio session, such as an incoming phone call or headphones being plugged in,
- Support JSGF grammars; however, this model is not used in the scope of this research,
- Dynamically generate new ARPA language models (which is the kind of language model being used, and is discussed in section 4.3.1) in-app, derived from input from an NSArray of NSStrings,
- Switch between ARPA language models on the fly,
- Be easily interacted with via standard and simple Objective-C methods,
- Control all audio functions with text-to-speech and speech recognition in memory instead of writing audio files to disk and then reading them,

- Drive speech recognition with a low-latency Audio Unit driver for the highest responsiveness,
- Be installed in a Cocoa-standard fashion using static library projects that, after initial configuration, allow for targeting or re-targeting any SDKs or architectures that are supported by the libraries by making changes to the main project only.

The OpenEars library presents the advantage of being able to use Objective-C calls to the C functions of PocketSphinx, and the option to create new language models on the fly through the use of the MITLM language modeling tool, which is out of the scope of this thesis. While this thesis' focus is not on speech recognition or language modeling, the author provides an overview of the speech recognition and language modeling algorithms used by the PocketSphinx library used in OpenEars.

4.3.1. ARPA LANGUAGE MODEL

Speech recognition and translation into text are based on comparing the sounds heard over the microphone with what is called a language model or grammar, consisting of the words the app knows and has the ability to recognize, their pronunciations as understood by the acoustic model, and some rules about how likely those words are to be said or about what grammatical contexts the words can be understood in. In the case of this thesis, we used an ARPA-formatted language model.

The acoustic model characterizes how sound changes over time. Each phoneme, or speech sound, is modeled by a sequence of states and signal observation probabilities, the latter being the distribution of sounds that might be heard in that state. For this project,

PocketSphinx is implemented using a five-state phonetic model, which means that each phone model possesses five states.

Language models used in decoders capture the constraints inherent in the word sequences found in a corpus, which can simply be a .txt file consisting of the words to be recognized. This information is used to restrict the search for a matching word and, as a result, to improve recognition accuracy. The language model used is statistical, as this is the customary way to configure a Sphinx decoder, which is the decoder PocketSphinx inherits from. The ARPA format lists 1-, 2-, and 3-grams along with their likelihood in the first field and a back-off factor in the third field.

Language models such as ARPA-formatted language models are created by language modelers. The language modeler used in this project is lmtool, and it uses quicklm, a statistical language modeler. Quicklm is meant for corpora of fewer than 50,000 words, which is more than ten times the maximum number of words we can use with OpenEars. It applies a uniform ratio discount to all n-grams, which can be varied to control the looseness of the model. This technique has been shown to produce better recognition performance on small corpora.

Figure 4.5: ARPA file format

4.3.2. LMTOOL LANGUAGE MODELER

The lmtool is the language modeler used to create the language model for this project. It attempts to give a pronunciation for every word-like string provided in the corpus, which corresponds to every sequence of characters delimited by whitespace. Pronunciation is determined in two ways: by querying a large pronunciation dictionary, called cmudict, and by applying letter-to-sound (LtoS) rules to words not found in the dictionary. In fact, if the word is not found in the cmudict pronunciation dictionary, the first step is to apply a few affix rules to the unknown word to see if there is a base form available that can be deterministically modified, such as for plurals or possessives, which are easily predictable. Then, if nothing is found, the LtoS rules are applied. These rules are based on systematic patterns in the English language.
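As a concrete illustration of the two files lmtool produces (described later in this section), a toy dictionary and ARPA fragment are shown below. The words, phone sequences, and log-probabilities are invented for illustration and are not the project's actual model; the real ARPA layout is the one shown in Figure 4.5.

example.dic - each word followed by its phone sequence:

HELLO  HH AH L OW
WORLD  W ER L D

example.lm - ARPA format (log10 probability, n-gram, optional back-off weight):

\data\
ngram 1=4
ngram 2=3
ngram 3=2

\1-grams:
-0.6990 <s>     -0.3010
-0.6990 </s>
-0.6990 HELLO   -0.3010
-0.6990 WORLD   -0.3010

\2-grams:
-0.3010 <s> HELLO     0.0000
-0.3010 HELLO WORLD   0.0000
-0.3010 WORLD </s>

\3-grams:
-0.3010 <s> HELLO WORLD
-0.3010 HELLO WORLD </s>

\end\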

The automation of the letter-to-sound rules involves a few steps. First, the lexicon of choice must be pre-processed into a suitable training set, and a set of allowable pairs of letters to phones must be defined. Next, one needs to construct the probabilities of each letter-phone pair and align letters to an equal-length set of phones before extracting the data suitable for training. Finally, CART models for predicting phones from letters and contexts need to be built; additionally, a lexical stress assignment model can be added, if necessary, to help in the phonetic assignment.

However, there are a few technical limitations to pronunciation generation using the lmtool: word tokens are limited to 35 characters and may only contain ASCII characters. These limitations stem from the lmtool currently being Anglo-centric, so non-English words, such as loan words, proper names and technical terms, result in pronunciation guesses that can often be wrong.

The lmtool produces two files: a dictionary file (.dic), which is a list of words each with a sequence of phones, and a language model file (.lm), which describes the likelihood, probability, or penalty associated with seeing a sequence or collection of words. The language model file for this project is a trigram model, in which the probabilities of sequences of three words, two words, and one word are combined together using back-off weights.

4.3.3. POCKETSPHINX LIBRARY

PocketSphinx's speech recognition algorithm uses the hidden Markov model (HMM), which is a statistical approach to pattern recognition. It uses both a language model file and a dictionary file to compute a hypothesis output based on the speech

heard by the device. An overview of a general speech recognition system is shown in Figure 4.6.

Figure 4.6: General Speech Recognition System

The goal is to find the most likely word sequence $\hat{W}$, given a series of $N$ acoustic vectors $Y^N = \{y(1), \ldots, y(N)\}$:

$$\hat{W} = \arg\max_{W} P(W \mid Y^N),$$

which, after applying Bayes' rule, yields

$$\hat{W} = \arg\max_{W} \frac{p(Y^N \mid W)\, P(W)}{p(Y^N)} = \arg\max_{W} p(Y^N \mid W)\, P(W),$$

where the most likely word sequence is invariant to the likelihood of the acoustic vectors $p(Y^N)$. The search for the optimal word sequence makes use of two distributions: the likelihood of the acoustic vectors given a word sequence, $p(Y^N \mid W)$, generated by the acoustic model, and the likelihood of a given string of words, $P(W)$, generated by the language model, as shown in Figure 4.6 [St03].

As previously discussed in sections 4.3.1 and 4.3.2, the language model file (.lm) represents the syntactic and semantic content of speech and the dictionary file (.dic)

handles the relationship between words and feature vectors. In Figure 4.6, the acoustic models are formed from phones and the lexicon represents the mapping from phones to word sequences. In PocketSphinx, this is handled by the dictionary file (.dic).

PocketSphinx is an open source speech recognition engine developed at Carnegie Mellon University (CMU) that inherits features from the Sphinx-II API. It is a real-time continuous speech recognition system for hand-held devices that runs on an optimized platform in which the main Sphinx-II algorithm, the Gaussian mixture model (GMM) computation, has been optimized. The PocketSphinx platform uses fixed-point arithmetic, which introduces some error due to the Fast Fourier Transforms and Gaussian computations; however, floating-point operations, although more accurate, are too slow to be viable in a real-time system, and the additional error rate introduced is minimal.

A Gaussian mixture model is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in biometric systems such as speech recognition systems. In speech recognition, the GMM measures acoustic characteristics:

$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i\, g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),$$

where $\mathbf{x}$ is a D-dimensional continuous-valued data vector (i.e., measurements or features), $w_i$, $i = 1, \ldots, M$, are the mixture weights, and $g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, $i = 1, \ldots, M$, are the component Gaussian densities [Re08].

The GMM evaluation used in the PocketSphinx framework is divided into four layers of computation: the frame layer, which includes all GMM computation; the GMM

layer, which corresponds to the computation of a single GMM; the Gaussian layer, which corresponds to the computation of a single Gaussian; and the component layer, which corresponds to computations related to one component of the feature vector, assuming a diagonal covariance matrix. Each of these layers is improved in PocketSphinx to achieve optimal speed.

4.4. OPENCV FACE DETECTION ALGORITHM

The face detection method used for detecting the presence of faces in a frame first involves a conversion from an Objective-C UIImage image object to an OpenCV IplImage image object (Figure 4.8). The main method for detecting the presence of a face in a frame, seen in Figure 4.7, returns the position of the first detected face in terms of the coordinates of the IplImage created from the code in Figure 4.8. The IplImage coordinates have an origin in the bottom right corner and dimensions of 360x480, while iPad and iPhone UIViews have an origin at the top left corner and, in the case of this thesis' prototype, a UIView dimension of 780x1004. Therefore, when using the method in Figure 4.7 to detect the presence and position of a face in real time, the resulting coordinates must be converted to iPhone/iPad view coordinates before they are used.
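The exact conversion formula used in the prototype is not reproduced in this text; a sketch consistent with the stated dimensions and origins (plain linear scaling with both axes mirrored) is shown below. The scale factors and flip directions are assumptions made for illustration.

#import <CoreGraphics/CoreGraphics.h>

// Hypothetical conversion from IplImage coordinates (origin bottom right, 360x480) to
// UIView coordinates (origin top left, 780x1004). Factors and flips are assumptions.
static CGPoint ViewPointFromIplPoint(CGPoint p) {
    const CGFloat kIplWidth = 360.0,  kIplHeight = 480.0;
    const CGFloat kViewWidth = 780.0, kViewHeight = 1004.0;
    CGFloat x = kViewWidth  - (p.x / kIplWidth)  * kViewWidth;    // mirror and scale the x axis
    CGFloat y = kViewHeight - (p.y / kIplHeight) * kViewHeight;   // mirror and scale the y axis
    return CGPointMake(x, y);
}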

Figure 4.7: Face detection method
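The method in Figure 4.7 is shown only as a screenshot; the sketch below illustrates the classic Haar-cascade face detection available in the OpenCV C API of that era, returning the first detected face. The cascade file name and detection parameters are illustrative assumptions and are not necessarily those used by the prototype.

#include <opencv/cv.h>

// Sketch of Haar-cascade face detection on an IplImage, returning the first detected face.
CvRect DetectFirstFace(IplImage *image) {
    static CvHaarClassifierCascade *cascade = NULL;
    static CvMemStorage *storage = NULL;
    if (!cascade) {
        // Load the bundled cascade XML once; reloading per frame would be far too slow.
        cascade = (CvHaarClassifierCascade *)cvLoad("haarcascade_frontalface_default.xml",
                                                    NULL, NULL, NULL);
        storage = cvCreateMemStorage(0);
    }
    cvClearMemStorage(storage);

    CvSeq *faces = cvHaarDetectObjects(image, cascade, storage,
                                       1.2,               // scale factor between pyramid levels
                                       2,                 // minimum neighboring detections
                                       CV_HAAR_DO_CANNY_PRUNING,
                                       cvSize(40, 40));   // smallest face considered
    if (faces && faces->total > 0) {
        return *(CvRect *)cvGetSeqElem(faces, 0);          // first detected face
    }
    return cvRect(0, 0, 0, 0);                             // no face found
}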

Figure 4.8: Creating an OpenCV image object from an Objective-C image object

4.4.1. OPENCV LIBRARY BRIEF INTRODUCTION

OpenCV is an open source computer vision library written in C/C++ whose programming functions mainly target real-time computer vision. OpenCV was originally developed by Intel and has been supported by Willow Garage since mid-2008. The library was developed to be cross-platform and focuses on real-time image processing.
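For reference, since Figure 4.8 is a screenshot, the usual pattern for the UIImage-to-IplImage conversion it depicts is sketched below: the CGImage is drawn into a bitmap context backed by the IplImage's own pixel buffer, then converted to grayscale for the Haar detector. Details such as the grayscale conversion step are assumptions, not the prototype's exact code.

#import <UIKit/UIKit.h>
#include <opencv/cv.h>

// Sketch: draw the UIImage's CGImage into a bitmap context that writes directly into the
// IplImage's buffer, then convert to a single-channel image for face detection.
IplImage *IplImageFromUIImage(UIImage *image) {
    CGImageRef cgImage = image.CGImage;
    int width  = (int)CGImageGetWidth(cgImage);
    int height = (int)CGImageGetHeight(cgImage);

    IplImage *rgba = cvCreateImage(cvSize(width, height), IPL_DEPTH_8U, 4);
    CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
    CGContextRef context = CGBitmapContextCreate(rgba->imageData, width, height,
                                                 8, rgba->widthStep, colorSpace,
                                                 kCGImageAlphaPremultipliedLast |
                                                 kCGBitmapByteOrder32Big);
    CGContextDrawImage(context, CGRectMake(0, 0, width, height), cgImage);
    CGContextRelease(context);
    CGColorSpaceRelease(colorSpace);

    // The Haar detector works on single-channel images, so convert RGBA to grayscale.
    IplImage *gray = cvCreateImage(cvSize(width, height), IPL_DEPTH_8U, 1);
    cvCvtColor(rgba, gray, CV_RGBA2GRAY);
    cvReleaseImage(&rgba);
    return gray;    // caller releases with cvReleaseImage()
}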