RESEARCH ON SPOKEN LANGUAGE PROCESSING
Progress Report No. 29 (2008) Indiana University

A Software-Based System for Synchronizing and Preprocessing Eye Movement Data in Preparation for Analysis 1

Mohammad B. Afaneh, Visal Kith, and Tonya R. Bergeson 2

Speech Research Laboratory
Department of Psychological and Brain Sciences
Indiana University
Bloomington, Indiana 47405

1 This work was supported by NIH-NIDCD Research Grant R21DC06682 and Training Grant T32DC00012. We thank Luis Hernández for his insightful comments on this project. 2 Babytalk Research Laboratory, Department of Otolaryngology Head and Neck Surgery, Indiana University School of Medicine. Correspondence concerning this article should be addressed to Tonya R. Bergeson, Ph.D., Department of Otolaryngology Head and Neck Surgery, Indiana University School of Medicine, 699 West Drive RR044, Indianapolis, IN
AFANEH, KITH, AND BERGESON

A Software-Based System for Synchronizing and Preprocessing Eye Movement Data in Preparation for Analysis

Abstract. When upgrading or adding a new component to an existing system in a research lab, the problem of incompatibility often arises. In this paper, we present a software-based solution for integrating and synchronizing an eye-tracking system with software used for stimulus presentation in an infant speech research lab. The algorithms developed and implemented in this system draw on several families of image and data processing techniques. The solution presented is specific to our particular system setup; nevertheless, it can easily be applied to any similar setup that pairs an eye-tracker system with stimulus presentation software.

Introduction

Traditionally, researchers of visual attention and perception have made use of techniques such as monitoring a live view of the subject's head and face to get a rough idea of gaze direction (e.g., looking right, left, or center). With recent advances in eye tracking technology, however, eye tracking systems have been introduced and integrated into visual perception laboratories where both high accuracy and high resolution are necessary to investigate looking behavior towards a variety of visual scenes. The use of eye trackers has several advantages over the traditional methods. The first and most important advantage is increased accuracy (up to 0.5 degrees of visual angle). Second is the measurement of other useful information in addition to the direction of gaze (e.g., pupil diameter, blinks, head position, and pupil position). Another very important advantage is the ability to have external software analyze eye tracker output data. With such software, the detection of blinks, fixations, dwell times, and saccades can be done automatically.
A typical eye tracker consists of a camera, which is encircled by a ring of infrared LEDs, and a control unit in which the captured image is processed before being transferred to a PC or other interface device. The ring of LEDs illuminates the eye, and when placed on the axis of the camera lens it produces a distinctive effect on the pupil: the subject's pupil appears as a bright object in the captured image, similar to the red-eye effect in photography. The infrared light also causes a reflection off the cornea. By computing the vector between the corneal reflection and the pupil center, the system can compute the direction of the subject's gaze.

In many laboratories, software specific to an operating system is used for presenting stimuli in experiments. For example, Habit software, which runs on Mac OS, is used in several infant laboratories (Cohen, Atkinson, & Chaput, 2004). Experiments run with Habit allow the duration of trials to be set online according to predefined criteria regarding the attention and behavior of the subject. To do this, Habit monitors the operator's keystrokes on a computer keyboard, which tell the software where the subject is looking (left, right, or center); according to these keystrokes, Habit determines the duration of the trial on the screen and moves on to the next trial. The output file of the program contains the direction of looks and cumulative look times in each trial.
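The pupil/corneal-reflection idea described earlier can be illustrated with a minimal Python sketch. The vector computation matches the description above; the linear screen mapping and its gain and offset values are purely hypothetical stand-ins for the per-subject calibration a real tracker performs.

```python
import numpy as np

def gaze_vector(pupil_center, corneal_reflection):
    """Vector from the corneal reflection to the pupil center, in pixels."""
    return np.asarray(pupil_center, dtype=float) - np.asarray(corneal_reflection, dtype=float)

def to_screen(vec, gain=(40.0, 40.0), offset=(320.0, 240.0)):
    """Map the eye-image vector to screen coordinates.

    The linear mapping and the gain/offset values here are hypothetical;
    a real tracker fits this mapping per subject during calibration.
    """
    return (offset[0] + gain[0] * vec[0], offset[1] + gain[1] * vec[1])
```

Because the corneal reflection stays nearly fixed while the pupil moves with the eye, the difference vector changes with gaze even when the head shifts slightly, which is what makes the technique robust.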
Problem Statement

In the infant laboratory setup described above, the Mac sends both video output and an audio stimulus to a TV screen or a monitor to be viewed by the subject. When an eye tracker is introduced into the system, the video output first has to pass through a scan converter device before continuing on its way to the TV screen. This device splits the video signal into two signals: one which goes into the eye tracker control unit, and a second which passes to the monitor to be displayed (see Figure 1).

Figure 1. System model representing the interaction between a Mac, a PC, and an eye-tracker system. The Mac sends video and audio output to the TV through a scan converter. The scan converter splits the video signal into two: one signal goes to the TV and one goes to the eye-tracker control unit. The eye-tracker control unit superimposes crosshairs on the frames indicating the gaze target of the subject and then sends the result to the scene monitor to be captured by the PC. At the same time, the eye-tracker control unit also receives video input from the eye-tracker camera, which is displayed on the eye monitor.

The eye tracker obtains the video of the stimulus and then superimposes crosshairs on the frames indicating the gaze target of the subject. In such a setup the eye tracker receives the video as input, but does not know the start or end frame of a trial within an experiment. Also, the output file consists of rows of eye movement data, each of which corresponds to one frame (or 1/60th of a second) for the entire experiment. That is, the output file contains no separation between trials within each experiment.
The separation between trials is very important since statistical analyses are performed using data from each trial within an experiment. Finally, each video is time-stamped; we will take advantage of this timestamp as part of our solution.
Proposed Solution

The solution can be approached in two different ways: hardware or software. Each has its advantages and disadvantages. The advantage of software over hardware is usually cost, and the disadvantage is usually speed. Another important advantage of software-based solutions is portability. To avoid adding new hardware components to the system and to reduce cost, we chose the software approach.

The video output presented on the TV screen by Habit consists of pretest, habituation, and/or test phases of stimuli, each of which is called a trial. In between the different trials, a short video (the "attention getter") is repeated to draw the attention of the infant. The general idea behind our solution is to make use of the captured attention getter video to detect the start and end frames of the different trials. The program can then extract this information, detect the rows in the output data file corresponding to these trials, and mark them as separate trials for the purpose of later analyses.

Even though eye movement data usually exist for the majority of the trials, the output of the Habit software can still serve as a backup of the gaze direction results in cases where the eye tracker loses data or does not work properly. It also serves to mark the trials in the output file with their specific descriptions. Finally, our hope was to minimize the required user interaction and make the software as easy to use as possible. We chose to develop a program with a simple Graphical User Interface (GUI) with this in mind. Figure 2 shows the design and appearance of the GUI.

Figure 2. The GUI is designed to make the software as easy to use as possible. It contains step-by-step instructions for the user to follow to carry out the data analysis.
Algorithm Discussion

There are two phases in the proposed algorithm. The first phase consists of the image processing and object recognition steps, which help in detecting the different trials in the video. The second phase is the data processing phase, which includes reading, preprocessing, and then marking the rows of the eye movement data file.

Phase I

Figure 3 shows a flow chart of the basic steps in the first phase of our system solution. To make the process of the algorithm clearer we will describe each step in detail. The first step in this phase is to split the video file into a sequence of JPEG image files, with each file name containing the sequence number of that image in the original video (frame0001.jpg, frame0002.jpg, etc.). This video decomposition allows us to process and analyze each frame in its original sequential order. This step is also convenient for programs that work with images, rather than extracting individual frames from the original video file format.

The program then loops over all the image files in order, starting with the first frame (frame_number = 1). To tell the software whether to look for a trial start or end frame in the series of frames, we define a variable which is set to 1 when looking for the start frame of a trial (start_flag = 1), and set to zero when looking for a trial end frame (start_flag = 0). If start_flag is set to 1, the algorithm first finds the correlation of the current frame with a reference image (defined prior to entering the loop). The reference image should be an image which represents what is displayed between the trials (see Figure 4). To find the start and end of trials, we calculate the correlation between the current frame in the video and the reference frame. However, in our case a movie file was presented between trials. This meant that we could not rely on an exact match with just one reference image in detecting the separation movie.
The more practical solution was to choose the correlation threshold to be 0.8 instead of 1 (chosen by trial and error). This technique required only one reference image which is relatively similar to the frames of the separation movie. The choice of a threshold of less than 1 also accounted for any artifacts present in the frames. If the correlation is indeed less than 0.8, then we have detected the start frame of the first trial. Next, the algorithm calls another routine which recognizes and extracts the sequence number present in the current frame and stores it in the start frames array. The next step in the algorithm sets start_flag equal to 0 until we find the end frame of the trial. If start_flag equals 0, the algorithm finds the correlation between the current frame and the reference frame. If the correlation is larger than or equal to 0.8, it decides that the image is indeed the start of the separation movie. In this case, the algorithm calls the digit recognition routine to extract the sequence number present in the previous frame rather than the current frame, because the previous frame was the last frame of the trial and the current frame is not included in the trial. The algorithm then stores the digit recognition result in the end frames array.
Afterwards, start_flag is set to 1 so that the program searches for the start frame of the next trial, and so on until it detects the end frame of the last trial. Finally, the algorithm outputs the two arrays (the start frame array and the end frame array), each on a separate line of an output file chosen by the user as an argument to the algorithm. Table 1 is an example of the output. The first row represents the start frames and the second represents the end frames of the different trials in order. In this case there were three trials in the experiment.

Figure 3. Phase I flow chart showing the basic steps in detecting the start and end of each trial. First, the algorithm checks whether the variable start_flag is equal to 1. If it is, it searches for a start frame; otherwise it searches for an end frame. After detecting the start or end frame, the algorithm recognizes the sequence digits and stores them in an array.
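The detection loop of Phase I can be sketched in Python (the original system was implemented in MATLAB, so this is a re-implementation sketch, not the authors' code). `read_frame_number` stands in for the digit recognition routine of Figure 8, and the Pearson correlation coefficient is used as the frame-similarity measure with the 0.8 threshold from the text.

```python
import numpy as np

THRESHOLD = 0.8  # correlation cutoff, chosen by trial and error in the text

def corr(a, b):
    """Pearson correlation coefficient between two equal-size grayscale images."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def find_trials(frames, reference, read_frame_number):
    """Return (start_frames, end_frames): overlaid sequence numbers per trial.

    `frames` is the ordered list of decomposed video frames, `reference` is
    the attention-getter reference image, and `read_frame_number` is a
    placeholder for the digit recognition routine.
    """
    starts, ends = [], []
    start_flag = True  # True = looking for a trial start frame
    for i, frame in enumerate(frames):
        similar = corr(frame, reference) >= THRESHOLD
        if start_flag and not similar:
            # Separation movie ended: current frame starts a trial.
            starts.append(read_frame_number(frame))
            start_flag = False
        elif not start_flag and similar:
            # Separation movie resumed: previous frame ended the trial.
            ends.append(read_frame_number(frames[i - 1]))
            start_flag = True
    return starts, ends
```

In practice the frames would be loaded one JPEG at a time rather than held in memory at once; the flag logic is unchanged either way.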
Figure 4. The reference image represents the image displayed in between the trials. We are able to find the start and end frame of each trial by calculating the correlation between the current frame and this reference image. If the correlation is less than 0.8, then we have detected a start frame.

              Trial 1   Trial 2   Trial 3
Start Frames    12056     14657     17569
End Frames      13503     15798     20033

Table 1. Output of Start and End Frames for 3 Trials

Digit recognition

Digit detection and extraction from a particular frame are included in Figure 3. In our system, we had 5 digits superimposed on each video frame, with each frame corresponding to a row of data in the eye movement data file. These frame numbers are superimposed by an external hardware device (called a video frame overlay) which takes two inputs, the video output and the data stream from the eye tracking system, and produces one output, the video with the superimposed frame numbers. The idea behind using such a device is to synchronize the video output with the eye movement data. By reading the frame number from the video frame, one can find the corresponding row of data in the eye movement file. In our solution, we use image processing techniques to detect the different trials in the video, and then go back to the data file and mark the lines corresponding to each trial in order to perform the analysis on each trial separately.
Figure 5 shows an example of a sub-image taken from a video frame containing the superimposed frame numbers. We chose this frame to illustrate a serious problem: two frame numbers are overlapped on the same frame. This is caused by the down-sampling of the video due to the mismatch between the frame grabber capturing video at 30 Hz and the eye tracker outputting frames at 60 Hz. Because the video is interlaced, the odd rows of the image are captured in the first pass of the image capture while the even rows are captured in the second pass. One digit comes from the odd pass while the other comes from the even pass. We should note, however, that this is a worst-case scenario, since in most images only the last digit (far right) of the sequence will be an overlap of two consecutive digits. The second-to-last digit will appear as an overlap every 5 frames, the third-to-last every 25, and so on.

Figure 5. Superimposed frame numbers on a frame: output frame numbers from the eye tracker are superimposed on a frame. This figure shows the worst-case scenario, where two frame numbers are overlapped on the same frame.

To solve this issue, for each frame we extract the odd rows and extract the digits from this sub-image. There will still be an error in detecting the exact row which represents either the start or end frame of that trial, but the maximum error would be 2 frames in each trial, which is negligible compared to the total number of frames in each trial. Figure 6 shows the even and odd rows extracted from the sub-image in Figure 5.

Figure 6. Odd and even row images extracted from the image in Figure 5. (a) A frame number in the odd rows. (b) A frame number in the even rows.
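Separating the two interlaced fields is a one-line slicing operation; a minimal Python sketch (the original code was MATLAB):

```python
import numpy as np

def split_fields(frame):
    """Split an interlaced frame into its two fields.

    The text's "odd rows" (1-indexed rows 1, 3, 5, ...) are array rows
    0, 2, 4, ... with 0-based indexing. Only the odd field is passed on
    to digit recognition, at a worst-case cost of 2 frames per trial
    boundary.
    """
    frame = np.asarray(frame)
    return frame[0::2, :], frame[1::2, :]
```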
In order to recognize all digits in a certain frame, we propose an algorithm which uses template-matching theory to recognize one digit at a time and output an array of five elements representing the frame sequence (see Figures 7 and 8). Since the digits in the frames are not located in exactly the same position in each experiment (possibly because of JPEG artifacts or random errors in the image capture process), the algorithm first searches for the area in which each digit is located.

Figure 7. Template images used in the template-matching algorithm to recognize each digit of a frame number.
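The per-digit template matching can be sketched as follows. This is a Python illustration under assumed fixed-size templates, not the authors' MATLAB implementation: each candidate window is correlated against all ten digit templates, and the window/template pair with the highest correlation wins, matching the search-then-recognize strategy described for Figure 8.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two equal-size grayscale patches."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def best_digit(patch, templates):
    """Return the digit 0-9 whose template correlates best with `patch`."""
    return int(np.argmax([corr(patch, t) for t in templates]))

def locate_and_read(strip, templates):
    """Slide a template-sized window across `strip`; return (column, digit)
    for the window/template pair with the highest correlation."""
    th, tw = templates[0].shape
    best_score, best_col, best_dig = -2.0, 0, 0
    for x in range(strip.shape[1] - tw + 1):
        patch = strip[:th, x:x + tw]
        for d, t in enumerate(templates):
            c = corr(patch, t)
            if c > best_score:
                best_score, best_col, best_dig = c, x, d
    return best_col, best_dig
```

Because the Pearson coefficient is invariant to brightness and contrast changes, the same templates work even when the overlay's intensity varies between experiments.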
Figure 8. The digit recognition algorithm used to recognize each digit of a frame number. First, the algorithm searches for the area in which each digit is located. Then the variable digit_sequence specifies how many digits need to be recognized. Note that, as an optimization, not all digits are recognized all the time.
Our approach was to search for the area that had the highest correlation with each of the digit template images. That is, the correlation between each area and the image templates is calculated, and the highest correlation with any digit is stored for each candidate region. Across all the areas, the one with the highest of these correlations is chosen to be the correct digit. This is done for each of the five digits in order to find all the exact regions in each frame.

This solution was very accurate in recognizing all five digits. However, the processing time turned out to be much longer than we expected, so to optimize the algorithm we made the program extract all five digits only when needed. As mentioned above, each frame contained an overlap of two sequence numbers. So if a frame was overlapped by the numbers 00001 and 00002, then the following frame will most likely be stamped with an overlap of 00003 and 00004, unless one or more frames were lost during the frame capture process. Thus, we could use an approximation method to calculate the number on each frame. In other words, the program could be modified to extract the digits from one frame and then use an offset to determine the frame sequence number on any subsequent frame. However, we found that this approximation method might not always yield the exact number on the frame, due to frames dropped during capture. For example, sometimes the approximation was 00100 but the exact number was 00114. The difference between these two numbers varies depending on how many frames were dropped during video capture; if 7 frames were dropped, the difference would be 14. To account for the loss in captured frames and still reduce the running time of the algorithm, we check the difference between the approximation and the recognition of the last two digits.
If the error is less than α, the approximation is added to the recognition of the first frame in that trial. If the error is greater than or equal to α, then we perform recognition of all five digits using the algorithm in Figure 8, where α is defined as twice the maximum number of frames lost in each trial. Assuming that at most 25 frames are lost, α = 25 * 2 = 50 (each frame is overlapped by two numbers). If the difference between the approximation and the recognition of these last two digits is less than 50, there is no chance for the other digits to differ. An example in which the difference could be greater than 50 is where the approximation is 1196 and the recognition is 04. Thus 96 - 04 = 92 (comparing only the last two digits), so after performing recognition of all five digits the exact number would be 1204. With this modification, the digit recognition algorithm described in Figure 8 usually needs to recognize only two digits instead of five, which improved the processing speed by up to 60%.

Data Alignment and Analysis

The next step after detecting the start and end frames of each trial is to mark the rows in the eye movement data file according to the trial number. After this step, the data are ready for analysis using any eye tracking analysis program capable of handling text files as data input. An example of such analysis software is ILAB (Gitelman, 2002).
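The row-marking step can be sketched as follows. The exact layout of the eye movement data file is not specified in the text, so this Python sketch simply labels each row index in memory; writing the labels back out would depend on the file format.

```python
def mark_trials(n_rows, start_frames, end_frames):
    """Label each eye-data row with its 1-based trial number, or None.

    start_frames[i] / end_frames[i] are the overlaid frame numbers bounding
    trial i + 1; each frame number indexes one 1/60-s row of eye movement
    data. Rows between trials (the attention getter) stay unlabeled.
    """
    labels = [None] * n_rows
    for trial, (s, e) in enumerate(zip(start_frames, end_frames), start=1):
        for row in range(s, min(e, n_rows - 1) + 1):
            labels[row] = trial
    return labels
```

With the rows labeled per trial, per-trial statistics (or export to a tool such as ILAB) reduce to grouping rows by label.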
After obtaining the fixation output file, we use it to calculate the fixations on predefined areas of interest (AOIs). The idea is simple: detect the coordinates of each fixation, and then determine in which area of interest it falls. This is done per trial because we are interested in analyzing each trial separately.

Future Directions

In the system solution proposed, we used MATLAB, which is known to be simple and powerful, though relatively slow in comparison to other programming languages. Because of this, we believe that migrating from MATLAB to a faster programming language (such as C/C++) would be an important improvement and would save execution time. Another important future development would be to convert the software to a real-time application so that it detects the trials online. This could also lead to the development of software programs that control the stimulus according to the eye movements of the subject.

References

Cohen, L. B., Atkinson, D. J., & Chaput, H. H. (2004). Habit X: A new program for obtaining and organizing data in infant perception and cognition studies (Version 1.0). Austin: University of Texas.

Gitelman, D. R. (2002). ILAB: A program for post experimental eye movement analysis. Behavior Research Methods, Instruments and Computers, 34(4), 605-612.

Gonzalez, R. C., & Woods, R. E. (2000). Digital Image Processing (2nd ed.). Prentice Hall.

Jacob, R. J. K. (1991). The use of eye movements in human computer interaction techniques: What you look at is what you get. ACM Transactions on Information Systems, 9(3), 152-169.

Jacob, R. J. K. (1995). Eye tracking in advanced interface design. In Barfield, W., & Furness, T. (Eds.), Advanced Interface Design and Virtual Environments, 258-288. Oxford: Oxford University Press.

Pelz, B. J., Canosa, L. R., Kucharczyk, D., Babcock, J., Silver, A., & Konno, D. (2000). Portable eyetracking: A study of natural eye movements.
Proceedings of the SPIE, Human Vision and Electronic Imaging. San Jose, CA: SPIE.

Young, L., & Sheena, D. (1975). Survey of eye movement recording methods. Behavior Research Methods and Instrumentation, 7, 397-429.