6.835 Final Project: A gestural and affective interface for capturing emotion
Cassandra Xia
May 17, 2013

Introduction

Affect induction

Affect induction seeks to induce changes in mood or emotion through use of a stimulus. In daily life, it would be useful to have mechanisms for changing mood when it falls out of a desired state.

It has been shown that environments subtly influence mood. For instance, it is possible to induce positive mood in laboratory settings by giving the subject a bag of candy at the beginning of the research study [Isen, 1994]. Other experiments have suggested subtle priming effects: holding a warm or cold beverage may influence subsequent ratings of unrelated stimuli, such as the perceived warmth of other people [Williams, 2008]. Given that mood is subject to arbitrary external forces, is there a way for us to take control back into our own hands?

There has been some positive research in this direction. For instance, the facial feedback hypothesis posits that smiling is enough to induce happiness. Strack et al. demonstrated this effect by asking subjects to hold a pen in their teeth in a way that forces a smile, and found that those subjects subsequently rated comics as funnier [Strack, 1988]. The same study showed that asking subjects to hold a pen with their lips, which forces a frown, lowered their funniness ratings of the comics.

My project draws inspiration from the Strack et al. paper and from experiments correlating vision with sensation. For instance, when we see a rubber hand being stroked with a paintbrush at the exact same time that we experience our own hand being stroked with a paintbrush, we begin to associate the rubber hand with our body [Botvinick, 1998]. Another experiment showed that when we fully extend our arm in front of us and tap the air at the same rate as someone else taps our nose, many people begin to feel that their nose is several feet long [Ramachandran, 1999]. Both experiments demonstrate how quickly the body is willing to correlate and attribute its internal state to external stimuli.

Proposal

Inspired by the experiments described above, my term project tested the idea of capturing emotion in the sensations induced by a haptic device. When a user experiences an emotion she would like to capture, she uses the device to initiate a haptic sensation. After the device is used in this way multiple times, the user's body may begin to correlate that haptic sensation with the emotion. If the emotion has been successfully captured, it should then be possible to induce the emotion by simply providing the body with the haptic sensation.
In order to test this hypothesis, I made four pairs of vibrating gloves. The gloves were made out of conductive fabric, conductive thread, vibration motors, and batteries. Touching the gloves together completes the circuit and triggers the vibration.

Figure 1: Homebrewed vibration gloves

Initial User Study

Experimental Method

I ran an initial user study consisting of 20 subjects (11 control and 9 experimental). Subjects were not compensated for their participation.

Subjects under the experimental condition watched the 90-minute film Moonrise Kingdom and were instructed to clap their hands whenever they found themselves smiling or laughing during the movie. Four of these subjects were given vibrating gloves, which vibrated when they clapped their hands together; the other five wore normal, non-vibrating white gloves. Subjects were not told the purpose of the experiment. Instead, they were told that the experiment measured the effect of social dynamics on which movie content is perceived as funny. In actuality, the movie served as a training period in which subjects would begin to associate the gesture of clapping with the feeling of happiness.

After watching the movie, subjects placed their fingertips on the electrodes of an Affectiva Q Sensor [Picard, 2011] to obtain a baseline electrodermal activity (EDA) reading. Each subject was then asked to clap his or her hands for five seconds, after which a second EDA reading was obtained. EDA measures the arousal level of the subject (i.e., how excited the subject is). Subjects also filled out a post-study questionnaire that asked approximately how many times they clapped during the movie, among other questions meant to distract from the real purpose of the experiment. All questionnaires used for the studies in this project are given in the zip submission.

Figure 2: Affectiva Q Sensor for EDA readings
Subjects in the control group did not watch the movie (i.e., they had no training period). Control subjects placed their fingertips on the Affectiva Q Sensor to obtain a baseline EDA reading, clapped their hands for five seconds, and then had a second EDA reading taken.

Results

Figure 3 and Figure 4 present EDA sensor data from the experimental and control subjects, respectively. Each pair of EDA peaks belongs to one subject: the first peak in the pair is that person's baseline reading, and the second peak is the reading obtained after the person clapped his or her hands. The data should not be read on an absolute scale because different people have different EDA baselines; first peaks should only be compared to second peaks for the same person. Some of the sensor data for the control subjects were corrupt (missing sections), and those readings were discarded in the calculations for Table 1.

Figure 3: EDA readings for experimental subjects (training time + vibration gloves or normal gloves)

Figure 4: EDA readings for control subjects (no training time + no gloves)

Condition                        % of subjects with higher second EDA reading   Number of subjects in group
Movie + vibrating gloves         75%                                            4
Movie + non-vibrating gloves     60%                                            5
Control (no movie + no gloves)   45%                                            11

Table 1: Summarized EDA readings for Study 1
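The percentages in Table 1 come from checking, for each subject, whether the post-clap EDA peak exceeded the baseline peak, and grouping subjects by condition. A minimal sketch of this computation is below; the per-subject peak values shown are hypothetical placeholders, not the actual study data.

from collections import defaultdict

# (condition, baseline EDA peak, post-clap EDA peak) per subject -- hypothetical values
readings = [
    ("movie + vibrating gloves",      0.41, 0.63),
    ("movie + vibrating gloves",      1.10, 0.95),
    ("movie + non-vibrating gloves",  0.30, 0.44),
    ("control (no movie, no gloves)", 0.52, 0.49),
]

higher = defaultdict(int)   # subjects whose second reading exceeded the baseline
total = defaultdict(int)    # subjects per condition
for condition, baseline, post_clap in readings:
    total[condition] += 1
    if post_clap > baseline:
        higher[condition] += 1

for condition in total:
    pct = 100.0 * higher[condition] / total[condition]
    print(f"{condition}: {pct:.0f}% of {total[condition]} subjects had a higher second reading")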
Although the size of this study is not large enough to make definitive statements, the results seem to suggest that subjects who had been trained to clap at happy points during the movie had a larger emotional response when subsequently asked to clap. However, there were a number of problems with this study:

- EDA measures arousal, not valence.
- EDA readings are sensitive to positioning and activity.
- Subjects need to watch movies alone to avoid group influences.
- The groups used for the experiment were imperfect. The ideal control would have watched the movie but not clapped.

Post-study feedback collected from subjects revealed that:

- Some people found the vibration of the gloves to be negative feedback! Indeed, according to their own estimates, the people who were given vibrating gloves clapped significantly less than those who had been given non-vibrating gloves.
- Some people felt self-conscious about clapping in a group.

Second User Study

In order to address the problems from the first user study, I made several changes before running a second user study:

- Training content was changed from a 90-minute movie to shorter animated clips, since it was difficult to find subjects willing to watch an entire movie.
- Subjects were paid $10 in Amazon credit and recruited via mailing lists instead of by word-of-mouth through friends.
- EDA readings were no longer used because they measure arousal (excitement level), whereas I am actually interested in valence (positive versus negative).
- Instead of EDA readings, I recorded facial reaction clips and planned to use UCSD's Computer Expression Recognition Toolbox (CERT) [Littlewort, 2011] to quantify facial expression in terms of the action units of Ekman's Facial Action Coding System (FACS) [Ekman, 1997] and perceived emotion.
- In addition to measuring facial emotion, I asked users to rate the funniness of comics as another proxy for affect: affect was inferred from how funny users found certain comics. This was the same method Strack et al. used for testing the facial feedback hypothesis.

Following the methodology of Strack et al., I selected several Far Side comics that were moderately funny. Far Side comics have the advantage of being single-pane comics that are quick to read. The funniness of Far Side humor seems fairly universal, unlike comics such as xkcd or PhD Comics, which may be funnier for people with more education.
Figure 5: Four of the eight Far Side comics used

Experimental Method

Subjects were invited to come to the lab alone to participate in the user study. Subjects were told an improved cover story: that they would be participating in an experiment to help code the emotional content of videos. Instructions were given to corroborate this story: keep your head faced frontally toward the webcam, make sure your hair is out of your face, and so on. Subjects were told that, in order to assist with the emotion coding, they should signal the emotional high points during the videos with body gestures. Some subjects were asked to (1) hug themselves when they felt <HAPPY> and (2) clasp their hands together when they felt <SAD>. Other subjects were asked to do the opposite: (1) hug themselves when they felt <SAD> and (2) clasp their hands when they felt <HAPPY>.

Subjects were seated alone to watch two animated short videos while being recorded by a webcam: Sintel (2010) and Paperman (2012). Sintel is a dark movie in which a girl ends up killing her pet dragon. Paperman is a love story about how magical paper airplanes end up reuniting a man and a woman who met on a train platform. Subjects watched Sintel (12 minutes) first and Paperman (6 minutes) second.

Figure 6: Screenshots from Sintel and Paperman

After watching the clips, subjects were asked to repeat the first of the two gestures in front of the webcam and to hold the gesture for several seconds. The subject was then given the first set of four comics to rate on funniness. This process was repeated for the second gesture. The study was counterbalanced so that, across subjects, each set of comics was seen both after the happy gesture and after the sad gesture. This was meant to cancel out any intrinsic variation in the funniness of the two sets of comics.
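To make this counterbalancing concrete, here is a minimal sketch of one way to randomize, per subject, which comic set follows which gesture; the subject IDs and set labels are hypothetical placeholders.

import random

subjects = ["S01", "S02", "S03", "S04", "S05", "S06"]
assignments = {}
for subject in subjects:
    comic_sets = ["set A", "set B"]
    random.shuffle(comic_sets)
    # The first set is rated after the happy gesture, the second after the sad gesture.
    assignments[subject] = {"happy": comic_sets[0], "sad": comic_sets[1]}

for subject, mapping in assignments.items():
    print(subject, mapping)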
The hypotheses were that:

- Subjects trained to gesture for happy events would display more happy emotion when repeating that gesture after the movie.
- Subjects trained to gesture for negative events would display more negative emotion when repeating that gesture after the movie.
- Relative to the inherent funniness of each set of comics, subjects would rate comics as funnier after performing the happy gesture than after performing the sad gesture.

Results

Based on how subjects rated the funniness of the comics, I tried to predict whether each rating was given after performing the happy gesture or the sad gesture. To do this, I calculated the difference between the subject-assigned funniness of each of the eight comics and the average funniness of the eight comics. I then calculated whether the sum of these rating differences was greater for the first set of comics or for the second. Under the initial hypothesis, I assumed that the more positively rated set of comics was associated with the happy gesture. The accuracy that this classification achieved was not significantly different from random (accuracy = 64%; n = 14, p = 0.2120). The Excel calculations are in this zipped submission in data.xlsx.

Although I had planned to analyze subjects' facial reactions using CERT, which provides frame-by-frame analysis of FACS action units and facial emotion, CERT had a number of problems. For one, CERT was not able to find trackers on dark-skinned individuals. The problem was partially solved by illuminating the face with strong light head-on, although this solution would be uncomfortable for subjects; after trying it on myself for a couple of minutes, I had black patches in my field of vision from the overstimulated retinal cells. Even when CERT was able to track individuals, it would sometimes find features in incorrect regions of the face or on inanimate objects in the background, as shown in Figures 7a and 7b. Additionally, CERT was very rarely able to find all of the features it wanted to track, so features would appear and disappear over the course of a clip.

Figure 7a: CERT incorrectly found one of the lip corner trackers on the side of the subject's face.
Figure 7b: CERT incorrectly found trackers on the subject's chest and in the background environment.

Instead of using CERT, I used a human coder (my housemate) who did not know the intent of the experiment. The human coder was simply asked to judge which of the two videos taken of each subject exhibited more positive emotion. Ideally, the experiment would have used multiple human coders and measured Cohen's kappa for agreement between them, but I did not have time. Based on the experimental hypothesis, I assumed that the video rated more positively would be the video of the subject performing the happy gesture. However, the classification achieved this way, although it approached significance, was not significantly different from random (accuracy = 69%; n = 13, p = 0.1334).

Although some subjects spontaneously smiled when performing their happy gesture, one subject smiled for her negative gesture. When questioned, she said that she felt awkward performing the gesture and so she smiled. Most subjects, however, maintained very neutral expressions while they performed both the happy and sad gestures. No visible difference in emotion would be perceivable from most videos of the same subject.

Videos of most subjects performing the gestures after training can be viewed at this link (2 GB): http://bit.ly/11kwlp7. Videos of subjects performing gestures during training were also captured and can be provided if needed (please email xiac@mit.edu to coordinate a transfer; the full file dump is > 30 GB, since the video needed to be captured in high-resolution .avi format for CERT processing). The ratings assigned by the human coder are included in this zipped submission in data.xlsx.

There were a number of problems that plagued this second study:

- Subjects were self-conscious when they saw themselves on the webcam. After I detected this problem, I addressed it by hiding the webcam out of view when filming them.
- The words <POSITIVE> and <NEGATIVE> would serve better than <HAPPY> and <SAD> when asking subjects to describe their emotion. Multiple subjects commented that they experienced other emotions during the films beyond happy or sad, such as anxious, worried, and scared.
- While the comics did not show any strong correlation with the valence of the gestures, it was unclear whether this was because (a) the subject did not hold the gesture while rating the comics and the effect of the gesture was not strong enough to linger, or (b) there was actually no effect on emotion from
performing the gesture.

Upon further reflection, comics are NOT a good method for measuring affect. I believe the reason Strack et al. chose comics for their experiment is that there are not many other tests you can give subjects while they have a pen in their mouth! Since my subjects were not so restricted, there are likely better tests for valence, and I explore this in my redesigned third user study.

Third User Study

The third user study attempted to remedy some of the problems found in the second study. Instead of using comics to detect affect, I used a word association game. The word association game asks users to generate words as quickly as possible, each based on the immediately preceding word. The first word in the word association game is given to the subject and was either "smell" or "touch". These first words were chosen to be neutral words that could be interpreted either positively or negatively.

Instead of using short clips to induce affect, I used a 247-image slideshow in which each image was presented for 5 seconds. The images were all taken from the Geneva Affective Picture Database (GAPED) [Dan-Glauser, 2011]. All images from the sections P (positive) and A (animal mistreatment) were selected and presented in randomized order. Sections P and A were chosen because they were the two sections of GAPED with the highest and lowest valence, respectively. The full slideshow is available at: https://dl.dropboxusercontent.com/u/22304894/slideshow.7z

Figure 8: Example GAPED images from the slideshow used to induce emotion

Method

Subjects were invited to the lab individually to participate in a user study. Subjects were told a cover story about needing to train an emotion classifier: they would be looking at a slideshow of images designed to induce a wide variety of emotions, from very positive to very negative, and we would be filming their faces while they watched the slideshow, so they should gesture when they felt emotion during the slideshow. Some subjects were asked to hug themselves when they felt <POSITIVE> emotion and clasp their hands when they felt <NEGATIVE> emotion. Other subjects were told to perform the opposite gestures.

After watching the slideshow, subjects were told that we needed to take calibration clips. This was the cover story used to get post-training video clips from subjects. For the first type of calibration clip, the subject was
asked to face the camera and perform each of the two gestures for 7 seconds. For the next type of calibration clip, the subject was told that we wanted to view them performing the gestures under cognitive load, and that cognitive load would be induced via the word association game described previously. The subject was then asked to hold each of the two gestures while playing the word association game for 45 seconds. The initial words for the word association game were randomly assigned to the positive-gesture and negative-gesture rounds for each subject.

Half-way through the third user study, at the suggestion of my officemate, instead of giving instructions to the subjects in person, I made a video of myself giving the instructions and simply asked subjects to watch it. This guaranteed that the delivery of the instructions was uniform across users. It also allowed users more privacy, as I did not have to be standing there as they did their word association. Given that I am working in the domain of affect (and especially displays of affect), this consideration was important, since displays of affect are heavily influenced by social context.

Results

The size of this study was smaller than I anticipated due to the number of no-show subjects. I summarized this effect as the Decay Law for Experiment Subjects: if you schedule people for the same day, everyone shows up; if you schedule people for the next day, 25% no-show but will send you an email about it; if you schedule people two days ahead, 50% no-show and you will never hear from them again.

I had the same human coder view four video clips belonging to each of the five subjects: two 7-second clips of the subject performing each of the gestures after the slideshow training session, and two 45-second clips of the subject playing the word association game while holding each of the gestures. For the two 7-second clips, the human coder was instructed to pick the clip that displayed more positive affect in the face; that clip was assumed to show the positive gesture. For the two 45-second clips, the human coder was instructed to pick the word association round that used more positive words; that clip was assumed to correspond to the positive gesture. Although the human coder noticed that the subjects were performing gestures, he remained unaware of the intent of the experiment and did not know that the 7-second and 45-second clips might be correlated; he remained unaware of the correlation during questioning.

Using the human coder's ratings, the classifier based on the 45-second word association clips did not achieve a result significantly different from random (accuracy = 60%; n = 5, p = 0.5). The classifier based on the 7-second gesture clips was worse than random (accuracy = 40%; n = 5, p = 0.81).
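The accuracies and p-values reported for Studies 2 and 3 are consistent with a one-sided binomial test of the number of correct classifications against chance (0.5). The sketch below reproduces those p-values; the mapping from each reported accuracy to a correct-count (e.g., 64% of n = 14 corresponds to 9 correct) is a reconstruction rather than a value taken from the data files.

from math import comb

def binomial_p(correct: int, n: int, chance: float = 0.5) -> float:
    """Probability of getting at least `correct` successes out of n by chance."""
    return sum(comb(n, k) * chance**k * (1 - chance)**(n - k)
               for k in range(correct, n + 1))

tests = [
    ("Study 2: comic ratings",            9, 14),  # reported p = 0.2120
    ("Study 2: human coder",              9, 13),  # reported p = 0.1334
    ("Study 3: word association clips",   3, 5),   # reported p = 0.5
    ("Study 3: 7-second gesture clips",   2, 5),   # reported p = 0.81
]
for label, correct, n in tests:
    print(f"{label}: accuracy = {correct/n:.0%}, p = {binomial_p(correct, n):.4f}")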
Experience

For future user studies, I would like to return to the idea of capturing emotion in objects that give additional sensation, rather than in gestures alone. However, after spending $250+ and 40+ hours running user studies without obtaining statistically significant results, I learned something important about user studies that I would need to work through before planning a follow-up.

Before starting my 6.835 final project, I had believed that user studies were science, in the sense that I would run the study and see whether the result was there or not. But I now believe that user studies involving humans have far too many variables to take this approach. Even when running a simple experiment, small choices in wording, media selection, and measurement selection make huge differences. Additionally, it is important to control for possible ordering effects and other presentation factors. In order to control for these effects, the experiment must be run over large numbers of subjects. However, this is rarely possible, since human subjects are both expensive and tedious to run! There are simply too many variables to control for, and too few subjects available, to treat human user studies as science.

I have come to the opinion that user studies should be treated like math, in the sense that as the experimenter, you should design the experiment to prove an effect. This is likely a controversial viewpoint, but I think I would have to be personally convinced that an effect exists before embarking on another full-fledged user study.

Acknowledgements

I would like to thank Brandon Nardone for his invaluable help in recruiting study subjects, Javier Hernandez for use of the Q sensors and helpful methodological feedback, Dan McDuff for his affective computing guest lecture, Professor Pattie Maes for her idea of training users over a long session such as a movie, and Professor Randy Davis for patching up logical holes in my experimental groups. Any errors that remain are my own.

References

1. Botvinick, M., & Cohen, J. (1998). Rubber hands 'feel' touch that eyes see. Nature, 391(6669), 756.
2. Dan-Glauser, E. S., & Scherer, K. R. (2011). The Geneva affective picture database (GAPED): A new 730-picture database focusing on valence and normative significance. Behavior Research Methods, 43(2), 468-477.
3. Ekman, P., & Rosenberg, E. L. (1997). What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA.
4. Estrada, C. A., Isen, A. M., & Young, M. J. (1994). Positive affect improves creative problem solving and influences reported source of practice satisfaction in physicians. Motivation and Emotion, 18(4), 285-299.
5. Littlewort, G., Whitehill, J., Wu, T., Fasel, I., Frank, M., Movellan, J., & Bartlett, M. (2011). The computer expression recognition toolbox (CERT). In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on (pp. 298-305). IEEE.
6. Picard, R. W. (2011). Measuring affect in the wild. In Affective Computing and Intelligent Interaction (pp. 3-3). Springer Berlin Heidelberg.
7. Ramachandran, V. S. (1999). Phantoms in the brain: Probing the mysteries of the human mind. Harper Perennial.
8. Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of Personality and Social Psychology, 54(5), 768.
9. Williams, L. E., & Bargh, J. A. (2008). Experiencing physical warmth promotes interpersonal warmth. Science, 322(5901), 606-607.