Week 11 Lecture 2: Analyze your data: Descriptive Statistics, Correct by Taking Log
Instructor: Eakta Jain
CIS 6930, Research Methods for Human-centered Computing
Scribe: Chris (Yunhao) Wan, UFID: 1677-3116
University of Florida, Spring 2015

After we have designed our experiment and collected the data we want, the next step is to analyze that data. In Human-Centered Computing, we use statistical procedures for hypothesis testing. In this lecture, we will talk about descriptive statistics.

1 Descriptive Statistics

Descriptive statistics is the analysis of data that helps describe the data in a meaningful way. However, descriptive statistics do not allow us to draw conclusions beyond the data we have analyzed, or to reach conclusions about any hypotheses we might have made. They are just a way to describe the data.

In this lecture, we use R to demonstrate several descriptive statistics: mean, standard deviation, minimum, maximum, and the box plot. Suppose we have collected data like that shown in Table 1. Before going into the details, we can use two simple R commands: names(data) lists the names of the dimensions (columns), and dim(data) reports the number of records (rows) and the number of dimensions (columns). These two commands help us check whether the data has been loaded correctly.

Participant Test Condition Result A B C
1 Yes 16.9 5 123
2 No 11.7 1 456
3 No 23.9 23 789
4 Yes 16 4 102
...
Table 1: Sample Experiment Data

1.1 Mean

The mean is the arithmetic average of all values. It is computed by dividing the sum of the values by the number of values. In R, the command mean(data) returns the arithmetic mean of a set of numeric data. For example, for the data in Table 1, we can type mean(data[,5]) to get the mean of the Result, which is the dependent variable in that experiment; the index 5 must match the position of the Result column in the loaded data.
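As a minimal sketch of the loading checks and the mean described above, the following R snippet builds a small data frame by hand. The three-column layout and the column names (Participant, Condition, Result) are assumptions made for illustration; only the four Result values are taken from the visible rows of Table 1. It selects the column by name, which avoids depending on the column position.

```r
# Hypothetical data frame mirroring the visible rows of Table 1.
# The column names and the three-column layout are assumptions.
data <- data.frame(
  Participant = c(1, 2, 3, 4),
  Condition   = c("Yes", "No", "No", "Yes"),
  Result      = c(16.9, 11.7, 23.9, 16)
)

# Sanity checks after loading: column names, then rows and columns.
column_names <- names(data)
dimensions   <- dim(data)
print(column_names)
print(dimensions)

# Mean of the dependent variable, selected by name instead of by index.
result_mean <- mean(data$Result)

# The same value from the definition: sum of values / number of values.
manual_mean <- sum(data$Result) / length(data$Result)

print(result_mean)
print(manual_mean)
```

Selecting the column as data$Result gives the same value as a positional index such as data[,3] in this sketch, but it keeps working if columns are reordered.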

1.2 Standard Deviation

To understand the standard deviation, we first need the definition of the variance, which measures variation as the average of the squared deviations about the mean. The standard deviation is the square root of the variance. To calculate the standard deviation, follow these steps (we still use Table 1 as our example):

1. Determine the mean of the Result.
2. Subtract the mean from each Result to get the deviation value for that item.
3. Square each deviation value and multiply it by the frequency of that Result, to account for the total number of results.
4. Sum the products from the previous step to get the total of all squared deviation values, and divide by N - 1, where N denotes the number of results. This gives the variance.
5. Take the square root of the variance from the previous step.

In R, the command sd(data) calculates the standard deviation. In our example, the command is sd(data[,5]).

1.3 Minimum and Maximum

The minimum and maximum are the smallest and the largest values in a set of data. In R, the commands min(data) and max(data) return these two values.

1.4 Box Plot

A box plot, also known as a box-and-whisker plot, is a way to present quantitative data visually. Figure 1 is an example of a box plot.

Figure 1: Box Plot example

The box plot splits the data set into quartiles. The body of the box goes from the first quartile (Q1) to the third quartile (Q3). The line inside the box denotes Q2, which is the
median of the data set. The two horizontal lines that extend from the start and the end of the box are called whiskers. The front whisker starts at the smallest non-outlier in the data set and ends at Q1. The back whisker goes from Q3 to the largest non-outlier. If the data set includes outliers, they are plotted as individual points beyond the whiskers, so they are easy to see in a box plot. In our example, there are three outliers. In R, the command boxplot(data) draws the box plot. These statistics help us perform a sanity check on the data.

2 Normal Distribution and Skewness

The normal distribution, also known as the bell curve, is an important distribution in statistics. A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. By contrast, a data set is skewed when its values are not spread symmetrically, which means more of the data lies either to the left or to the right of the mean. In R, the command hist(data) plots a histogram of a set of data so that we can look at its distribution. Comparing the histogram with a normal distribution shows whether the data is skewed.

Figure 2: An example of Normal Distribution [bc]

3 Project Updates and Miscellaneous

Before we go further with the analysis of our experiment results, we should first cleanse the data. How the data is cleansed depends on several factors of the experiment. Some researchers cleanse the data manually; others write a script.

3.1 More Examples

3.1.1 Example in Class

Here is one example that goes over all the concepts we just discussed.

Example: 10 students participated in an experiment that required them to use gestures to complete a series of interface control operations. The purpose of this experiment is to find out which control device, Leapmotion or Kinect, is more appropriate for the designed
user interface. The dependent variable is the time the participants took to finish all tasks. The collected data is shown in Table 2:

Participant   Device       Time
1             Leapmotion   16.9
2             Leapmotion   58.1
3             Leapmotion   43.2
4             Leapmotion   38
5             Leapmotion   42
6             Kinect       51
7             Kinect       63.5
8             Kinect       77
9             Kinect       35
10            Kinect       68
Table 2: Sample Experiment Data

Using R and the descriptive statistics discussed above, we get the following statistics:

Condition    Mean    Standard Deviation   Maximum   Minimum
Kinect       58.9    16.3187              77        35
Leapmotion   39.64   14.82238             58.1      16.9
All          49.27   17.86176             77        16.9
Table 3: Descriptive Statistics from Sample Data

The following figure is the box plot for the collected data:

Figure 3: Box Plot for Sample Data (one box each for Kinect and Leapmotion; axis ticks from 20 to 70)
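
The statistics in Table 3 can be reproduced from the Table 2 data with a few lines of R. The sketch below is one possible way to do it, not necessarily the commands used in class: the helper names manual_sd and describe are hypothetical, and manual_sd simply walks through the five steps of Section 1.2, treating each time as occurring once (frequency 1).

```r
# Data transcribed from Table 2 (time to finish all tasks).
results <- data.frame(
  Participant = 1:10,
  Device = rep(c("Leapmotion", "Kinect"), each = 5),
  Time = c(16.9, 58.1, 43.2, 38, 42, 51, 63.5, 77, 35, 68)
)

# Standard deviation following the steps of Section 1.2.
# Hypothetical helper; every value has frequency 1 here.
manual_sd <- function(x) {
  deviations <- x - mean(x)                      # steps 1 and 2
  squared    <- deviations^2                     # step 3
  variance   <- sum(squared) / (length(x) - 1)   # step 4: divide by N - 1
  sqrt(variance)                                 # step 5
}

# Hypothetical helper returning the four statistics reported in Table 3.
describe <- function(x) {
  c(Mean = mean(x), SD = sd(x), Max = max(x), Min = min(x))
}

# One row per device, plus one row for all participants together.
by_device     <- t(sapply(split(results$Time, results$Device), describe))
summary_table <- rbind(by_device, All = describe(results$Time))
print(round(summary_table, 5))

# The built-in sd() and the step-by-step version should agree.
print(all.equal(manual_sd(results$Time), sd(results$Time)))
```

Because sd() divides by N - 1, as in step 4, its result matches the step-by-step calculation.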

3.1.2 Additional Example

In the auditory display experiment "Sound Sample Detection and Numerosity Estimation Using Auditory Display", one of the tasks has the subject listen to 80 rounds of sound clips. Each sound clip is composed of 6 different samples, each of which can be either a spoken word or a musical instrument. In this experiment, the independent variable is the SOA (Stimulus Onset Asynchrony), which is the separation in time between the starts of playback of different auditory stimuli. In each round, the SOA value is chosen from 50 ms, 100 ms, 200 ms, and 400 ms. Before the experiment starts, the participant is given one speech key sample and one instrument key sample to use throughout the experiment. After each round, the participant is asked to count the number of times the key sample appeared. In the sound clips, the key sample appears from 1 to 7 times. We record the number the user counted and compute its distance from the accurate number with the formula: |#count - #appeared|. The following table shows part of the final results:

SOA(ms)   Round   ID   key    NumKeyAppear   Answer   Distance
...
100       22      3    dog    5              2        3
50        70      3    dog    4              3        1
200       5       5    drum   3              2        1
200       21      5    drum   3              3        0
100       22      6    mad    2              1        1
50        70      6    mad    2              1        1
400       5       8    horn   3              2        1
50        21      8    horn   2              2        0
200       22      13   dog    5              1        4
400       70      13   dog    4              3        1
100       4       14   horn   3              2        1
200       20      14   horn   2              2        0
50        36      14   horn   4              3        1
...
Table 4: Sample Experiment Data

Using R and the descriptive statistics discussed above, we get the following statistics of user performance, measured here by the distance:

SOA     mean         standard deviation   max          min
50ms    0.16935960   0.12970330           0.42105263   0.00000000
100ms   0.08986529   0.13979386           0.50000000   0.00000000
200ms   0.03950224   0.06455598           0.18421053   0.00000000
400ms   0.03208665   0.03987777           0.0882352    0.00000000
Table 5: Descriptive Statistics from Numerosity Task

The following figure is the box plot for these statistics:

Figure 4: Box Plot for Numerosity Task

3.2 Requirements for Project Updates

1. Plot the data for the experiment results and compare it to a normal distribution to check whether it is skewed.
2. Plot the data for each test condition and check whether it is normally distributed.
3. Draw a box plot for each test condition.
4. In the project update report, include a table with the mean, standard deviation, maximum, and minimum.

4 Summary

Descriptive statistics are very important: when an experiment produces a lot of data, it is hard to understand and visualize what the data shows. Descriptive statistics let us present the data in a more meaningful way and interpret it more simply. They also let us compare the data with a normal distribution to check whether the data is skewed.

References

[bc] Bell curve. Math Is Fun Website. http://www.mathsisfun.com/definitions/normal-distribution.html