Week 11 Lecture 2: Analyze your data: Descriptive Statistics, Correct by Taking Log




Instructor: Eakta Jain
CIS 6930, Research Methods for Human-centered Computing
Scribe: Chris (Yunhao) Wan, UFID: 1677-3116
University of Florida, Spring 2015

After we have designed our experiment and collected the data we want, the next step is to analyze that data. In Human-Centered Computing, we use statistical procedures for hypothesis testing. In this lecture, we talk about descriptive statistics.

1 Descriptive Statistics

Descriptive statistics is the analysis of data that helps describe the data in a meaningful way. However, descriptive statistics do not allow us to draw conclusions beyond the data we have analyzed, or to reach conclusions about any hypotheses we might have made. They are just a way to describe the data.

In this lecture we use R to demonstrate several descriptive statistics: mean, standard deviation, minimum, maximum, and the box plot. As an example, suppose we have collected the data shown in Table 1. Before going into the details, two simple R commands help us check whether the data has been loaded correctly: names(data) lists the column names, and dim(data) reports the number of records (rows) and the number of columns.

              Test Condition
Participant   A     B      C     Result
1             Yes   16.9   5     123
2             No    11.7   1     456
3             No    23.9   23    789
4             Yes   16     4     102
...           ...   ...    ...   ...

Table 1: Sample Experiment Data

1.1 Mean

The mean is the arithmetic average of all values. It is computed by dividing the sum of the values by the number of values. In R, the command mean(x) returns the arithmetic mean of a numeric vector x. For example, for the data in Table 1, we can type mean(data[,5]) to get the mean of Result, which is the dependent variable in that experiment.
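The loading checks and the mean can be tried on the four visible rows of Table 1. This is only a sketch: the data frame is typed in by hand, and the column names A, B, C and Result are assumptions made for illustration. In practice the data would be read from a file.

```r
# Hypothetical data frame holding the four rows visible in Table 1.
# The column names are assumptions made for this illustration.
data <- data.frame(
  Participant = c(1, 2, 3, 4),
  A           = c("Yes", "No", "No", "Yes"),
  B           = c(16.9, 11.7, 23.9, 16),
  C           = c(5, 1, 23, 4),
  Result      = c(123, 456, 789, 102)
)

# Sanity checks after loading
print(names(data))  # column names
print(dim(data))    # number of rows, then number of columns

# Mean of the fifth column (Result), computed with mean() and by hand
result_mean <- mean(data[, 5])
manual_mean <- sum(data[, 5]) / length(data[, 5])
print(result_mean)
print(manual_mean)
```

Both values should agree, since mean() is the sum of the values divided by how many there are.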

1.2 Standard Deviation

To understand standard deviation, we first need the definition of variance: a measure of variation computed as the average of the squared deviations of the values from their mean. The standard deviation is the square root of the variance. To calculate the standard deviation, follow these steps (we still use Table 1 as our example):

1. Determine the mean of the Result values.
2. Subtract the mean from each Result to get the deviation value for that item.
3. Square each deviation value and multiply it by the frequency of that Result, so that every result is counted.
4. Sum the products from the previous step to get the total of all squared deviation values, and divide by N - 1, where N is the number of results. This gives the variance.
5. Take the square root of the variance from the previous step.

In R, the command sd(x) calculates the standard deviation of a numeric vector x. In our example, the command is sd(data[,5]).

1.3 Minimum and Maximum

The minimum and maximum are the smallest and the largest values in a set of data. In R, the commands min(x) and max(x) return these two values.

1.4 Box Plot

A box plot, also known as a box-and-whisker plot, is a way to present quantitative data visually. Figure 1 is an example of a box plot.

Figure 1: Box Plot example

The box plot splits the data set into quartiles. The box, which is the body of the plot, goes from the first quartile (Q1) to the third quartile (Q3). The line inside the box denotes Q2, which is the median of the data set. The two horizontal lines that extend from the beginning and the end of the box are called whiskers. The front whisker starts at the smallest non-outlier in the data set and ends at Q1. The back whisker goes from Q3 to the largest non-outlier. If the data set includes outliers, they are plotted as individual points beyond the whiskers, which makes them easy to see. In our example, there are three outliers. In R, we can use the command boxplot(data) to draw the box plot.

These statistics help us do a sanity check on the data.

2 Normal Distribution and Skewness

The normal distribution, also known as the bell curve, is an important distribution in statistics. A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. A data set that departs from this symmetric shape is skewed, which means that more of the data lies on one side of the mean, either the left or the right. In R, we can use the command hist(x) to plot a set of data and look at its distribution. Comparing the histogram with a normal distribution shows whether the data is skewed.

Figure 2: An example of Normal Distribution [bc]

3 Project Updates and Miscellaneous

Before we go further in the analysis of our experiment results, we should first clean the data. How to clean the data depends on multiple factors of the experiment. Some people clean the data manually; others write a script.

3.1 More Examples

3.1.1 Example in Class

Here is one example that goes over all the concepts we just discussed.

Example: 10 students participated in an experiment that required them to use gestures to complete a series of interface control operations. The purpose of the experiment is to find out which control device, Leapmotion or Kinect, is more appropriate for the designed user interface. The dependent variable is the time they took to finish all tasks. The collected data is shown in Table 2.

Participant   Device       Time
1             Leapmotion   16.9
2             Leapmotion   58.1
3             Leapmotion   43.2
4             Leapmotion   38
5             Leapmotion   42
6             Kinect       51
7             Kinect       63.5
8             Kinect       77
9             Kinect       35
10            Kinect       68

Table 2: Sample Experiment Data

Using R and the descriptive statistics discussed above, we get the following:

Condition    Mean    Standard Deviation   Maximum   Minimum
Kinect       58.9    16.3187              77        35
Leapmotion   39.64   14.82238             58.1      16.9
All          49.27   17.86176             77        16.9

Table 3: Descriptive Statistics from Sample Data

Figure 3 is the box plot for the collected data.

Figure 3: Box Plot for Sample Data (time for each device: Kinect, Leapmotion)
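The statistics in Table 3 can be reproduced from the times in Table 2. The sketch below is one way to do it in R, not necessarily the script used in class; the data frame name times and the helper describe are illustrative names. It also follows the standard deviation steps by hand for the Leapmotion group, then draws the box plot and a histogram.

```r
# Times from Table 2 (the unit of time is not stated in the notes)
times <- data.frame(
  Participant = 1:10,
  Device      = rep(c("Leapmotion", "Kinect"), each = 5),
  Time        = c(16.9, 58.1, 43.2, 38, 42, 51, 63.5, 77, 35, 68)
)

# Standard deviation by hand for the Leapmotion group
leap       <- times$Time[times$Device == "Leapmotion"]
deviations <- leap - mean(leap)                        # steps 1 and 2
variance   <- sum(deviations^2) / (length(leap) - 1)   # steps 3 and 4
manual_sd  <- sqrt(variance)                           # step 5

# Hypothetical helper returning the four statistics reported in Table 3
describe <- function(x) {
  c(mean = mean(x), sd = sd(x), max = max(x), min = min(x))
}

# One column of statistics per device, plus the statistics for all data
by_device <- sapply(split(times$Time, times$Device), describe)
overall   <- describe(times$Time)

print(t(by_device))
print(overall)

# Box plot per device and a histogram of all times
boxplot(Time ~ Device, data = times, ylab = "Time")
hist(times$Time, main = "Time to finish all tasks", xlab = "Time")
```

Each group has only five values here, so the histogram is a rough check at best; with more participants it becomes a more useful comparison against the bell curve.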

3.1.2 Additional Example

In the auditory display experiment "Sound Sample Detection and Numerosity Estimation Using Auditory Display", one of the tasks is to let the subject hear 80 rounds of sound clips. Each sound clip is composed of 6 different samples, each of which can be either a spoken word or a musical instrument. In this experiment, the independent variable is SOA (Stimulus Onset Asynchrony), which is the separation in time between the start of the playback of different auditory stimuli. In each round, the SOA value is chosen from 50 ms, 100 ms, 200 ms, and 400 ms.

Before the experiment starts, the participant is given one speech key sample and one instrument key sample, which are used throughout the experiment. After each round, the participant is asked to count how many times the key sample appeared. In the sound clips, the key sample appears from 1 to 7 times. We record the number the user counted and compute how far it is from the correct number using the formula:

Distance = |#count - #appeared|

Table 4 shows part of the final results:

SOA(ms)   Round   ID   key    NumKeyAppear   Answer   Distance
...       ...     ...  ...    ...            ...      ...
100       22      3    dog    5              2        3
50        70      3    dog    4              3        1
200       5       5    drum   3              2        1
200       21      5    drum   3              3        0
100       22      6    mad    2              1        1
50        70      6    mad    2              1        1
400       5       8    horn   3              2        1
50        21      8    horn   2              2        0
200       22      13   dog    5              1        4
400       70      13   dog    4              3        1
100       4       14   horn   3              2        1
200       20      14   horn   2              2        0
50        36      14   horn   4              3        1
...       ...     ...  ...    ...            ...      ...

Table 4: Sample Experiment Data

Using R and the descriptive statistics discussed above, we get the following statistics for the users' performance, measured here by the distance:

SOA     mean         standard deviation   max          min
50ms    0.16935960   0.12970330           0.42105263   0.00000000
100ms   0.08986529   0.13979386           0.50000000   0.00000000
200ms   0.03950224   0.06455598           0.18421053   0.00000000
400ms   0.03208665   0.03987777           0.0882352    0.00000000

Table 5: Descriptive Statistics from Numerosity Task

Figure 4 shows the box plot for these statistics.
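The Distance column of Table 4 can be recomputed from the Answer and NumKeyAppear columns. The sketch below does this in R for the 13 rows shown in the excerpt and then averages the distance per SOA value; the data frame name trials is an illustrative choice. Because only part of the data is shown, the per-SOA means printed here are not expected to match Table 5.

```r
# The 13 rows visible in Table 4 (only the columns needed for the distance)
trials <- data.frame(
  SOA          = c(100, 50, 200, 200, 100, 50, 400, 50, 200, 400, 100, 200, 50),
  NumKeyAppear = c(5, 4, 3, 3, 2, 2, 3, 2, 5, 4, 3, 2, 4),
  Answer       = c(2, 3, 2, 3, 1, 1, 2, 2, 1, 3, 2, 2, 3)
)

# Distance between the counted number and the number of actual appearances
trials$Distance <- abs(trials$Answer - trials$NumKeyAppear)
print(trials)

# Mean distance for each SOA value, for this excerpt only
by_soa <- aggregate(Distance ~ SOA, data = trials, FUN = mean)
print(by_soa)
```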

Figure 4: Box Plot for Numerosity Task

3.2 Requirements for Project Updates

1. Plot the data for the experiment results and compare it to a normal distribution to check whether it is skewed.
2. Plot the data for each test condition and check whether it is normally distributed.
3. Draw a box plot for each test condition.
4. In the project update report, include a table with the mean, standard deviation, maximum, and minimum.

4 Summary

Descriptive statistics are important because, when an experiment produces a lot of data, it is hard to understand and visualize what the data shows. Descriptive statistics let us present and interpret the data in a simpler, more meaningful way. They also let us compare the data with a normal distribution to check whether it is skewed.

References

[bc] Bell curve. Math Is Fun website. http://www.mathsisfun.com/definitions/normal-distribution.html