Exploratory Data Analysis

Exploratory Data Analysis
Paul Cohen
ISTA 370, Spring 2012

Outline
Data, revisited
The purpose of exploratory data analysis
Learning to see

Data: A Review
Things and Data

Data: A Review
Where Data Come From
Data are measurements of individuals (people, trees, countries, ecosystems...).
An individual's data: 56 years old, 70" tall, 180 lbs, brown eyes, moderately presbyopic, good health, married, one child...
[Figure: an individual, the data about that individual, and a data table]

Exploratory Data Analysis
What is Exploratory Data Analysis (EDA)?
In terms of the Fundamental Model of Data, y = f(x, ε):
  EDA infers which factors strongly and weakly influence y, and the functions that combine these factors.
  EDA examines ε to see whether it contains evidence of other important but unmeasured ("hidden") factors.
  Confirmatory studies test whether x really is a causal factor that influences y.
Exploratory studies are to confirmatory studies as test kitchens are to cookbook recipes.
EDA generally doesn't test hypotheses but, rather, helps the data tell its story.
EDA helps you understand phenomena and suggests hypotheses to test in confirmatory studies.
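
The point about examining ε can be made concrete with a small simulation. This is a sketch, not part of the lecture: the variables x, hidden, and y, and every coefficient, are made up for illustration. It fits y on x alone and then looks at the residuals, which stand in for ε.

```r
# Simulated data: y depends on a measured factor x and an unmeasured one.
set.seed(7)
n      <- 500
x      <- runif(n)
hidden <- rbinom(n, 1, 0.5)               # hypothetical unmeasured factor
y      <- 2 * x + 3 * hidden + rnorm(n, sd = 0.3)

# Model y using only the measured factor.
fit <- lm(y ~ x)
res <- resid(fit)                         # our estimate of epsilon

# If epsilon were pure noise, the residuals would form one clump.
# Here they split into two groups, which is evidence of a hidden factor.
group_means <- tapply(res, hidden, mean)
gap <- unname(group_means["1"] - group_means["0"])
print(group_means)
# hist(res, breaks = 30)   # uncomment to see the two clumps
```

In a real analysis the hidden factor is of course not available for grouping; the two clumps in the residual histogram are what would prompt the search for it.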

Exploratory Data Analysis
What is Exploratory Data Analysis? Learning to See
Data have structure that is evidence of causal influences.
EDA uncovers, exposes, and clarifies this structure.
EDA is like hunting for fossils: it's a skill, and you must learn to see not only what's in front of you, but what lies within data.
EDA asks, what do I see, and what does it mean?
Like any other skill, EDA takes a lot of practice.

Exploratory Data Analysis
Load Some Data
> read.table.ista370<-function(filename){
    dataurl<-"http://www.sista.arizona.edu/~cohen/ista%20370/d
    # Reads a data frame from a URL path rooted at ISTA370 dat
    read.table(paste(dataurl,filename,sep=""))
  }
>
> # taheri<-read.table.ista370("taheri1.txt")
> # iris<-read.table.ista370("iris.txt")
> # heightc<-read.table.ista370("heightc.txt")
> # treering<-read.table.ista370("treering1.txt")
> # blast<-read.table.ista370("blastsummary.txt")
> # kinect<-read.table.ista370("onemovie.txt")
> # readability<-read.table.ista370("readability.txt")

What Do You See? What Does it Mean?
> hist(iris$petal.length,col="grey",main=NULL)
[Figure: histogram of iris$petal.length; vertical axis: Frequency]

What Do You See? What Does it Mean?
> ipl<-iris$petal.length
> hist(ipl,prob=TRUE,ylim=c(0,1),main=NULL)
> lines(density(ipl[iris$species=="setosa"]),col="red")
> lines(density(ipl[iris$species=="versicolor"]),col="green")
> lines(density(ipl[iris$species=="virginica"]),col="blue")
[Figure: density-scaled histogram of ipl with one density curve per species; vertical axis: Density]
Looking at density curves for each species, we see that the histogram did indeed indicate two or more separate populations (species).

What Do You See? What Does it Mean?
> boxplot(iris$petal.length~iris$species, ylab="Petal.Length",xlab="Species")
[Figure: boxplots of Petal.Length for setosa, versicolor, and virginica; horizontal axis: Species]
A boxplot by species confirms this and summarizes the petal length statistics for each species.

Boxplots
[Figure: annotated boxplot with these labels]
outliers
whisker (various interpretations)
upper quartile (75% quantile)
median
interquartile range
lower quartile (25% quantile)
whisker
outliers

Median, Quartiles, Interquartile Range
If you sort the values in a sample from lowest to highest, the median is the middle value, or the average of the two middle values when the sample contains an even number of points.
The median is the 50th percentile (the 0.5 quantile): the value for which 50% of the values are greater.
The lower quartile is the 25th percentile, above which 75% of the values are found.
The upper quartile is the 75th percentile, above which 25% of the values are found.
The interquartile range is a measure of variability: the difference between the upper and lower quartiles.
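
These definitions map directly onto base R functions. A minimal sketch, assuming a made-up sample x of ten values (not from the course data). Note that quantile() interpolates between data points by default, so in small samples the fractions of values above each quartile are only approximately 75% and 25%.

```r
# Hypothetical sample with an even number of points.
x <- c(12, 7, 3, 9, 15, 4, 8, 10, 6, 11)

sort(x)                       # 3 4 6 7 8 9 10 11 12 15
med <- median(x)              # 8.5: average of the two middle values, 8 and 9
q   <- quantile(x, probs = c(0.25, 0.75))   # lower and upper quartiles
iqr <- IQR(x)                 # upper quartile minus lower quartile

print(med)
print(q)                      # 6.25 and 10.75 with R's default interpolation
print(iqr)                    # 4.5
```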

Median, Quartiles, Interquartile Range
The median is robust against outliers; the mean is not.
Suppose 100 families in a neighborhood each make $40,000/year. When a millionaire (income $1,000,000/year) moves in, the mean jumps from $40,000 to $49,504/year. What happens to the median?
Before the millionaire arrived, the variance in income was zero. Afterwards the variance is over nine billion! What happens to the interquartile range?
Suppose you have a sorted sample of 9 elements; the median is the fifth element. If you add another element, what will the median be? By how many locations in the sorted distribution can the median shift?
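
The neighborhood example is easy to check numerically. The sketch below assumes the millionaire's income is exactly $1,000,000/year, which is the figure that reproduces the $49,504 mean; the object names are illustrative.

```r
# 100 families at $40,000/year, then one newcomer.
# Assumption: the millionaire's income is $1,000,000/year.
before <- rep(40000, 100)
after  <- c(before, 1e6)

mean_after   <- mean(after)     # 5,000,000 / 101
median_after <- median(after)   # the 51st of 101 sorted values
var_before   <- var(before)
var_after    <- var(after)      # sample variance (divides by n - 1)
iqr_after    <- IQR(after)

print(c(mean = mean_after, median = median_after,
        var = var_after, IQR = iqr_after))

# Sorted sample of 9 elements: the median is the fifth element.
x9 <- 1:9
median(x9)             # 5
median(c(x9, 100))     # 5.5: average of the original 5th and 6th values
median(c(x9, -100))    # 4.5: average of the original 4th and 5th values
```

The mean and variance react strongly to the single newcomer, while the median and interquartile range do not move at all.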

Symmetry and Skew
> with(blast, hist(Test0,breaks=20,col="grey",main=NULL))
> with(treering, hist(width,breaks=20,col="grey",main=NULL))
[Figure: histograms of Test0 and of width; vertical axes: Frequency]
Test0 is skewed to the right, meaning it has a long tail of higher values, while the tree-ring width is nearly symmetric.

Transformations
> attach(blast)
> hist(Train0,breaks=20,col="grey",main=NULL)
> Train0Squared<-Train0^2 #square the Train0 data
> hist(Train0Squared,breaks=20,col="grey",main=NULL)
[Figure: histograms of Train0 and of Train0Squared; vertical axes: Frequency]
A simple transformation amplifies an otherwise hidden feature.

Transformations
> Train0Squared<-with(blast,Train0^2)
> with(blast,plot(density(Train0Squared)))
> with(blast,lines(density(Train0Squared[gender=="female"]),c
> with(blast,lines(density(Train0Squared[gender=="male"]),col
[Figure: density.default(x = Train0Squared), with a density curve for each gender; N = 187, Bandwidth = 0.04378; vertical axis: Density]
Does gender explain the bump?

What Explains the Bump
New Topics
> plot(Train0Squared,NewSkills0,col=gender)
> plot(NewSkills0,Train0Squared,col=gender)
[Figure: scatterplots of NewSkills0 against Train0Squared and of Train0Squared against NewSkills0, colored by gender]
The number of topics to which students were exposed (NewSkills0) seems to explain the bump, but gender doesn't.

What Explains the Bump
New Topics
> precocious<-NewSkills0>8
> plot(density(Train0Squared))
> lines(density(Train0Squared[precocious=="TRUE"]),col="red")
> lines(density(Train0Squared[precocious=="FALSE"]),col="blue
[Figure: density.default(x = Train0Squared), with separate density curves for precocious and other students; N = 187, Bandwidth = 0.04378; vertical axis: Density]
So the students who saw too many subjects account for the bump.

Boxplots instead of density plots
> boxplot(Train0Squared~precocious, xlab="precocious", ylab="proportion training items correct" )
[Figure: boxplots for precocious = FALSE and precocious = TRUE; vertical axis: proportion training items correct]

What Explains the Bump
Why did some students see hard problems?
Exploratory data analysis helped us find and amplify an odd feature of data: Some students saw too many hard problems while training for the first test. How did this happen?
> table(precocious, policy)
          policy
precocious DBN_11 EXPERT RANDOM
     FALSE     39    119      4
     TRUE       0      0     25
All these precocious students were in one condition of the experiment. In the RANDOM condition, training problems were selected at random, so we shouldn't be surprised that students in this condition bombed on the first test!
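
The cross-tabulation can be re-entered by hand to turn counts into proportions. A minimal sketch that uses only the six counts printed above, so the blast data frame is not needed; the names counts and share are illustrative.

```r
# Counts copied from the table(precocious, policy) output.
counts <- matrix(c(39, 0, 119, 0, 4, 25), nrow = 2,
                 dimnames = list(precocious = c("FALSE", "TRUE"),
                                 policy = c("DBN_11", "EXPERT", "RANDOM")))

total <- sum(counts)                      # should match N = 187
share <- prop.table(counts, margin = 2)   # proportions within each policy

print(total)
print(round(share, 3))
```

Column proportions make the pattern explicit: every precocious student is in the RANDOM column, and most RANDOM students are precocious.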

Outline
Data, revisited
The purpose of exploratory data analysis
Learning to see: histograms, boxplots, median and other robust statistics, transforming data... missing values, tips.

What Do You See? What Does it Mean?
> ts.plot(kinect$lhand.y,ylim=c(-500,1000),col="red")
[Figure: time series of kinect$lhand.y; horizontal axis: Time]

What Do You See? What Does it Mean?
> ts.plot(kinect$lhand.y,ylim=c(-500,1000),col="red")
> lines(rep(0,150))
[Figure: time series of kinect$lhand.y with a horizontal line at zero; horizontal axis: Time]
Constants and lack of change are rare in biometric data. Zero is a special number. Perhaps the Kinect codes missing data as 0. What is really happening around time 115?

Missing Data
Data can be missing for many reasons (e.g., subjects in the BLAST experiment were allowed to leave once they hit a maximum score, so they didn't take all tests).
Missing data is usually given a code, such as -999 or NA.
The Kinect code of zero is unhelpful. Why?
R sometimes refuses to do math on random variables with missing values. Is this unhelpful?

Missing Data in R
> # won't work, missing values:
> with(blast,mean(Test3))
[1] NA
> # this time, exclude missing values:
> with(blast,mean(Test3,na.rm=TRUE))
[1] 0.3843575
> # get their indices:
> with(blast,which(is.na(Test3)))
[1] 17 39 41 55 73 78 142 179
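
The same pattern can be tried without the course data. The sketch below uses a made-up vector scores in which -999 is the missing-data code; it shows why a numeric code is dangerous if left in place and how recoding it to NA makes R flag the problem.

```r
# Hypothetical scores; -999 marks a missing value.
scores <- c(0.40, -999, 0.55, 0.30, -999, 0.75)

naive_mean <- mean(scores)          # silently treats -999 as a real score

# Recode the sentinel to NA so that R will not compute on it by accident.
scores[scores == -999] <- NA
na_mean    <- mean(scores)              # NA: R refuses to guess
clean_mean <- mean(scores, na.rm = TRUE)  # mean of the four observed scores
missing_at <- which(is.na(scores))      # positions of the missing values

print(naive_mean)
print(na_mean)
print(clean_mean)
print(missing_at)
```

A code of zero, as in the Kinect data, is worse than -999 because zero is also a plausible measurement, so it cannot be recoded safely after the fact.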

Missing Data: It Matters Why!
Most experiment results are based on the assumption of random sampling or random assignment to conditions.
When data are Missing Completely At Random (MCAR), missing data don't violate these assumptions. MCAR means missingness is independent of measured and unmeasured factors.
Data are Missing at Random (MAR) when the reason they are missing has nothing to do with the missing values themselves, though it may be related to other measured factors. If food poisoning is proportional to fast food consumption (FFC), but FFC is unrelated to enrollment in ISTA 370, then when you're absent due to food poisoning, you are MAR.
NMAR data are not missing at random. If students ordinarily take four tests, but are excused from future tests if they get a perfect score on a test, are they MAR or NMAR? How might NMAR introduce errors in analysis?
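
A simulation gives a feel for why the reason matters. This is a sketch with invented scores, not the BLAST data: test2 and test3 are hypothetical, and the 0.9 cutoff for being excused is an assumption chosen for illustration. It compares dropping values completely at random with dropping the values of students who did well earlier.

```r
set.seed(370)
n     <- 10000
test2 <- runif(n)                                     # hypothetical earlier score
test3 <- pmin(1, pmax(0, test2 + rnorm(n, sd = 0.1))) # later score, related to test2

# Case 1: 10% of test3 scores go missing completely at random.
mcar <- test3
mcar[sample(n, n / 10)] <- NA

# Case 2: students with a high earlier score are excused from test3.
excused <- test3
excused[test2 >= 0.9] <- NA

full_mean    <- mean(test3)
mcar_mean    <- mean(mcar, na.rm = TRUE)
excused_mean <- mean(excused, na.rm = TRUE)

print(round(c(full = full_mean, mcar = mcar_mean, excused = excused_mean), 3))
```

Under random deletion the observed mean stays close to the full mean; when the strongest students are the ones missing, the observed mean is pulled downward.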

Not Missing At Random (NMAR) Data
Participants in the BLAST experiment took four tests, but those who scored 18 or more on any test were excused from later tests.
> which(is.na(Test3)) # who didn't take Test3
[1] 17 39 41 55 73 78 142 179
> Test2[which(is.na(Test3))] # what were their Test2 scores
[1] 1.00 NA 0.95 NA 1.00 0.90 1.00 1.00
> T3<-Test3
> mean(T3,na.rm=TRUE) # Mean Test3 scores
[1] 0.3843575
> T3[is.na(T3)]<-1 # Replace NAs with perfect scores
> mean(T3) # Mean T3 scores
[1] 0.4106952

Not Missing At Random (NMAR) Data
Does it Matter?
If NMAR data are evenly distributed over experimental conditions, then they might not matter so much. So let's check:
> nottest3<-which(is.na(Test3)) # who didn't take Test3
> gender[nottest3]
[1] female male male male female male male male
Levels: female male
> cond.plus.policy[nottest3]
[1] DBN_11_NO_CHOICE DBN_11_NO_CHOICE DBN_11_NO_CHOICE
[4] EXPERT_NO_CHOICE EXPERT_NO_CHOICE EXPERT_NO_CHOICE
[7] EXPERT_CHOICE EXPERT_CHOICE_ZPD
6 Levels: DBN_11_NO_CHOICE ... RANDOM_NO_CHOICE

Not Missing At Random (NMAR) Data
Does it Matter?
> aggregate.data.frame(Test3,by=list(condition,gender), FUN=mean, na.rm=TRUE)
    Group.1 Group.2         x
1    CHOICE  female 0.3939024
2 NO_CHOICE  female 0.3139535
3    CHOICE    male 0.3901961
4 NO_CHOICE    male 0.4375000
> aggregate.data.frame(T3,by=list(condition,gender), FUN=mean, na.rm=TRUE)
    Group.1 Group.2         x
1    CHOICE  female 0.3939024
2 NO_CHOICE  female 0.3444444
3    CHOICE    male 0.4132075
4 NO_CHOICE    male 0.4843750
Note the useful aggregate.data.frame command, which applies FUN to subsets of a variable defined by by=...
T3 sets all the NAs to 1.0, so it's what students would get if they maxed out the tests they were allowed to skip.

Missing and Censored Data
When a small fraction of your data is missing, you can ignore it or impute its values.
Common imputation methods involve matching records that have missing data to records that don't, and using one or more of the complete records to infer the missing value.
Censored data is a more challenging problem. Examples:
  Inferring average runtime of an algorithm for a batch of jobs that's allowed to run for a fixed time.
  Inferring age at death of a treatment sample when some people are still alive when the study ends.
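
The runtime example can be sketched with simulated jobs. Everything here is an assumption made for illustration: runtimes are drawn from an exponential distribution with mean 10, and the time limit is 15. The sketch shows that both obvious summaries of the censored data underestimate the true average.

```r
set.seed(1)
true_runtime <- rexp(5000, rate = 1 / 10)   # hypothetical runtimes, mean about 10
limit        <- 15                          # jobs are stopped at this time

observed <- pmin(true_runtime, limit)       # censored jobs are recorded at the limit
finished <- true_runtime <= limit           # TRUE if the job completed in time

true_mean     <- mean(true_runtime)         # not knowable in practice
censored_mean <- mean(observed)             # treats the limit as the runtime
finished_mean <- mean(observed[finished])   # ignores unfinished jobs entirely

print(round(c(true = true_mean, censored = censored_mean,
              finished_only = finished_mean), 2))
```

Neither shortcut recovers the true mean, which is why censored data usually call for methods from survival analysis rather than simple averaging.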

What Do You See? What Does it Mean?
> hist(heightc$weight,main=NULL,xlab="weight of 33 College St
[Figure: histogram; horizontal axis: Weight of 33 College Students; vertical axis: Frequency]

What Do You See? What Does it Mean?
[Figure: histogram; horizontal axis: Weight of 33 College Students; vertical axis: Frequency]
Common sense tells us that a weight of 50 lbs or less is unlikely and is probably an erroneous datum.

What Do You See? What Does it Mean?
[Figure not preserved in this transcript]

What Do You See? What Does it Mean?
The vertical axes are different, so it's hard to tell what's happening.

What Do You See? What Does it Mean?
[Figure not preserved in this transcript]

What Do You See? What Does it Mean?
Not all apparent differences are real differences. Mentally group your data to see if something might explain apparent differences.

What Do You See? What Does it Mean?
> m<-c(0.93,0.95,0.94,0.86,NA,0.86, 0.89, 0.85, 0.90, 0.85, 0
> s<-c(0.26, 0.23, 0.25, 0.34,NA,0.34, 0.31, 0.35, 0.30, 0.35
> barx <- barplot(m,ylim=c(0,1.5),names.arg=1:11,axis.lty=1,x
> error.bar(barx,m,s)
[Figure: bar chart with error bars for conditions 1 to 11; horizontal axis: Condition]
Not all differences are real differences.
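
The error.bar function called on this slide is not part of base R, and its definition does not appear in the transcript. The sketch below is one plausible way to write such a helper with arrows(); it is an assumption, not the lecture's code. It uses only the ten values of m and s that are fully visible above (the eleventh is cut off), so the bars are labeled 1 to 10.

```r
# Hypothetical helper: draw vertical bars from m - s to m + s at positions x.
error.bar <- function(x, m, s) {
  ok <- !is.na(m) & !is.na(s)                 # skip conditions with no data
  arrows(x[ok], m[ok] - s[ok], x[ok], m[ok] + s[ok],
         angle = 90, code = 3, length = 0.05) # flat caps at both ends
}

# The ten values visible in the slide; the truncated eleventh is left out.
m <- c(0.93, 0.95, 0.94, 0.86, NA, 0.86, 0.89, 0.85, 0.90, 0.85)
s <- c(0.26, 0.23, 0.25, 0.34, NA, 0.34, 0.31, 0.35, 0.30, 0.35)

barx <- barplot(m, ylim = c(0, 1.5), names.arg = 1:10, xlab = "Condition")
error.bar(barx, m, s)

# Do all the intervals overlap? Compare the highest lower end
# with the lowest upper end.
highest_lower <- max(m - s, na.rm = TRUE)
lowest_upper  <- min(m + s, na.rm = TRUE)
print(highest_lower < lowest_upper)
```

When every interval overlaps every other, as here, the differences in bar height are small relative to the variation within each condition.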

What Do You See? What Does it Mean?
> plot(taheri$a,col="red",type="l",ylim=c(0,1))
> lines(taheri$b,col="blue")
[Figure: line plots of taheri$a (red) and taheri$b (blue) against Index]

What Do You See? What Does it Mean?
> plot(taheri$a,col="red",type="l",ylim=c(0,1))
> lines(taheri$b,col="blue")
> cor(taheri$a,taheri$b)
[1] -0.93024
[Figure: line plots of taheri$a (red) and taheri$b (blue) against Index]
Although features A and B were supposed to be independent, they were normalized to sum to a constant, rendering them dependent.
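
The effect of normalization can be reproduced with simulated features. This sketch does not use the taheri data: it draws three independent columns (the third column, C, is a hypothetical extra feature) and rescales each row to sum to 1.

```r
set.seed(42)
n   <- 1000
raw <- matrix(runif(3 * n), ncol = 3,
              dimnames = list(NULL, c("A", "B", "C")))

cor_raw <- cor(raw[, "A"], raw[, "B"])     # independent draws: close to zero

# Normalize each row so that A + B + C = 1.
norm <- raw / rowSums(raw)
cor_norm <- cor(norm[, "A"], norm[, "B"])  # now clearly negative

print(round(c(raw = cor_raw, normalized = cor_norm), 3))
```

Once the features must share a fixed total, a larger A leaves less room for B, so a negative correlation appears even though the raw features are unrelated. With only two features the normalized correlation would be exactly -1.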

What Do You See? What Does it Mean?
> with(mtcars,plot(disp,mpg))
> with(mtcars,cor(disp,mpg))
[1] -0.8475514
[Figure: scatterplot of mpg against disp]

What Do You See? What Does it Mean?
> palette(c("blue","red","forestgreen"))
> with(mtcars,plot(disp,mpg,col=cyl))
> with(mtcars,cor(disp,mpg))
[1] -0.8475514
[Figure: scatterplot of mpg against disp, colored by cyl]
Coloring by a third variable shows that the correlation between disp and mpg depends on cyl.
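
What the colors suggest can also be checked numerically by computing the correlation within each cylinder group. A minimal sketch using the mtcars data set that ships with R; the names overall and by_cyl are illustrative.

```r
# Correlation over all cars, as on the slide.
overall <- with(mtcars, cor(disp, mpg))

# Correlation within each value of cyl (4, 6, and 8 cylinders).
by_cyl <- sapply(split(mtcars, mtcars$cyl),
                 function(d) cor(d$disp, d$mpg))

print(round(overall, 7))
print(round(by_cyl, 3))
```

Comparing by_cyl with overall shows how much of the overall relationship comes from differences between cylinder groups rather than from the relationship within any one group.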

Tips for looking at data
Tips: Not All Data are Real Data
Data can be noisy, missing, contaminated, or even intentionally wrong (e.g., perverse subjects).
You rarely know which data are suspicious.
You have to look carefully for strange values and decide what to do with them.
[Figure: time series of kinect$lhand.y, and histogram of Weight of 33 College Students]

Tips for looking at data
Tips: Not All Differences are Real Differences
Get the right vertical axis scale.
Augment your picture with measures of variation.
[Figure: bar chart with error bars for conditions 1 to 11; horizontal axis: Condition]

Tips for looking at data
Tips: Look for holes in data
Holes are regions that have fewer data. You ask, why are so few data there?
[Figure: scatterplot of Train0Squared and NewSkills0, and histogram of iris$petal.length]

Tips for looking at data
Tips: Don't rely on summaries. Look at the data!
Means and other summaries are useful, but the raw data show patterns obscured by summaries.
[Figure: unemployment by year, 1990 to 2010]