Visualizing Data. Anthony Tanbakuchi, Department of Mathematics, Pima Community College. Introductory Statistics Lectures.




Introductory Statistics Lectures
Visualizing Data
Descriptive Statistics I

Department of Mathematics, Pima Community College

Redistribution of this material is prohibited without written permission of the author.
2009 (Compile date: Tue May 19 14:48:11 2009)

Contents
1 Visualizing Data
  1.1 Frequency distributions: univariate data
      Standard frequency distributions
      Relative frequency distributions
      Cumulative frequency distributions
  1.2 Visualizing univariate quantitative data
      Histograms
      Dot plots
      Stem and leaf plots
  1.3 Visualizing univariate qualitative data
      Bar plot
      Pareto chart
  1.4 Visualizing bivariate quantitative data
      Scatter plots
      Time series plots
  1.5 Summary

1 Visualizing Data

The Six Steps in Statistics
1. Information is needed about a parameter, a relationship, or a hypothesis needs to be tested.
2. A study or experiment is designed.
3. Data is collected.
4. The data is described and explored.
5. The parameter is estimated, the relationship is modeled, the hypothesis is tested.
6. Conclusions are made.

Descriptive statistics

Definition 1.1 (Descriptive statistics). Summarize and describe collected data (sample). Contrast with inferential statistics, which makes inferences and generalizations about populations.

Useful characteristics for summarizing data

Example 1 (Heights of students in class). If we are interested in heights of students and we measure the heights of students in this class, how can we summarize the data in useful ways?

What to look for in data
shape: are most of the data all around the same value, or are they spread out evenly between a range of values? How is the data distributed?
center: what is the average value of the data?
variation: how much do the values vary from the center?
outliers: are any values significantly different from the rest?

Class height data
Load the class data, then look at what variables are defined.

R: load("ClassData.RData")
R: ls()
[1] "class.data"
R: names(class.data)
[1] "gender"            "height"
[3] "forearm"           "height_mother"
[5] "corrective_lenses" "hair_color"
[7] "transportation"    "work_hours"
[9] "credits"
R: class.data$height
 [1] 65 68 71 66 68 65 62 68 77 62 69 70 65 63 67 73 68 70

Task: describe student heights
1. Determine general distribution (shape) of data.
2. Find a measure(s) of the center.
3. Find a measure of the variation.
4. Look for outliers.

1.1 Frequency distributions: univariate data

Definition 1.2 (univariate data). Measurements made on only one variable per observation.
Definition 1.3 (bivariate data). Measurements made on two variables per observation.
Definition 1.4 (multivariate data). Measurements made on many variables per observation.

STANDARD FREQUENCY DISTRIBUTIONS

Definition 1.5 (frequency distribution). A table listing the frequency (number of times) data values occur in each interval. Good first summary of data!

Steps:
1. Choose the number of classes (typically 5-20).
2. Calculate the class width:
$$\text{class width} \approx \frac{\text{max value} - \text{min value}}{\text{number of classes}} \tag{1}$$
3. Choose a starting point, typically the minimum data value or 0.
4. List the lower class limits in the table, then the upper limits.
5. Tally the data in each class; tick marks can be used.

Given:
R: class.data$height
 [1] 65 68 71 66 68 65 62 68 77 62 69 70 65 63 67 73 68 70
(min=62, max=77, n=18)

Question 1. Construct a frequency table for the class heights:
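The five steps above can be sketched in R. This is only one possible set of choices, not the table the notes intend: it assumes 6 classes, rounds the class width from equation (1) up to a whole number so the maximum value is covered, and starts at the minimum value. The names height, n.classes, width, lower, breaks, classes and freq are illustrative and do not appear in the notes.

```r
# Class heights copied from the printed output above
height <- c(65, 68, 71, 66, 68, 65, 62, 68, 77, 62, 69, 70, 65, 63, 67, 73, 68, 70)

n.classes <- 6                                              # step 1 (assumed choice)
width <- ceiling((max(height) - min(height)) / n.classes)   # step 2: 15/6 = 2.5, rounded up to 3
lower <- min(height)                                        # step 3: start at the minimum
breaks <- seq(lower, by = width, length.out = n.classes + 1)  # step 4: class limits 62, 65, ..., 80

# Step 5: tally. right = FALSE makes each class include its lower limit
# and exclude its upper limit, e.g. [62,65) holds 62, 63 and 64.
classes <- cut(height, breaks = breaks, right = FALSE)
freq <- table(classes)
print(freq)
```

Rounding the width up is a deliberate choice here: with a width of exactly 2.5 starting at 62, the maximum value 77 would sit on the last class boundary and need special handling.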

Information in a frequency distribution
1. shape
2. center
3. variation
4. outliers

RELATIVE FREQUENCY DISTRIBUTIONS

Definition 1.6 (Relative frequency distribution). A table of relative frequency counts. Useful for comparing data.

$$\text{relative frequency \%} = \frac{\text{class freq}}{\text{total number}} \times 100\% \tag{2}$$

Example 2. Add a relative frequency column onto the student height table.
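Equation (2) can be applied to a whole frequency table at once. The sketch below is self-contained and assumes classes of width 3 starting at 62, which is one possible choice rather than the one fixed by the notes. The names breaks, freq, rel.freq and freq.table are illustrative.

```r
# Class heights copied from the printed output in the notes
height <- c(65, 68, 71, 66, 68, 65, 62, 68, 77, 62, 69, 70, 65, 63, 67, 73, 68, 70)

# Assumed classes: [62,65), [65,68), ..., [77,80)
breaks <- seq(62, 80, by = 3)
freq <- table(cut(height, breaks = breaks, right = FALSE))

# Equation (2): class frequency divided by the total number, times 100%
rel.freq <- 100 * freq / sum(freq)

# Frequency table with the relative frequency column added
freq.table <- data.frame(class = names(freq),
                         frequency = as.vector(freq),
                         rel.freq.pct = round(as.vector(rel.freq), 1))
print(freq.table)
```

The relative frequencies should add up to 100%, apart from rounding in the displayed column.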

CUMULATIVE FREQUENCY DISTRIBUTIONS

Definition 1.7 (Cumulative frequency distribution). Table listing the total counts less than the upper class limits for each interval.

Steps to construct:
1. Make a frequency distribution.
2. Make a second table and change the intervals to "less than [upper limit]".
3. Accumulate the frequencies: each entry is the sum of that class's frequency and all the previous frequencies.

Example 3. Make a cumulative frequency distribution for student heights.

1.2 Visualizing univariate quantitative data

HISTOGRAMS

Definition 1.8 (Histogram). A bar plot of a frequency table.
x-axis: class boundaries (or midpoints)
y-axis: frequency (counts)
(No spaces between bars.)

Histogram: hist(x)
Where x is a vector of data.

Example 4. Histogram of student heights

R: hist(class.data$height, main = "Student heights")

[Figure: histogram titled "Student heights"; x-axis class.data$height, y-axis Frequency]

Example 5. Comparison of student heights versus gender:

R: par(mfrow = c(1, 2))
R: hist(class.data$height[class.data$gender == "MALE"],
+     main = "Male students", xlab = "height")
R: hist(class.data$height[class.data$gender == "FEMALE"],
+     main = "Female students", xlab = "height")

[Figure: two side-by-side histograms titled "Male students" and "Female students"; x-axis height, y-axis Frequency]

Normal Shaped Distribution
Example of a normal shaped histogram (sometimes called a "bell shape").

[Figure: histogram titled "Normal Shape Histogram"; x-axis x, y-axis Density. Red line: ideal normal distribution]
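A histogram like the normal-shaped example can be sketched with simulated data. This is an illustration, not the code used for the figure in the notes: the sample size, the mean of 50, the standard deviation of 10 and the seed are all assumptions, chosen only so the x-axis roughly covers 20 to 80.

```r
set.seed(1)                            # arbitrary seed so the sketch is reproducible
x <- rnorm(1000, mean = 50, sd = 10)   # simulated data with assumed parameters

# freq = FALSE puts density (not counts) on the y-axis,
# so the histogram and the normal curve share the same scale.
h <- hist(x, freq = FALSE, main = "Normal Shape Histogram", xlab = "x")

# Overlay the ideal normal density in red
curve(dnorm(x, mean = 50, sd = 10), add = TRUE, col = "red")
```

With a density scale the bar areas sum to 1, which is what allows the curve to be drawn over the bars directly.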

DOT PLOTS

STEM AND LEAF PLOTS

Definition 1.9 (Stem and leaf plot). A character-based plot, similar to a histogram, that breaks each data value into a stem and a leaf. Quick and easy histogram.
stem: like the classes of a histogram, generally the most significant digit of the data.
leaf: digit to the right of the stem.
For each data point its leaf value (digit) is recorded to the right of the stem. More leaves indicate more values in the stem's class.

Stem and leaf plot: stem(x)
Where x is a vector of data.

Example 6. Stem and leaf plot of student heights

R: stem(class.data$height)

  The decimal point is 1 digit(s) to the right of the |

  6 | 223
  6 | 5556788889
  7 | 0013
  7 | 7

Below is a stem and leaf plot for a small set of data stored in x.

R: stem(x)

  The decimal point is at the |

  5 | 558
  6 | 04
  7 |
  8 | 3

Question 2. Reconstruct the data set from the above plot.

Other ways to visualize frequency data
relative frequency histogram: bar plot with vertical axis in percent rather than counts.
frequency polygon: just like a histogram but uses line segments instead of bars.
ogive (pronounced "oh-jive"): line graph of cumulative frequencies.
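The cumulative frequencies behind an ogive can be computed with cumsum(). The sketch below is self-contained and assumes classes of width 3 starting at 62, which is one possible choice rather than the one fixed by the notes. The names breaks, freq, cum.freq and upper are illustrative.

```r
# Class heights copied from the printed output in the notes
height <- c(65, 68, 71, 66, 68, 65, 62, 68, 77, 62, 69, 70, 65, 63, 67, 73, 68, 70)

# Assumed classes: [62,65), [65,68), ..., [77,80)
breaks <- seq(62, 80, by = 3)
freq <- table(cut(height, breaks = breaks, right = FALSE))

# Accumulate the class frequencies
cum.freq <- cumsum(as.vector(freq))
upper <- breaks[-1]   # the "less than" limits: 65, 68, ..., 80

# Cumulative frequency table
print(data.frame(less.than = upper, cum.freq = cum.freq))

# Ogive: cumulative frequency plotted against the upper class limits
plot(upper, cum.freq, type = "b", main = "Ogive of student heights",
     xlab = "height (less than)", ylab = "cumulative frequency")
```

The last cumulative frequency equals the number of observations, which is a quick check that no value fell outside the classes.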

1.3 Visualizing univariate qualitative data

Univariate qualitative data is categorical data concerning one variable.

Example 7 (Univariate qualitative data). Student hair color is a single variable that is qualitative.

R: class.data$hair_color
 [1] BROWN BROWN BLACK BLOND BROWN BLACK BROWN BROWN BROWN
[10] BROWN BLOND BROWN BROWN BROWN BROWN BROWN BLACK BROWN
Levels: BLACK BLOND BROWN

Summarizing qualitative data

Frequency table: table(x)
Where x is a vector of data.

Example 8. Summarizing student hair color.

R: table(class.data$hair_color)

BLACK BLOND BROWN
    3     2    13

We will now look at methods for visualizing these types of tables.

BAR PLOT

Definition 1.10 (Bar plot). Like a histogram but for qualitative data frequencies. Each bar represents the count of a category.
x-axis: categories
y-axis: counts or proportions. Misleading if y-axis does not start at 0.
Makes sense to put spaces between bars since data is not continuous.

Bar plot: t=table(x); barplot(t)
Where x is a vector of categorical data and we plot the frequency table t.

Example 9. Bar plot of student hair color

R: t = table(class.data$hair_color)
R: barplot(t, main = "Student Data")

[Figure: bar plot titled "Student Data"; categories BLACK, BLOND, BROWN on the x-axis, counts on the y-axis]

PARETO CHART

Definition 1.11 (Pareto chart). A bar plot of categorical data frequencies sorted in decreasing frequency.

Pareto chart: t=table(x); t=sort(t, decreasing=TRUE); barplot(t)
Where x is a vector of categorical data and we plot the sorted frequency table t.

Example 10. Pareto chart of student hair color.

R: t = table(class.data$hair_color)
R: t = sort(t, decreasing = TRUE)
R: t

BROWN BLACK BLOND
   13     3     2

R: barplot(t, main = "Class Data")
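Since class.data comes from the course file, the Pareto chart can also be tried without it. The sketch below rebuilds a hair color vector from the counts printed in Example 8 (3 BLACK, 2 BLOND, 13 BROWN). The order of individual students is not preserved, which does not matter for a frequency table. The names hair and hair.counts are illustrative.

```r
# Rebuild the categorical data from the printed counts
hair <- rep(c("BLACK", "BLOND", "BROWN"), times = c(3, 2, 13))

# Frequency table, sorted so the most common category comes first
hair.counts <- sort(table(hair), decreasing = TRUE)
print(hair.counts)

# Pareto chart: a bar plot of the sorted table
barplot(hair.counts, main = "Class Data")
```

Note that the logical constant must be written TRUE in upper case, because R is case sensitive and does not recognize true.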

[Figure: bar plot titled "Class Data"; categories BROWN, BLACK, BLOND on the x-axis in decreasing order, counts on the y-axis]

Pie charts

Definition 1.12 (Pie chart). Used to display relative frequencies (or proportions) for categorical data. The area of each slice represents the relative frequency. Although widely found in the media, pie charts are no longer commonly used by statisticians since it's hard to gauge the areas visually.

Pie chart: t=table(x); pie(t)
Where x is a vector of categorical data and we plot the frequency table t.

Example 11. Pie chart of student hair color.

R: t = table(class.data$hair_color)
R: pie(t, main = "Student Data")

[Figure: pie chart titled "Student Data" with slices BLOND, BLACK, BROWN]

1.4 Visualizing bivariate quantitative data

Definition 1.13 (Bivariate quantitative data). Paired data dealing with two variables. For each measurement of the first variable there is a corresponding measurement in the second variable.

Example 12 (Bivariate data). We may be interested in studying the relationship between height and weight of students. For each student in our study we measure both variables, the height and weight.

We will now look at ways to visualize bivariate quantitative data.

SCATTER PLOTS

Definition 1.14 (Scatter plot). A plot of paired (x, y) data. The two variables are represented by x and y.
horizontal axis: x values.
vertical axis: y values.

Scatter plot: plot(x,y)
Where both x and y are ordered vectors of data. Ordered meaning the first element of x corresponds to the first element of y.

Example 13. Scatter plot of student height vs forearm length.

R: plot(class.data$height, class.data$forearm, main = "Student Data")

[Figure: scatter plot titled "Student Data"; x-axis class.data$height, y-axis class.data$forearm]

TIME SERIES PLOTS

Definition 1.15 (Time series plot). Plots data points as a function of time.
x-axis: time
y-axis: values of measured variable.

Example 14. Measure the height of a bean sprout over time. Plot the pairs of data (time, height) using a scatter diagram with the x variable being time.

Time series plot: plot(t, y, type="b")
Where both t and y are ordered vectors of data. Ordered meaning the first element of t corresponds to the first element of y. The optional argument type="b" makes a plot with both points and lines.

Example 15. Time series plot of Nile River flow data.

R: date = 1871:1970
R: flow = c(1120, 1160, 963, 1210, 1160, 1160, 813,
+     1230, 1370, 1140, 995, 935, 1110, 994, 1020,
+     960, 1180, 799, 958, 1140, 1100, 1210, 1150,
+     1250, 1260, 1220, 1030, 1100, 774, 840, 874,
+     694, 940, 833, 701, 916, 692, 1020, 1050,
+     969, 831, 726, 456, 824, 702, 1120, 1100,
+     832, 764, 821, 768, 845, 864, 862, 698, 845,
+     744, 796, 1040, 759, 781, 865, 845, 944, 984,
+     897, 822, 1010, 771, 676, 649, 846, 812, 742,
+     801, 1040, 860, 874, 848, 890, 744, 749, 838,
+     1050, 918, 986, 797, 923, 975, 815, 1020,
+     906, 901, 1170, 912, 746, 919, 718, 714, 740)
R: plot(date, flow, type = "b", main = "Annual flow of the river Nile at Aswan 1871 to 1970")

[Figure: time series plot titled "Annual flow of the river Nile at Aswan 1871 to 1970"; x-axis date, y-axis flow]

If you want to make a time series plot, you will typically define t=1871:1970, y=c(y1, y2, ...), then plot(t, y, type="b").

1.5 Summary

Univariate quantitative data (look for shape, center, variation, outliers):
1. Frequency table: standard, cumulative, relative
2. Histogram: visual for frequency table. hist(x), stem(x)

Univariate qualitative data:
1. Frequency table: t=table(x)
2. Pareto chart: barplot(sort(t, decreasing = TRUE))

Bivariate data:
1. Scatter plot: plot(x, y)
2. Time series plot: plot(t, y, type="b")

Additional Examples to try

Example 16. Take a look at the built-in morley table of data in R. This is the classical data from the Michelson and Morley experiment that measured the speed of light (values are in km/s minus 299,000). The data consists of five experiments, each consisting of 20 consecutive runs.
Make a histogram of the speed of light. What does the histogram tell you about the data?
Make a stem and leaf plot of the data. How does it compare with the histogram?
Are there any outliers?
What is the range of the data?
What value are most of the measurements around?

Example 17. Take a look at the built-in women table of data in R. This data set gives the average heights and weights for American women aged 30 to 39.
Make a scatter plot of height vs. weight. Are any trends visible?
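As a further example to try, the Nile flow values typed by hand in Example 15 also ship with R as the built-in Nile time series, which avoids retyping them. The sketch below assumes the standard datasets package is attached, as it is in a default R session. The names date and flow follow Example 15.

```r
# Nile is a built-in time series object: annual flow, 1871 to 1970
date <- as.vector(time(Nile))   # the years 1871, 1872, ..., 1970
flow <- as.vector(Nile)         # the 100 flow measurements

# Same plot call as in Example 15: points joined by lines
plot(date, flow, type = "b",
     main = "Annual flow of the river Nile at Aswan 1871 to 1970")
```

Converting with as.vector() keeps the example in the plot(t, y, type="b") form used throughout the notes. Calling plot(Nile) directly would also work, but draws lines only by default.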