Univariate Descriptive Statistics

Similar documents
Exercise 1.12 (Pg )

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Exploratory data analysis (Chapter 2) Fall 2011

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

MEASURES OF VARIATION

Using SPSS, Chapter 2: Descriptive Statistics

Northumberland Knowledge

Descriptive Statistics

Data Exploration Data Visualization

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Statistics Revision Sheet Question 6 of Paper 2

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

How To Write A Data Analysis

Variables. Exploratory Data Analysis

Geostatistics Exploratory Analysis

Characteristics of Binomial Distributions

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

3: Summary Statistics

Means, standard deviations and. and standard errors

Descriptive Statistics

Summarizing and Displaying Categorical Data

Describing, Exploring, and Comparing Data

Exploratory Data Analysis. Psychology 3256

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

Exploratory Data Analysis

AP * Statistics Review. Descriptive Statistics

Introduction; Descriptive & Univariate Statistics

Lecture 1: Review and Exploratory Data Analysis (EDA)

Bar Graphs and Dot Plots

WEEK #22: PDFs and CDFs, Measures of Center and Spread

Mathematical goals. Starting points. Materials required. Time needed


MBA 611 STATISTICS AND QUANTITATIVE METHODS

Foundation of Quantitative Data Analysis

Week 1. Exploratory Data Analysis

Topic 9 ~ Measures of Spread

Diagrams and Graphs of Statistical Data

Measurement with Ratios

Module 4: Data Exploration

Chapter 1: Exploring Data

Introduction to Environmental Statistics. The Big Picture. Populations and Samples. Sample Data. Examples of sample data

Lecture 2. Summarizing the Sample

430 Statistics and Financial Mathematics for Business

Week 11 Lecture 2: Analyze your data: Descriptive Statistics, Correct by Taking Log

a. mean b. interquartile range c. range d. median

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Dongfeng Li. Autumn 2010

Interpreting Data in Normal Distributions

Students summarize a data set using box plots, the median, and the interquartile range. Students use box plots to compare two data distributions.

Coins, Presidents, and Justices: Normal Distributions and z-scores

Content Sheet 7-1: Overview of Quality Control for Quantitative Tests

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

First Midterm Exam (MATH1070 Spring 2012)

Chapter 2 Data Exploration

determining relationships among the explanatory variables, and

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

MTH 140 Statistics Videos

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

AP STATISTICS REVIEW (YMS Chapters 1-8)

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Section 1.3 Exercises (Solutions)

Statistics. Measurement. Scales of Measurement 7/18/2012

STAT355 - Probability & Statistics

Scatter Plots with Error Bars

2. Filling Data Gaps, Data validation & Descriptive Statistics

Descriptive statistics parameters: Measures of centrality

THE BINOMIAL DISTRIBUTION & PROBABILITY

Basics of Statistics

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

The Dummy s Guide to Data Analysis Using SPSS

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Chapter 4: Average and standard deviation

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

CALCULATIONS & STATISTICS

Mind on Statistics. Chapter 2

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Algebra I Vocabulary Cards

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis

Introduction to Quantitative Methods

Fairfield Public Schools

Frequency Distributions

Descriptive Statistics and Measurement Scales

Def: The standard normal distribution is a normal probability distribution that has a mean of 0 and a standard deviation of 1.

Ch. 3.1 # 3, 4, 7, 30, 31, 32

Transcription:

Univariate Descriptive Statistics Displays: pie charts, bar graphs, box plots, histograms, density estimates, dot plots, stemleaf plots, tables, lists. Example: sea urchin sizes Boxplot Histogram Urchin Size (mm) 0 10 20 30 40 50 60 Number of Urchins 0 10 20 30 40 50 60 0 10 20 30 40 50 60 70 Urchin Size (mm) Dot Plot Density Density 0.000 0.005 0.010 0.015 0 10 20 30 40 50 60 20 0 20 40 60 80 Urchin Size (mm) Urchin Size (mm) 13

Points: 1) Useful for quantitative variables. 2) Boxplot shows five point summary: minimum, first quartile, median, third quartile, maximum. 3) Dot Plot illegible with 250 data points. (1 dot for each size plotted on line.) 4) Histogram, density plot serve similar purposes. 5) Density goes below 0: bad. 6) Histogram doesn t show clustering density plot shows. 14

Example: Categorical: Weather in Central Park Pie Chart Bar Graph partly.cloudy clear cloudy 0 2 4 6 8 10 clear partly.cloudy cloudy Pie chart harder to read. General summary: Pie Charts are bad. More useful with more categories. Ordering of categories important for nominal variables. Cloudiness is ordinal. 15

Pie charts: wedge has area proportional to # of individuals in category. Bar chart: bar has height equal to # of individuals in category. Density estimates not discussed in this course. Histogram: 1) divide range of values into intervals. 2) Count numbers of individuals in each interval. 3) bar AREA is proportional to # of individuals in interval; width is length of interval. 4) equal width bars best then height proportional to # of individuals. 5) label x-axis; include units. 6) label y-axis. 16

Example: Personal Income for BC (ages 15+). (For those with income.) Source: 2001 Census. Adult Personal Income (BC) 0.00 0.01 0.02 0.03 0 20 40 60 80 100 Income ($000s) 17

Points 1) Bar widths unequal census tables given that way. 2) So take width times height to get area = fraction of population in that income group. 3) Last group on right open ended artificially cut off at $100,000 by me. 4) Plot is long-tailed to the right or skewed to the right. 5) Based on 20% sample of 1,523,720 people aged 15 + in BC on census day, 2001. 18

Comparison of 1996, 2001. 1996 Income Density 0 20 40 60 80 100 2001 Income Density 0 20 40 60 80 100 19

Summarizing the pictures. Purposes: less space in text than a graph; precise numerical comparison between groups. Summarizing a histogram: Where is centre of the x-axis values? Jargon: location or centre. How far do the x values extend on either side? Jargon: spread, variation, width. Is the picture symmetric or does it extend farther to right than left? Location and number of bumps. 20

Measures of location: Mean, Arithmetic Mean, Average, Arithmetic Average: total of x-values divided by number of x values. Histogram balances at mean. (First Moment in physics.) Think of See-Saw: small kid far from centre balances big kid close to centre. Formula: data X 1,..., X n. X = ni=1 X i n Utility of summation notation in this course: NIL. But X is standard notation for average of X. Median: number such that 1/2 of X values at least that large, and 1/2 of X values at least that small. Sort list: if n is odd median is middle of sorted list. If n is even take average of two middle values. 21

Numerical examples: ages in my family: 50,50,20,15,8,8. Ā = 50 + 50 + 20 + 15 + 8 + 8 6 = 151 6 25.2 Median age: middle numbers are 15, 20. Halfway between is median = 17.5. Mode: most common value. Not useful concept in most cases. Location of tallest bar in histogram (affected by definition of classes). Mode of ages is not unique: 50 or 8. Not useful summary of centre. 22

Comparison: Advantages of mean: 1) if your average weekly income is $100 you know how you will do in the long run; not so if median weekly income is $100. 2) Same point: average and sample size tells you total. 3) Has simpler mathematical behaviour than median. Advantages of median: Not influenced by extreme members of list. Median income, for instance, gives more information about typical person. 23

Measures of spread: Standard Deviation Interquartile Range Mean Absolute Deviation. Deviations from the mean: subtract mean from each number in list: X i X. For my family deviations are 24.8, 24.8, 5.2, 10.2, 17.2, 17.2. Summarize size of deviations: Average is 0. Not useful as measure of size since pluses cancel minuse. 24

Mean absolute deviation: take absolute values (ignore signs) and average 24.8 + 24.8 + 5.2 + 10.2 + 17.2 + 17.2 6 = 16.6 years Standard deviation: square deviations, average, take square root: (24.8) 2 + + ( 17.2) 2 s = 5 = 19.8 years. WARNING: notice the 5 not 6. This is Traditional. Not important in large data sets. Jargon: variance is s 2 : s 2 (24.8) 2 + + ( 17.2) 2 = 5 = 390.6 years 2 25

Interquartile Range: First define quartiles, quintiles, etc. First, second and third quartiles split list into 4 equal pieces. One quarter of list below first quartile, two quarters below second, three quarters below third. Second quartile is median. Interquartile range is third quartile minus first quartile. Book gives method to find quartiles. Quintiles split list into 5 equal parts. Percentiles split list into 100 equal parts. 26

Comparison: Advantages of IQR: like median not influenced by extremes. Easily related to proportions of population. But: rather than use 2 number summary (median, IQR) typically use 3 number summary (quartiles) or 5 number summary (min, max, quartiles). Boxplot is graph of 5 number summary. Advantages of Mean Absolute Deviation. Seems intuitive. Less influenced by extremes than Standard Deviation. But: poor mathematical properties. We mostly use Standard Deviation. 27

Why the Standard Deviation? Usual explanation: squares nicer mathematically than absolute values. Real explanation (WARNING: personal view): ONLY the SD works in normal approximations for sums. Normal approximations? A common summary for curves. Rule of thumb: in many lists of data about 2/3 of the observations are within 1 SD of the mean, about 95% within 2 SDs of the mean and almost all within 3 SDs of the mean. NEXT TOPIC: the normal curve. (bell curve, Gaussian) 28