Descriptive Statistics



Similar documents
Descriptive Statistics

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Variables. Exploratory Data Analysis

AP Statistics Solutions to Packet 2

Week 3&4: Z tables and the Sampling Distribution of X

3: Summary Statistics

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Exploratory data analysis (Chapter 2) Fall 2011

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

Normal distribution. ) 2 /2σ. 2π σ

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

6.2 Normal distribution. Standard Normal Distribution:

Introduction to Environmental Statistics. The Big Picture. Populations and Samples. Sample Data. Examples of sample data

The Normal Distribution

Non-Parametric Tests (I)

6.4 Normal Distribution

Exercise 1.12 (Pg )

Week 1. Exploratory Data Analysis

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Geostatistics Exploratory Analysis

consider the number of math classes taken by math 150 students. how can we represent the results in one number?

5/31/ Normal Distributions. Normal Distributions. Chapter 6. Distribution. The Normal Distribution. Outline. Objectives.

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

16. THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Exploratory Data Analysis

You flip a fair coin four times, what is the probability that you obtain three heads.

Lecture 1: Review and Exploratory Data Analysis (EDA)

How To Test For Significance On A Data Set

Lecture 2. Summarizing the Sample

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

6 3 The Standard Normal Distribution

Normal Distribution. Definition A continuous random variable has a normal distribution if its probability density. f ( y ) = 1.

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

4. Continuous Random Variables, the Pareto and Normal Distributions

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

EXAM #1 (Example) Instructor: Ela Jackiewicz. Relax and good luck!

Chapter 4. iclicker Question 4.4 Pre-lecture. Part 2. Binomial Distribution. J.C. Wang. iclicker Question 4.4 Pre-lecture

Point and Interval Estimates

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Chapter 3. The Normal Distribution

Northumberland Knowledge

4.3 Areas under a Normal Curve

CALCULATIONS & STATISTICS

The Normal distribution

THE BINOMIAL DISTRIBUTION & PROBABILITY

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

Lesson 7 Z-Scores and Probability

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

8. THE NORMAL DISTRIBUTION

Lesson 4 Measures of Central Tendency

1 Nonparametric Statistics

Chapter 4. Probability and Probability Distributions

Means, standard deviations and. and standard errors

MATH 140 Lab 4: Probability and the Standard Normal Distribution

Confidence Intervals for the Difference Between Two Means

Characteristics of Binomial Distributions

Mind on Statistics. Chapter 2

Normality Testing in Excel

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

Simple Regression Theory II 2010 Samuel L. Baker

Comparing Means in Two Populations

Diagrams and Graphs of Statistical Data

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

AP * Statistics Review. Descriptive Statistics

Using SPSS, Chapter 2: Descriptive Statistics

Sampling Distributions

5.1 Identifying the Target Parameter

2. Filling Data Gaps, Data validation & Descriptive Statistics

Key Concept. Density Curve

First Midterm Exam (MATH1070 Spring 2012)

Section 1.3 Exercises (Solutions)

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Objectives. 6.1, 7.1 Estimating with confidence (CIS: Chapter 10) CI)

Mark. Use this information and the cumulative frequency graph to draw a box plot showing information about the students marks.

Non Parametric Inference

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

Probability Distributions

Statistics 2014 Scoring Guidelines

How To Write A Data Analysis

DATA INTERPRETATION AND STATISTICS

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

Chapter 5: Normal Probability Distributions - Solutions

Lecture 5 : The Poisson Distribution

Lesson 20. Probability and Cumulative Distribution Functions

Exploratory Data Analysis. Psychology 3256

Interpretation of Somers D under four simple models

Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes

Normal Distribution Lecture Notes

Measurement with Ratios

Mathematical goals. Starting points. Materials required. Time needed

Mark Scheme (Results) June GCSE Mathematics (1380) Paper 3H (Non-Calculator)

How Does My TI-84 Do That

Dongfeng Li. Autumn 2010

II. DISTRIBUTIONS distribution normal distribution. standard scores

Unit 7: Normal Curves

Transcription:

Descriptive Statistics Suppose following data have been collected (heights of 99 five-year-old boys) 117.9 11.2 112.9 115.9 18. 14.6 17.1 117.9 111.8 16.3 111. 1.4 112.1 19.2 11. 15.4 99.4 11.1 13.3 16.9 18.2 119.3 112. 16.2 15.9 16.9 19.3 15.9 11. 16.7 18.5 17.7 114.3 18.6 14.6 113.7 116.7 13.5 96.1 11.8 97.2 19.6 11.5 15.9 16.2 17.4 114.9 11.3 14.8 99.2 119.2 111.4 13. 11.1 15.8 11.5 15.9 17.6 97.1 113.3 19.4 19.4 11.8 16.3 18.1 19.6 12.4 11.4 11.1 115.3 12.9 111.2 99.4 15.7 119.5 19.3 112.8 18.2 117. 16.8 15.4 18.7 19.2 97.1 13.3 18.8 116.3 115.5 114.9 11.1 14.1 11.8 112.7 15.6 99.9 111.1 19.4 19.1 11.7 Initial impression is of indigestibility 1

Put data into ascending order Sorted data 96.1 11.5 15.4 16.3 18.1 19.3 11.3 111.8 115.3 97.1 12.4 15.4 16.3 18.2 19.4 11.4 112. 115.5 97.1 12.9 15.6 16.7 18.2 19.4 11.5 112.1 115.9 97.2 13. 15.7 16.8 18.5 19.4 11.7 112.7 116.3 99.2 13.3 15.8 16.9 18.6 19.6 11.8 112.8 116.7 99.4 13.3 15.9 16.9 18.7 19.6 11.8 112.9 117. 99.4 13.5 15.9 17.1 18.8 11. 11.8 113.3 117.9 99.9 14.1 15.9 17.4 19.1 11.1 111. 113.7 117.9 1.4 14.6 15.9 17.6 19.2 11.1 111.1 114.3 119.2 11. 14.6 16.2 17.7 19.2 11.1 111.2 114.9 119.3 11.1 14.8 16.2 18. 19.3 11.2 111.4 114.9 119.5 Much more can now be seen but hardly succinct: need simple summaries 2

all values are between 96.1 cm and 119.5 cm middle value is 18.7 cm - indication of location values quarter of way up and down data are 15.6 and 111.1 cm - indication of spread these comprise the five number summary for these data Middle value is known as the Median Also upper and lower quartiles Definitions need some care 3

Median is middle value. Definition of Median For five values there is a definite middle one But not for four values? 4

Definition of Median Median is the middle value of a sample when the sample size is odd Median is the average of the two middle values when sample size is even Definitions for quartiles are similar and given more precisely in appendix 1 of handout (see web page) 5

Graphical displays of the data Display the five figure summary - Box and Whisker plots 12 Top of box is upper quartile Height (cm) 11 Line across box is at level of median 1 Bottom of box is lower quartile Whiskers can extend to maximum and minimum but 6

Alternative Box and Whisker plots 12 Whiskers extend to only a given multiple of the box height Height (cm) 11 This is intended to allows outliers to be seen 1 7

Histograms A useful way to display all the data in a sample: Divide range into bins and plot numbers/fractions in bins Choose bin width judiciously.8.4.6.3 Proportion.4 Proportion.2.2.1 9 1 11 12 Height (cm) 9 1 11 12 Height (cm).8.2.6.15 Proportion.4 Proportion.1.2.5 9 1 11 12 Height (cm) (2, 6 15 and 5 bins: clockwise) 9 1 11 12 Height (cm) 8

Histograms for increasing sample size Samples of sizes 99, 3, 1, 1 99 heights 3 heights.2.15.15.1 Proportion.1 Proportion.5.5 9 1 11 12 Height (cm) 1 heights 9 1 11 12 13 Height (cm) 1 heights.8.8.6.6 Proportion.4 Proportion.4.2.2 9 1 11 12 13 Height (cm) 9 1 11 12 13 Height (cm) Form gets more regular as sample size increases - tends to a limiting form - notion of a population and inferential statistics 9

Populations, Parameters and Inference Sample rarely of interest in its own right Value lies in its representativeness of a greater whole, about which we wish to learn Inferential Statistics seeks to draw conclusions about a population from a sample Sample must be representative of population - random selection Population often defined in terms of unknown parameters µ, σ Estimated by sample analogues or statistics, such as m or s 1

. 9. Describing Populations Distributional form: we consider only the Normal distribution: symmetric bell-shaped curve Associated parameters: measure of location µ (mean) measure of spread σ (SD) σ 95 1 15 11 115 12 µ Height (cm) (e.g.) 11

Sample Estimates of µ and σ Estimate µ by sample mean (arithmetic mean - simple average ) m x 1 + = x 2 +... + x n where x 1, x 2,,x n are the values in the sample (n.b. size n) Estimate of σ is sample standard deviation (SD): n s = 2 2 2 ( x1 m) + ( x2 m) + + ( xn m) n 1 For sample of 99 boys m = 18.34 cm s = 5.21 cm (note units) {cf median 18.7 cm quartiles, 15.6 cm, 111.1 cm IQR = 5.5 cm} 12

little need these days to work directly with these formulae indeed, using a computer is preferable as well as laboursaving appendix 2 describes some peculiarities of SD formula for those interested mean and SD appropriate summaries for a Normal distribution what about non-normal data? 13

Skewed data Medians and quartiles - always legitimate tools means and SDs - less easy to give general answers.4.3 Fraction.2.1 1 2 3 Cumulative dose of phototherapy, J/cm2 Non-Normal data, especially skewed unduly mean influenced by large values in sample 14

Quantitative use of the Normal curve Interpretable feature of curve is area under curve.9 Shaded area = Probability a child's height < X (= 112 cm here).75.6.45.3.15. 9 1 11 12 Height (cm) Area under whole curve is 1: area under curve up to X is P x Probability randomly chosen member of population has value < X 15

Cumulative probabilities Given X, and µ, σ the value of P can be found but you need a computer P is the cumulative probability at X In example µ = 18 cm σ = 4.7 cm and X = 112 cm Gives P =.826 I.e. about 8% of five year old boy have height < 112 cm 16

Some obvious results: Other probabilities probability value is larger than X is 1-P I.e..1974 or about 2% of boys have height above 112cm Symmetry means that it follows that proportion.1974 of boys have height below 14 cm (14 cm is 4 cm below mean, and 112 cm was 4 cm above mean) Allows virtually all other probabilities to be found 17

.9.75.6.45 Q = probability height < 11cm = mean - 7 cm =.68 Q = probability height > 115cm = mean + 7 cm =.68.3.15. 9 1 11 12 Height (cm) µ = 18 cm Can also ask inverse question: given P what is X? For example, 3% of boys have height less than what? I.e. P =.3 and X = 99.16 cm X is the inverse cumulative probability of P. 18

Z-scores P depends on X but there is a simpler structure underlying the relationship. X can be expressed as being so many SDs from the mean I.e. X = µ + Zσ and P depends on X only through Z Allows quicker apprehension of the Normal distribution (e.g. 95% within ± 2 SDs of mean, 68% within ± 1 SD of mean) Z 1 2 1.96.675 2.58 P.5.84.977.975.75.995 Proportion within Z SDs of mean (=2P-1).68.954.95.5.99 19

Why is the Normal Distribution Important? Empirically important - many variables are Normal There are reasons why some variables are Normal e.g. polygenic control - see appendix 3 Means are more Normal 2

Distribution of sample means Population Sample size 1.15.1.1 Fraction Fraction.5.5 1 2 3 X Sample size 5 2 4 6 8 1 Sample means Sample size 1.1.1 Fraction.5 Fraction.5 2 3 4 5 6 Sample means 3 4 5 Sample means Figure 8: histograms of 1 observations form a population and of 5 sample means for samples of size 1, 5 and 1. 21

Assessing Normality Quite a preoccupation - not as necessary as might be thought cf. Distribution-free or non-parametric methods Many methods that assume Normality are robust (or other types of assumption are more important) Assessing Normality is not straightforward and its importance depends on purpose of analysis 22

Assessing Normality - a Quick and Dirty method About 2½% of population lies below µ - 2σ About 16% of population lies below µ - σ If variable is positive and µ < σ then Normal distribution highly questionable If variable positive and µ 2σ then Normal distribution might be just about OK If value lies inbetween then appropriateness should be judged accordingly Handy method when reading papers - not at all foolproof 23

Plot data: I histogram Histogram is obvious way to see if shape of distribution of sample is close to that of Normal curve In small samples this is not easy:.8.6.4 Expt==1 Expt==2 Expt==3.2 Samples of size 1 from distribution known to be Normal Fraction.8.6.4.2.8.6.4.2 Expt==4 Expt==7 Expt==5 Expt==8 Expt==6 Expt==9 6 8 1 12 6 8 1 12 6 8 1 12 X Histograms 24

Plot data: II Normal Probability Plot Normal Probability Plot for C1 99 95 9 Mean: StDev: 18.338 5.2545 8 Percent 7 6 5 4 3 2 1 5 1 95 1 15 11 115 12 Data Plot ordered sample values against appropriate y-value If data are Normal then straight line will result 25

8 7 6 5 4 3 2 1 7 7. 5 8. 8. 5 9. 9. 5 1. 1. 5 1 1. 1 1. 5 1 2. 1 2. 5 1 3 How does a Normal probability plot work? Depends on expected positions for ordered values in a Normal sample 5 25 1 Sample size 5 4 3 2 1 µ 2σ µ σ µ µ+σ Ordered variable µ+2σ 26

Normal plot for non-normal data Normal Probability Plot for C1.4 99 Mean: StDev: 3.7 3.1.3 95 9 Percent 8 7 6 5 4 3 Fraction.2 2 1.1 5 1-5 5 Data 1 15 5 1 15 2 Variable 27