Descriptive Statistics: Qualitative data, Quantitative data, Graphical methods, Numerical methods



Qualitative data
Data are classified in categories and are non-numerical (although they may be coded numerically).
Elements:
  Class: each of the categories used to classify the data
  Frequency: number of cases in each class
  Relative frequency: frequency divided by the total number of cases
  Class percentage: relative frequency multiplied by 100

Example: Aphasia data set

Subject  Type of aphasia    Subject  Type of aphasia
1        Broca's            12       Broca's
2        Anomic             13       Anomic
3        Anomic             14       Broca's
4        Conduction         15       Anomic
5        Broca's            16       Anomic
6        Conduction         17       Anomic
7        Conduction         18       Conduction
8        Anomic             19       Broca's
9        Conduction         20       Anomic
10       Anomic             21       Conduction
11       Conduction         22       Anomic

Example: Aphasia
Number of cases: 22
Classes: Anomic, Broca's, Conduction
Frequencies: Anomic: 10; Broca's: 5; Conduction: 7
Relative frequencies: Anomic: 0.45; Broca's: 0.23; Conduction: 0.32
Class percentages: Anomic: 45%; Broca's: 23%; Conduction: 32%

In table form:

Class                 Anomic   Broca's   Conduction   Total
Frequency             10       5         7            22
Relative frequency    0.45     0.23      0.32         1.00
Class percentage      45%      23%       32%          100%
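As a minimal sketch of how the frequency table for the aphasia example could be computed, the following Python code counts the classes and derives relative frequencies and percentages. The function name `frequency_table` is an illustrative choice, not part of the original slides.

```python
from collections import Counter

# Type of aphasia for subjects 1..22, in subject order (from the data set).
aphasia = [
    "Broca's", "Anomic", "Anomic", "Conduction", "Broca's", "Conduction",
    "Conduction", "Anomic", "Conduction", "Anomic", "Conduction",
    "Broca's", "Anomic", "Broca's", "Anomic", "Anomic", "Anomic",
    "Conduction", "Broca's", "Anomic", "Conduction", "Anomic",
]

def frequency_table(values):
    """Return {class: (frequency, relative frequency, percentage)}."""
    n = len(values)
    counts = Counter(values)
    return {c: (f, f / n, 100 * f / n) for c, f in sorted(counts.items())}

table = frequency_table(aphasia)
for cls, (freq, rel, pct) in table.items():
    # One row per class: frequency, relative frequency, class percentage.
    print(f"{cls:<12}{freq:>4}{rel:>8.2f}{pct:>7.0f}%")
```

Rounded to two decimals, the relative frequencies agree with the table in the example.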

Summarizing the data
We will use numerical and graphical methods.

Graphical methods
Bar graphs: the height of the bar may represent the frequency, the relative frequency, or the percentage
Pie charts: relative frequencies are represented by fractions of the total area
Pareto diagrams: bar graphs with the classes ordered by size

Bar graph

Pie chart

Quantitative variables
A variable is quantitative if it represents a measure given on a meaningful numerical scale: age, height, time, length, concentration, pressure, ...
Again, we can summarize the data using graphical and numerical methods.

A few graphical methods for quantitative variables
We want to put some order in the set of numbers to get an idea of the size of the numbers and their spread.
Methods:
  Stem-and-leaf displays: put together all the numbers that have the same first digit, and order them by the second digit
  Dot plots
  Histograms

Stem and leaf
Group the numbers by their leading digits (all digits but the last).
List each group's leading digits (the stem) only once, vertically.
Write the last digits (the leaves), in order, within each group.

Example
From the collection of numbers
225 228 252 228 237 237 240 198 240 210 210 210 228 198 228 240 192
240 240 192 210 225 228 231 210 225 264 204 240 240 210 255 237 207
first we will get the display on the left (leaves in order of appearance), and then the one on the right (leaves sorted):

19 | 8822          19 | 2288
20 | 47            20 | 47
21 | 000000        21 | 000000
22 | 58888585      22 | 55588888
23 | 7717          23 | 1777
24 | 0000000       24 | 0000000
25 | 25            25 | 25
26 | 4             26 | 4
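The grouping step of a stem-and-leaf display can be sketched in a few lines of Python. This is only an illustration for three-digit integers like those in the example; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

data = [225, 228, 252, 228, 237, 237, 240, 198, 240, 210, 210, 210,
        228, 198, 228, 240, 192, 240, 240, 192, 210, 225, 228, 231,
        210, 225, 264, 204, 240, 240, 210, 255, 237, 207]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digits)."""
    groups = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # e.g. 228 -> stem 22, leaf 8
        groups[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(groups.items())}

display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(stem, "|", "".join(str(d) for d in leaves))
```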

Dot plots
Dot plots display a dot for each observation.
Dots for repeated values are aligned next to each other.
Lines of dots representing consecutive values are placed next to each other.
We may have too many distinct values to plot. In that case the values are grouped in classes before plotting.

Example
Use the data from exercise 2.182 on hearing loss to build the dot plot. The coding goes from 1 (hearing within normal limits) to 7 (severe-to-profound loss).
6 7 1 1 2 6 4 6 4 2 5 2 5 1 5 4 6 6 5 5 5 2 5 3 6 4 6 6 4 2
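A text-only dot plot for the hearing-loss codes can be sketched as follows. The layout (one row per code, one dot per observation) is an assumption made for illustration; the original slide presumably showed a graphical version.

```python
from collections import Counter

hearing = [6, 7, 1, 1, 2, 6, 4, 6, 4, 2, 5, 2, 5, 1, 5,
           4, 6, 6, 5, 5, 5, 2, 5, 3, 6, 4, 6, 6, 4, 2]

def dot_plot_rows(values):
    """Return (value, count) pairs for every integer code from min to max."""
    counts = Counter(values)
    return [(v, counts.get(v, 0)) for v in range(min(values), max(values) + 1)]

rows = dot_plot_rows(hearing)
for value, count in rows:
    # One dot per observation with this code.
    print(f"{value} | " + "." * count)
```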

Histograms
Histograms are graphs of the frequency or relative frequency of a variable on an interval (class interval).
Usually:
  Class intervals are placed on the horizontal axis
  Class intervals have a length proportional to their width (in the units in which the variable is measured)
  In most cases we will use intervals of equal length
  The frequencies or relative frequencies are marked on the vertical axis

Example
A paper published in 1903 reported the lengths of cuckoo (Cuculus canorus) eggs, classified by the species of the nest where they were found. The following measurements, in mm, are the lengths of the eggs found in meadow pipit (Anthus pratensis) nests.
19.65 20.05 20.65 20.85 21.65 21.65 21.65 21.85 21.85
21.85 22.05 22.05 22.05 22.05 22.05 22.05 22.05 22.05
22.05 22.05 22.25 22.25 22.25 22.25 22.25 22.25 22.25
22.25 22.45 22.45 22.45 22.65 22.65 22.85 22.85 22.85
22.85 23.05 23.25 23.25 23.45 23.65 23.85 24.25 24.45

Example
The values are already ordered from lowest to highest.
We have 45 values, so we will use 10 intervals of equal length. As a rule of thumb: divide the number of values by 5; never take more than 20 intervals.
The range is 24.45 − 19.65 = 4.80 ≈ 5.00, so the intervals will have a length of 0.5 mm.
Count the number of cases in each interval:

Interval (mm)    Count
19.50 to 20.00   1
20.00 to 20.50   1
20.50 to 21.00   2
21.00 to 21.50   0
21.50 to 22.00   6
22.00 to 22.50   21
22.50 to 23.00   6
23.00 to 23.50   4
23.50 to 24.00   2
24.00 to 24.50   2

We construct bars of these heights on the corresponding intervals.
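The counting step for the egg-length histogram can be sketched in Python as below. The helper `histogram_counts` is a hypothetical name, and the sketch assumes every value lies inside the covered range and none falls exactly on an interval boundary (true for these data).

```python
import math

eggs = ([19.65, 20.05, 20.65, 20.85] + [21.65] * 3 + [21.85] * 3
        + [22.05] * 10 + [22.25] * 8 + [22.45] * 3 + [22.65] * 2
        + [22.85] * 4 + [23.05] + [23.25] * 2
        + [23.45, 23.65, 23.85, 24.25, 24.45])

def histogram_counts(values, start, width, n_bins):
    """Count the values falling in each of n_bins equal-width intervals."""
    counts = [0] * n_bins
    for x in values:
        k = math.floor((x - start) / width)  # index of the interval containing x
        counts[k] += 1
    return counts

counts = histogram_counts(eggs, start=19.5, width=0.5, n_bins=10)
for k, c in enumerate(counts):
    lo = 19.5 + 0.5 * k
    print(f"{lo:.2f} to {lo + 0.5:.2f}: {'#' * c}")
```

The list `counts` reproduces the interval counts given in the example.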

Descriptive numerical measures
For central tendency: mean, median, mode
For variability: range, sample variance, sample standard deviation, interquartile range

Sample mean
The average of the numbers in the sample.
If the list of numbers is $x_1, x_2, \ldots, x_n$, the mean is
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Example: $x_1 = 3.2,\ x_2 = 4.3,\ x_3 = 5.4,\ x_4 = 3.1,\ x_5 = 2.7$
$\sum x = 18.7$, so $\bar{x} = 18.7 / 5 = 3.74$

Example
Is this mean representative of the given list of numbers?
$x_1 = x_2 = \cdots = x_9 = 1,\ x_{10} = 100$
$\sum x = 109$, so $\bar{x} = 10.9 \approx 11$

Median
Given a list of numbers, a median is any number that divides the ordered list in two equal parts.
Formally, $M$ is a median of the list $x_1, x_2, \ldots, x_n$ if
1. the number of elements in the list that are greater than or equal to $M$ is at least $n/2$, AND
2. the number of elements in the list that are less than or equal to $M$ is at least $n/2$.
If there is more than one median, usually the average of the largest and the smallest is chosen as the median.

Example
The only median of
$x_1 = 3.2,\ x_2 = 4.3,\ x_3 = 5.4,\ x_4 = 3.1,\ x_5 = 2.7$
is 3.2.
However, all the numbers in the interval $[3.2, 3.8]$ are medians of the list
$x_1 = 3.2,\ x_2 = 4.3,\ x_3 = 5.4,\ x_4 = 3.1,\ x_5 = 2.7,\ x_6 = 3.8$
although usually $\frac{3.2 + 3.8}{2} = 3.5$ is taken as the median.
Observe the difference in finding the median when the list has an even or an odd number of elements.

Example
For the set of numbers $x_1 = x_2 = \cdots = x_9 = 1,\ x_{10} = 100$, i.e.
1 1 1 1 1 1 1 1 1 100
the only median is 1: there are 9 ($\geq 10/2$) elements in the list that are less than or equal to 1, and there are 10 ($\geq 10/2$) elements in the list that are greater than or equal to 1.
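The mean and median examples can be checked with a short Python sketch. The function `is_median` is a hypothetical helper that tests the formal definition directly; `statistics.median` from the standard library applies the usual convention of averaging the two middle values when the list has an even length.

```python
from statistics import mean, median

def is_median(m, values):
    """Check the formal definition: at least n/2 values >= m and at least n/2 values <= m."""
    n = len(values)
    at_least = sum(1 for x in values if x >= m)
    at_most = sum(1 for x in values if x <= m)
    return at_least >= n / 2 and at_most >= n / 2

five = [3.2, 4.3, 5.4, 3.1, 2.7]
six = five + [3.8]
skewed = [1] * 9 + [100]

print(mean(five), median(five))      # mean and the single median of the odd-length list
print(median(six))                   # conventional median of the even-length list
print(is_median(3.3, six))           # any number in [3.2, 3.8] satisfies the definition
print(mean(skewed), median(skewed))  # the mean is pulled up by the value 100, the median is not
```

The last line illustrates why the mean of the list with one very large value is not representative, while the median still describes the bulk of the data.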

Mode
Definition: the mode of a list of numbers is the most frequent number in the list.
In the examples:
  In the last list of numbers, the mode is 1, which appears 9 times in the list
  In the egg length example, the mode is 22.05, which appears 10 times in the list

Modal class
Sometimes, when the measure of the variable is "continuous", there may be no repetitions, but when the data are grouped in intervals to draw a histogram, one of these intervals is more frequent than the others. This interval is called the modal class.
The center point of the modal class is sometimes taken as the mode.
In the egg length example, the modal class is the interval [22.0, 22.5]. Using only the information from the histogram, we would take the mode as 22.25.
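As a rough illustration, the mode and the modal class of the egg-length data can be found with `collections.Counter`. The interval start (19.5) and width (0.5) are the ones used in the histogram example; the variable names are illustrative only.

```python
import math
from collections import Counter

eggs = ([19.65, 20.05, 20.65, 20.85] + [21.65] * 3 + [21.85] * 3
        + [22.05] * 10 + [22.25] * 8 + [22.45] * 3 + [22.65] * 2
        + [22.85] * 4 + [23.05] + [23.25] * 2
        + [23.45, 23.65, 23.85, 24.25, 24.45])

# Mode: the most frequent individual value and how often it appears.
mode_value, mode_times = Counter(eggs).most_common(1)[0]

# Modal class: the most frequent histogram interval.
start, width = 19.5, 0.5
bins = Counter(math.floor((x - start) / width) for x in eggs)
k, class_freq = bins.most_common(1)[0]
modal_class = (start + k * width, start + (k + 1) * width)
midpoint = sum(modal_class) / 2  # center of the modal class

print(mode_value, mode_times)
print(modal_class, class_freq, midpoint)
```

Note that the mode of the raw values and the midpoint of the modal class need not coincide, as this data set shows.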

Relationships
If the data are symmetric: Mean = Median
If there are a few values much bigger than the rest: Mean > Median. We say that the data are skewed to the right.
If there are a few values much smaller than the rest: Mean < Median. We say that the data are skewed to the left.

Numerical measures of variability
Range
Sample variance (sample standard deviation)
Quartiles
Interquartile range
Percentiles

Range
Definition: the range of a list of numbers is simply the highest value minus the lowest value in the list.

Sample variance
To measure the spread of the data around the center, we look at the distances from the values to the center and average them in a special form.
Definition: the sample variance of the list of numbers $x_1, x_2, \ldots, x_n$ is
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Observe:
  the square in the notation: $s^2$
  the square in each term of the sum: $(x_i - \bar{x})^2$
  the average of the $n$ terms is taken dividing by $n - 1$

Standard deviation
The unit of measure of the sample variance is the square of the unit of the data. To get an idea of the spread in the units of the variable, take the square root.
Sample standard deviation: $s = \sqrt{s^2}$
A simplifying formula:
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \left( \sum_{i=1}^{n} x_i^2 \right) - n \bar{x}^2$
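A minimal Python sketch of the sample variance, computed both from the definition and from the simplifying formula, using the five numbers from the sample mean example. The function names are illustrative; the standard library's `statistics.variance` and `statistics.stdev` compute the same quantities.

```python
from math import sqrt

def sample_variance(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Simplifying formula: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(values)
    xbar = sum(values) / n
    return (sum(x * x for x in values) - n * xbar ** 2) / (n - 1)

x = [3.2, 4.3, 5.4, 3.1, 2.7]
s2 = sample_variance(x)
s = sqrt(s2)  # sample standard deviation, in the units of the data
print(round(s2, 3), round(s, 3))
print(round(sample_variance_shortcut(x), 3))
```

Both functions give the same value up to floating-point rounding. In practice the definition is numerically safer, because the shortcut subtracts two large, nearly equal numbers when the mean is large compared with the spread.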

Extracting info from the SD
The SD tells us how far from the mean the data points are. We can use two distinct criteria to interpret its value:
The Chebyshev rule, very conservative and valid for any data set:
  The fraction of points within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2} = \frac{k^2 - 1}{k^2}$
The normal rule, valid for special types of symmetric data:
  68% of the data are within one SD of the mean
  95% of the data are within two SDs of the mean
  More than 99% of the data are within three SDs of the mean
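To see the Chebyshev rule at work, the sketch below compares the observed fraction of points within k standard deviations of the mean with the guaranteed lower bound, using the strongly skewed list 1, ..., 1, 100 from the mean example. The helper names are hypothetical.

```python
from statistics import mean, stdev

def fraction_within(values, k):
    """Observed fraction of values within k sample SDs of the mean."""
    m, s = mean(values), stdev(values)
    inside = sum(1 for x in values if abs(x - m) <= k * s)
    return inside / len(values)

def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on that fraction, valid for any data set."""
    return 1 - 1 / k ** 2

skewed = [1] * 9 + [100]
for k in (2, 3):
    print(k, fraction_within(skewed, k), chebyshev_bound(k))
```

For k = 2 the observed fraction is 0.9, above the Chebyshev bound of 0.75 but below the 95% suggested by the normal rule, which does not apply to data this asymmetric.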

Percentiles
Definition: for any number $p$, $0 < p < 100$, we say that $P$ is the $p$th percentile of a data set if
1. the number of elements in the set that are less than or equal to $P$ is at least $p\%$ of the data set, AND
2. the number of elements in the set that are greater than or equal to $P$ is at least $(100 - p)\%$ of the data set.

Quartiles
The 25th percentile is called the first or lower quartile (Q1).
The 75th percentile is called the third or upper quartile (Q3).
The median coincides with the 50th percentile, which is also the second or middle quartile.
The percentiles that are a multiple of 10 are called deciles.
Interquartile range: IQR = Q3 − Q1
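The percentile definition can be turned into code almost literally. In the sketch below, `is_percentile` checks the two conditions, and `percentile` returns the midpoint of the smallest and largest data values that satisfy them. That midpoint rule is an assumption borrowed from the convention stated for the median; statistical packages use several other interpolation rules and may give slightly different quartiles.

```python
def is_percentile(P, p, values):
    """Check both conditions of the definition using exact integer comparisons."""
    n = len(values)
    at_most = sum(1 for x in values if x <= P)
    at_least = sum(1 for x in values if x >= P)
    return 100 * at_most >= p * n and 100 * at_least >= (100 - p) * n

def percentile(p, values):
    """Midpoint of the data values that qualify as p-th percentiles (assumed convention)."""
    candidates = [x for x in values if is_percentile(x, p, values)]
    return (min(candidates) + max(candidates)) / 2

six = [3.2, 4.3, 5.4, 3.1, 2.7, 3.8]
q1, q2, q3 = (percentile(p, six) for p in (25, 50, 75))
iqr = q3 - q1
print(q1, q2, q3, round(iqr, 2))
```

For this list the second quartile agrees with the median 3.5 found in the median example.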

Example

Outliers
Outliers are unusually large or small measurements in a data set. Outliers may be due to one of the following causes:
1. The measurement is incorrect.
2. The measurement does not correspond to the same population.
3. The measurement corresponds to an odd event in the population.
Normally we will consider as potential outliers those values that are more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.

Boxplots
A boxplot, or box-and-whiskers diagram, gives an idea of the spread and shape of the data set.
To make a boxplot we need:
  The five-number summary: Min, Q1, Median, Q3, Max
  The interquartile range: the difference IQR = Q3 − Q1
  The "inner" and "outer" fences
  The outliers
Then:
  Draw the median
  Draw Q1 and Q3 (the "hinges") and close the box
  Mark the outliers
  Find the min and max after discarding the outliers
  Draw the whiskers

The TILLRATIO data set: ratio Al / Be

#    Location   Ratio
1    UMRB-1     3.75
2    UMRB-1     4.05
3    UMRB-1     3.81
4    UMRB-1     3.23
5    UMRB-1     3.13
6    UMRB-1     3.30
7    UMRB-1     3.21
8    UMRB-2     3.32
9    UMRB-2     4.09
10   UMRB-2     3.90
11   UMRB-2     5.06
12   UMRB-2     3.85
13   UMRB-2     3.88
14   UMRB-3     4.06
15   UMRB-3     4.56
16   UMRB-3     3.60
17   UMRB-3     3.27
18   UMRB-3     4.09
19   UMRB-3     3.38
20   UMRB-3     3.37
21   SWRA       2.73
22   SWRA       2.95
23   SWRA       2.25
24   SD         2.73
25   SD         2.55
26   SD         3.06
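The numbers needed for a boxplot of the 26 ratios can be sketched in Python as follows. The quartiles follow the percentile definition given earlier, with the midpoint convention used for the median (an assumption; SPSS and other packages may use slightly different quartile rules), and the fences are the 1.5 × IQR limits from the outliers slide. The function and variable names are illustrative.

```python
ratios = [3.75, 4.05, 3.81, 3.23, 3.13, 3.30, 3.21,   # UMRB-1
          3.32, 4.09, 3.90, 5.06, 3.85, 3.88,         # UMRB-2
          4.06, 4.56, 3.60, 3.27, 4.09, 3.38, 3.37,   # UMRB-3
          2.73, 2.95, 2.25,                           # SWRA
          2.73, 2.55, 3.06]                           # SD

def percentile(p, values):
    """Midpoint of the data values that satisfy the percentile definition."""
    n = len(values)
    ok = [x for x in values
          if 100 * sum(v <= x for v in values) >= p * n
          and 100 * sum(v >= x for v in values) >= (100 - p) * n]
    return (min(ok) + max(ok)) / 2

q1, med, q3 = (percentile(p, ratios) for p in (25, 50, 75))
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles; values outside are potential outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sorted(x for x in ratios if x < low_fence or x > high_fence)

# Whiskers reach the smallest and largest values that are not outliers.
kept = [x for x in ratios if low_fence <= x <= high_fence]
whiskers = (min(kept), max(kept))

print("five-number summary:", min(ratios), q1, med, q3, max(ratios))
print("fences:", round(low_fence, 3), round(high_fence, 3))
print("outliers:", outliers, "whiskers:", whiskers)
```

Under these assumptions only the largest ratio, 5.06, falls outside the fences, so the upper whisker stops at 4.56.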

SPSS output: Stem-and-leaf

SPSS: Descriptive Statistics

SPSS: Box and whiskers 1

SPSS: Box and whiskers 2

Bivariate data
In the ALWINS data we expect a "positive" relationship between the two variables. One way to find out whether this relationship exists is to plot the data in what is called a scatterplot. We draw two axes and plot each case as a point on the plane, choosing one of the variables as x and the other as y. We then observe whether the points roughly line up.

Scatterplot The data in ALWINS gives the following plot