Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion


Statistics as a Tool for LIS Research Importance of statistics in research Summarize observations to provide answers to research questions and hypotheses Make general conclusions based on specific study observations Objectively evaluate reliability of study conclusions

Statistics as a Tool for LIS Research Main purposes of statistics in research Describe central point in a set of data/observations Describe how broad, diversified, or variable the data in a set is Indicate whether specific features of a set of data are related, and how closely they are related Indicate probability of features of data being influenced by factors other than simply chance

Statistics as a Tool for LIS Research Two main types or branches of statistics Descriptive statistics Characterizing or summarizing data set Presenting data in charts and tables to clarify characteristics No inference, just describing a particular group of observations Inferential statistics Using sample data to make generalizations (inferences) or estimates about a population Statements made in terms of probability

Statistics as a Tool for LIS Research Descriptive and inferential statistics not mutually exclusive Overlap in what can be called descriptive and what can be called inferential Intent is important: Group of observations intended to describe an event: descriptive Group of observations collected from a sample and intended to predict what a larger population is like: inferential

Statistics as a Tool for LIS Research Choosing statistical methods Type of data collected largely determines choice of statistical analysis techniques Decisions about how and what type of data is collected will determine the specific statistical tests that can be performed to analyze the data Data collected should determine statistical tests used, not the other way around But consideration of how you want to analyze data should be done as part of research design to ensure study can produce the type of conclusions you want to make

Descriptive Statistics Commonly used in LIS research Cannot test causal relationships Primary strength is describing and summarizing data: Describing data in terms of frequency distributions Describing most typical value in data set - measures of central tendency Describing variability of data - measures of dispersion

Frequency Distributions Describing data in terms of frequency distributions Counts of totals by value or category for each measured variable Can be presented as absolute totals, cumulative totals, percentages, grouped totals Often a first step in statistical analysis of data Usually presented in tables or charts (histogram, bar graph, etc.)
[Figure: bar chart of books checked out (y-axis, 0 to 80) by age group (x-axis: 0-10, 11-20, 21-40, 41-60, 61+)]
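A frequency table like the one described above can be sketched in a few lines of Python. The ages below are hypothetical illustration data (the slide's actual chart values are not recoverable), and the helper names `age_group` and `frequency_table` are assumptions made for this sketch. It shows absolute totals, cumulative totals, and percentages for grouped data.

```python
from collections import Counter

# Hypothetical patron ages (illustrative only, not taken from the slides)
ages = [8, 15, 15, 23, 35, 35, 35, 47, 52, 70]

# Age groups matching the chart's x-axis labels: (label, low, high)
bins = [("0-10", 0, 10), ("11-20", 11, 20), ("21-40", 21, 40),
        ("41-60", 41, 60), ("61+", 61, None)]

def age_group(age):
    """Return the label of the age group that contains `age`."""
    for label, low, high in bins:
        if age >= low and (high is None or age <= high):
            return label
    raise ValueError(f"age {age} does not fall in any group")

def frequency_table(counts, labels):
    """Build rows of (label, count, cumulative count, percentage)."""
    total = sum(counts.values())
    rows, running = [], 0
    for label in labels:
        n = counts.get(label, 0)
        running += n
        rows.append((label, n, running, 100 * n / total))
    return rows

counts = Counter(age_group(a) for a in ages)
table = frequency_table(counts, [label for label, _, _ in bins])

for label, n, cumulative, percent in table:
    print(f"{label:>6}  count={n}  cumulative={cumulative}  percent={percent:.0f}%")
```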

Describing most typical value in data set - measures of central tendency Mean is often referred to as average though average can be any of these measures of central tendency: Mean (arithmetic average) Median Mode

Mean Most popular statistic for summarizing data Can be used for interval or ratio data Based on all observations of the data set Arithmetic average of a set of observations Example: mean of 5, 10, and 30 is 15, since 45 / 3 = 15 Mean of a set of numbers can be a number not in set Example: mean of 1, 2, 3, and 4 is 2.5, since 10 / 4 = 2.5

Formula for the mean
$\bar{X} = \frac{\sum X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$
where
$\bar{X}$ = mean of sample
$\mu$ = mean of population
$\Sigma$ = sigma, sum of values
$X$ = set of observations
$X_i$ = specific observations
$n$ or $N$ = number of observations
Gary Geisler, Simmons College, LIS 403, Spring 2004

Median Value that is above the lower one-half and below the upper one-half of the values -- middle value of set of observations when they have been arranged in order Can be used for ordinal, interval or ratio data Most central measure of a distribution Every data set has a median that is unique Calculated differently for sets with an odd number of observations (the middle value) than for sets with an even number (the mean of the two middle values) Example: median of the five observations 1, 3, 15, 16, and 17 = 15 Example: median of the six observations 1, 2, 3, 5, 8, and 9 = 4, the mean of 3 and 5

Mode Can be used for any type of data Most frequently occurring value among a set of observations Examples: Mode of the observations 1, 2, 2, 3, 4, 5 = 2 Set of observations 1, 2, 3, 4, 5 has no mode Set of observations 1, 2, 3, 3, 4, 5, 5 has no single mode, but can be considered to have two modes, or is bi-modal
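The three measures can be checked against the slide examples with Python's standard `statistics` module. This is a minimal sketch: the helper `modes` is a hypothetical name introduced here so that a set with no repeated value reports no mode, which is the convention the slides use (the library's own `multimode` would instead return every value in that case).

```python
import statistics
from collections import Counter

def modes(data):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # slide convention: no repeated value means no mode
    return [value for value, count in counts.items() if count == highest]

# Mean examples
print(statistics.mean([5, 10, 30]))           # 45 / 3
print(statistics.mean([1, 2, 3, 4]))          # 10 / 4

# Median examples: odd and even numbers of observations
print(statistics.median([1, 3, 15, 16, 17]))
print(statistics.median([1, 2, 3, 5, 8, 9]))  # mean of 3 and 5

# Mode examples: one mode, no mode, bi-modal
print(modes([1, 2, 2, 3, 4, 5]))
print(modes([1, 2, 3, 4, 5]))
print(modes([1, 2, 3, 3, 4, 5, 5]))
```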

Advantages of mean Always exists Is unique Can always be calculated by a simple formula Disadvantages of mean Mean value for a data set is not necessarily one of the values of the data set Sensitive to extreme scores, either high or low Easily distorted by extremely large or extremely small values among the set of observations Example: mean of 1, 2, and 1,000,000 is 333,334.33

Advantages of median Not affected by extreme scores Useful way of describing sets of observations that are skewed by including extremely large or small values Disadvantages of median Median is not necessarily one of the values of the data set Defined differently for odd and even numbers of observations

Advantages of mode Can be used with any scale of measurement If set of observations has a mode, the mode usefully characterizes the set For example, a large set of observations noting the result of rolling two dice will tend to have a mode of 7 Disadvantages of mode Many sets of observations lack a mode because no observed value occurs more than once Other sets of observations may have several different most frequent values Doesn't characterize set beyond most frequently occurring value
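The dice example can be checked by counting every equally likely outcome of two six-sided dice. This sketch enumerates theoretical outcomes rather than simulating rolls, so it shows which sum is most likely; an actual set of observed rolls would only tend toward this mode.

```python
from collections import Counter
from itertools import product

# Count how many of the 36 equally likely (die1, die2) outcomes give each sum
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

most_common_sum, ways = sums.most_common(1)[0]
print(f"Most likely sum: {most_common_sum} ({ways} of 36 outcomes)")
```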

Calculating mean

Age   Frequency   Age x Frequency
13    3           13 x 3 = 39
14    4           14 x 4 = 56
15    6           15 x 6 = 90
16    8           16 x 8 = 128
17    4           17 x 4 = 68
18    3           18 x 3 = 54
19    3           19 x 3 = 57

N = 31 Sum of X = 492 492/31 = 15.87 Mean = 15.87
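The same calculation can be sketched in Python as a frequency-weighted mean. The dictionary `freq` restates the age frequencies used in the slide's arithmetic; the variable names are illustrative.

```python
# Age -> frequency, as used in the slide's calculation
freq = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

n = sum(freq.values())                                    # number of observations
total = sum(age * count for age, count in freq.items())   # sum of X
mean = total / n

print(f"N = {n}, Sum of X = {total}, Mean = {mean:.2f}")
```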

Calculating mode Age 16 has the highest frequency (8 of the 31 observations) Mode = 16

Calculating median Non-grouped data

Age   Positions of values in sorted order
13    1-3
14    4-7
15    8-13
16    14-21
17    22-25
18    26-28
19    29-31

N = 31 so midpoint is 16th value Median = 16
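For non-grouped data, the mode and median can be read off by expanding the frequency table back into its 31 sorted observations. A minimal sketch, reusing the frequencies from the mean calculation (3, 4, 6, 8, 4, 3, 3 for ages 13 to 19):

```python
import statistics

freq = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

# Expand the table into one sorted list of individual observations
observations = [age for age, count in sorted(freq.items()) for _ in range(count)]

mode = max(freq, key=freq.get)             # age with the highest frequency
median = statistics.median(observations)   # 16th of the 31 sorted values

print(f"N = {len(observations)}, Mode = {mode}, Median = {median}")
```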

Calculating median Grouped data: Each value is somewhere within each age range Values are assumed to be equally distributed within range

Age   Positions of values in sorted order
13    1-3
14    4-7
15    8-13
16    14-21
17    22-25
18    26-28
19    29-31

N = 31 so midpoint is 16th value, which falls in the age 16 range

Position        14     15     16     17     18     19     20     21
Assumed value   16.06  16.19  16.31  16.44  16.56  16.69  16.81  16.94

Median = 16.31
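The grouped-data median can be sketched as a linear interpolation inside the age range that contains the midpoint. This assumes, as the slide's table implies, that age 16 covers the interval from 16 up to 17 and that values are spread evenly within it. The function name `grouped_median` is hypothetical. Python's built-in `statistics.median_grouped` is deliberately not used here, because it centers each interval on the recorded value and would therefore give a different answer.

```python
def grouped_median(freq, width=1):
    """Interpolated median, treating each value v as the interval [v, v + width)."""
    n = sum(freq.values())
    half = n / 2
    cumulative = 0  # observations counted before the current interval
    for value in sorted(freq):
        count = freq[value]
        if cumulative + count >= half:
            # Fraction of the way through this interval where the midpoint falls
            return value + (half - cumulative) / count * width
        cumulative += count
    raise ValueError("frequency table is empty")

freq = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}
print(f"Grouped median = {grouped_median(freq):.2f}")
```

With N = 31, half of N is 15.5, and 13 observations fall below age 16, so the median lies (15.5 - 13) / 8 of the way through the age 16 interval.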

Mean = 15.87 Mode = 16 Median = 16.31

Normal distribution Normal curve, bell-shaped curve, Gaussian distribution Many types of data are approximately normally distributed in a population Histogram of data approximates a bell-shaped, symmetrical curve Concentration of scores in the middle, with fewer and fewer scores as you approach extremes Example: heights of people in a population are approximately normally distributed

Skewness Not all sets of data will exhibit properties of a normal distribution Some data sets are asymmetrical around a central point Majority of scores are closer to one extreme or the other: skewed distribution In a skewed distribution, the mean does not equal the median

Positively skewed distribution: tail goes to the right, median is less than the mean Example: annual income of population
Negatively skewed distribution: tail goes to the left, mean is less than the median
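The relationship between mean and median in skewed data can be illustrated with two small data sets. Both lists are hypothetical values chosen for this sketch: `incomes` has one very large value (a long right tail) and `scores` has one very small value (a long left tail).

```python
import statistics

incomes = [20, 25, 30, 35, 40, 200]   # hypothetical, in thousands; long right tail
scores = [10, 85, 90, 92, 95, 98]     # hypothetical; long left tail

for name, data in (("incomes", incomes), ("scores", scores)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    direction = "positive" if mean > median else "negative"
    print(f"{name}: mean = {mean:.2f}, median = {median}, suggests {direction} skew")
```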

Special case of skewness: J-Curve Extreme skewness Proposed by Allport to describe conforming behavior in groups of people Large majority of scores fall at end representing socially acceptable behavior, small minority represent deviation from norm Example: amount of time drivers who park in No Parking zone stay there
[Figure: J-curve chart, y-axis from 0 to 100, x-axis categories: < 5, 5 to 10, 10 to 15, 15 to 20, 20 to 25, > 25]

Determining when a distribution is skewed too much to be considered normal General rule of thumb: values beyond 2 standard errors of skewness (ses) are probably significantly skewed Approximate ses = $\sqrt{6/N}$, or use ses statistic from software (SPSS, for example) output Example: if sample size = 30 and skewness statistic is .9814: ses = $\sqrt{6/30} = \sqrt{.20} = .4472$ 2 ses = .4472 x 2 = .8944 Skewness statistic of .9814 is beyond 2 ses, so is significantly skewed Other factors (histograms, normal probability plots, type of test to be used) should influence decision, depending on exact circumstances of analysis
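The rule of thumb can be sketched as a small check. It uses the approximation ses = sqrt(6/N) from the slide; the function names are hypothetical, and statistical software may report a slightly different, exact standard error.

```python
import math

def standard_error_of_skewness(n):
    """Approximate standard error of skewness, sqrt(6 / N)."""
    return math.sqrt(6 / n)

def is_significantly_skewed(skewness, n):
    """Rule of thumb: skewness beyond 2 standard errors is probably significant."""
    return abs(skewness) > 2 * standard_error_of_skewness(n)

ses = standard_error_of_skewness(30)
print(f"ses = {ses:.4f}, 2 ses = {2 * ses:.4f}")
print(is_significantly_skewed(0.9814, 30))
```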

Kurtosis - amount of peakedness or flatness of the distribution Mesokurtic - normal Leptokurtic - peaked, many scores around middle Platykurtic - flat, many scores dispersed from middle Non-normal kurtosis determined by similar process to skewness Non-normal kurtosis only a concern with some statistical tests

Selecting appropriate measure of central tendency Interactive selection at Selecting Statistics by William M.K. Trochim: http://trochim.human.cornell.edu/selstat/ssstart.htm Rules below can be bent, depending on situation

Distribution and data type                     Measure
Unimodal, ratio or interval data, skewed       median
Unimodal, ratio or interval data, not skewed   mean
Unimodal, ordinal                              median
Unimodal, nominal                              mode
Bi-modal or multi-modal distribution           mode
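The selection rules can be written as a small lookup function. This is only a sketch of the rules listed on the slide, which the slide itself says can be bent; the function name `suggest_measure` and its parameters are hypothetical.

```python
def suggest_measure(scale, skewed=False, n_modes=1):
    """Suggest a measure of central tendency from the slide's rules of thumb.

    scale:   "nominal", "ordinal", "interval", or "ratio"
    skewed:  whether the distribution is noticeably skewed
    n_modes: number of modes in the distribution
    """
    if n_modes > 1:
        return "mode"      # bi-modal or multi-modal distribution
    if scale == "nominal":
        return "mode"
    if scale == "ordinal":
        return "median"
    if scale in ("interval", "ratio"):
        return "median" if skewed else "mean"
    raise ValueError(f"unknown scale of measurement: {scale!r}")

print(suggest_measure("ratio", skewed=True))
print(suggest_measure("interval"))
print(suggest_measure("ordinal"))
```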

Measures of Dispersion Variability is a fundamental characteristic of most data sets, but is not addressed by measures of central tendency Measures of central tendency are not enough to accurately describe a data set Also need to be able to describe the variability or dispersion of the data Dispersion: scatteredness or fluctuation of scores around average score Several types of measures of dispersion Range Standard deviation Variance

Measures of Dispersion Range Distance between the smallest and largest observations in a set of data Examples: Range of the set of observations 2, 4, 7 is 5 Range of the set -10, -3, 4 is 14

Measures of Dispersion Interquartile range Simplified version: ignore the top and bottom 25% after sorting Difference between the remaining largest and smallest numbers is interquartile range Addresses the problem of outliers Other methods of calculating interquartile range are slightly more complicated but take into account more data
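The range and the simplified interquartile range can be sketched as follows. The function names are hypothetical, and rounding the 25% cut down to a whole number of observations is an assumption of this sketch; the more complicated methods the slide mentions handle that case differently. The third data set is illustrative, chosen to show how one outlier affects the two measures.

```python
def value_range(data):
    """Distance between the smallest and largest observations."""
    return max(data) - min(data)

def simple_iqr(data):
    """Simplified IQR: drop the top and bottom 25% after sorting."""
    ordered = sorted(data)
    k = len(ordered) // 4                  # observations dropped from each end
    middle = ordered[k:len(ordered) - k]
    return middle[-1] - middle[0]

print(value_range([2, 4, 7]))
print(value_range([-10, -3, 4]))

# Illustrative data with one outlier (100)
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]
print(value_range(with_outlier), simple_iqr(with_outlier))
```

The outlier stretches the range to 99 but leaves the simplified interquartile range at 5, which is the sense in which the interquartile range addresses the problem of outliers.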

Measures of Dispersion Standard deviation Measures the variability or the degree of dispersion of the data set Square root of the average squared deviations from the mean Roughly speaking, standard deviation is the average distance between the individual observations and the center of the set of observations

Measures of Dispersion Calculating standard deviation
1. Subtract each observation from sample/population mean and square
2. Add squared distances
3. Divide sum by n - 1 (sample) or N (population) (adjusted mean of squared distances)
4. Take square root of mean squared distances
SD of sample: $s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$
SD of population: $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
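The four steps above can be sketched directly in Python. The function name `std_dev` and its `sample` flag are hypothetical; the standard library's `statistics.stdev` and `statistics.pstdev` compute the same two quantities.

```python
import math

def std_dev(data, sample=True):
    """Standard deviation following the four steps listed above."""
    n = len(data)
    mean = sum(data) / n
    # Step 1: subtract each observation from the mean and square the result
    squared = [(mean - x) ** 2 for x in data]
    # Step 2: add the squared distances
    total = sum(squared)
    # Step 3: divide by n - 1 for a sample, or N for a population
    divisor = n - 1 if sample else n
    # Step 4: take the square root
    return math.sqrt(total / divisor)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(f"sample s = {std_dev(data):.2f}")
print(f"population sigma = {std_dev(data, sample=False):.2f}")
```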

Measures of Dispersion Variance Square of standard deviation Not used for descriptive statistics, but is important for specific inferential statistics tests
Variance of sample: $s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$
Variance of population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$

Measures of Dispersion Advantages of range as measure of dispersion Very simple to calculate Provides a meaningful characteristic of a set of observations (total spread of the observations) Disadvantages of range as measure of dispersion Extreme values distort range Only measures the total spread; tells us nothing about the pattern of data distribution Examples: Data set 1, 2, 3, 4, 5, 6, 7, 8, 9 has a range of 8 Data set 1, 9, 9, 9, 9, 9, 9, 9, 9 also has range of 8, though clearly less scattered

Measures of Dispersion Advantages of standard deviation as measure of dispersion Can always be calculated Meaningful characteristic of a set of observations; takes every observation into account to express the scatteredness of observations Examples: Set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard deviation s = 2.74 Set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard deviation s = 2.67 Range doesn't distinguish difference in scatteredness of sets, but standard deviation does Disadvantage of standard deviation as measure of dispersion is that it is more complicated to calculate -- though not for computers
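The comparison between the two example sets can be reproduced with the standard library. This sketch uses `statistics.stdev`, the sample standard deviation (dividing by n - 1), which matches the values quoted on the slide.

```python
import statistics

set_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
set_b = [1, 9, 9, 9, 9, 9, 9, 9, 9]

for name, data in (("set A", set_a), ("set B", set_b)):
    spread = max(data) - min(data)     # range: identical for both sets
    sd = statistics.stdev(data)        # sample standard deviation
    print(f"{name}: range = {spread}, s = {sd:.2f}")
```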