BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Overview
The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential statistical methods, with a focus on using MATLAB for implementation. The four modules are:
- Introduction and descriptive statistics
- Probability distributions
- Hypothesis testing
- Correlation and regression
Each lecture will be supplemented with a MATLAB tutorial on the same topic. We will work through part or all of the tutorial after reviewing the concepts; anything we don't get to should be reviewed outside of class!

Statistics: a (very brief) introduction
- 1663: Natural and Political Observations upon the Bills of Mortality by John Graunt is published, motivated by the desire to base policy on demographic data
- 1700s: Laplace introduces the normal distribution and regression via his study of astronomy
- 1800s: Quetelet applies statistical analysis to human biology
The central purpose of statistics is to learn more about some population of interest (e.g., all humans in the world). However, we very rarely, if ever, have access to every individual in the population!
- sample: a subset of the entire population
- ???: a compilation of data about the entire population
With a sample in hand, we seek to either summarize that data (using descriptive statistics) or use the data to make some prediction or statement about the population (using inferential statistics).

The Central Dogma of Statistics
- Descriptive statistics: used to summarize data (this is the focus for today)
- Inferential statistics: used to make inferences about the population

Dimensionality of data sets
- Univariate: measurements made on one variable per subject. This will be the focus for modules 1-3.
- Bivariate: measurements made on two variables per subject.
- Multivariate: measurements made on many variables per subject.

Types of descriptive statistics
- Central tendency measures: computed to give a "center" around which the measurements in the data are distributed; also called measures of location (mean, median, mode, quartiles).
- Variation or variability measures: describe data spread, or how far the measurements are from the center (variance, standard deviation).
- Relative standing measures: describe the relative position of specific measurements in the data (percentiles).

Central tendency measures: mean
The sample mean (a.k.a. average): to calculate the average $\bar{x}$ of a set of observations, add their values and divide by the number of observations:
$$\bar{x} = \frac{x_1 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Central tendency measures: median
Median: the exact middle value of the sorted observations.
Calculation:
- If there is an odd number of observations, find the middle value
- If there is an even number of observations, find the mean of the middle two values
Example: Age of students: 17 19 21 22 23 23 23 28
Median = ave(22, 23) = 22.5
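As a quick check of the two definitions above, here is a minimal sketch in Python (used here only for illustration; the course itself works in MATLAB). It applies the standard-library `mean` and `median` functions to the student ages from the example. The variable names are illustrative assumptions.

```python
from statistics import mean, median

# Ages of students from the median example above
ages = [17, 19, 21, 22, 23, 23, 23, 28]

# Sample mean: sum of the values divided by the number of values
sample_mean = mean(ages)

# Median: with an even number of observations, the mean of the middle two
sample_median = median(ages)

print("mean:", sample_mean)
print("median:", sample_median)
```

Here the median is the average of the two middle ages (22 and 23), which matches the slide's result.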

Which central tendency measure is better?
In other words, which measure better approximates the center of a data set?
- Mean is best for symmetric distributions without outliers
- Median is useful for skewed distributions or data with (one-directional) outliers
[Figure: two example distributions, one with mean = 3.125 and median = 3, the other with mean = 4.857 and median = 4]

Scale: variance
The sample variance is (approximately) the average of the squared deviations of the values from the mean; the sum is divided by n - 1 rather than n:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
- Square the deviations to get rid of the negatives
- The result is that the contribution to the variance increases as you go farther from the mean in either direction

Scale: standard deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Procedure to obtain the sample standard deviation:
1. Score/measure the observations (in the units that are meaningful, let's say m/s)
2. Find the mean of the observations (m/s)
3. Find each score's deviation from the mean (m/s)
4. Square all those deviations ((m/s)^2)
5. Sum the squared deviations and divide by n - 1 ((m/s)^2; note that this is the variance)
6. Take the square root (m/s); now we have the starting units!
Let's do a simple example problem!
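The procedure above can be sketched step by step in Python (an illustration only; the course uses MATLAB). The speed values are hypothetical, made up for this example, and `sample_variance` is an illustrative helper name. The result is compared against the standard library's `variance` and `stdev`, which also divide by n - 1.

```python
import math
from statistics import stdev, variance


def sample_variance(values):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n                            # mean of the observations
    squared_devs = [(x - xbar) ** 2 for x in values]  # squared deviations
    return sum(squared_devs) / (n - 1)


# Hypothetical speeds in m/s (not from the lecture)
speeds = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s2 = sample_variance(speeds)  # units: (m/s)^2
s = math.sqrt(s2)             # units: m/s, back to the starting units

print("variance:", s2)
print("standard deviation:", s)
print("matches statistics module:",
      math.isclose(s2, variance(speeds)), math.isclose(s, stdev(speeds)))
```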

Central tendency measures: mode
The mode is the observation that occurs most frequently in a data set. Unlike the mean or median, the mode is not necessarily unique: the same maximum frequency may occur at different values.
Based on the previous slide, is the mode a parametric statistic? (Hint: remember that it is a measure of central tendency.)
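To illustrate that the mode need not be unique, here is a small Python sketch (illustration only; not the course's MATLAB). It uses `statistics.multimode`, which returns every value that attains the maximum frequency. The `tied_scores` list is a hypothetical data set.

```python
from statistics import multimode

# Ages of students from the median example: a single most frequent value
ages = [17, 19, 21, 22, 23, 23, 23, 28]
age_modes = multimode(ages)

# Hypothetical data in which two values share the maximum frequency
tied_scores = [1, 1, 2, 2, 3]
tied_modes = multimode(tied_scores)

print("mode(s) of ages:", age_modes)
print("mode(s) of tied data:", tied_modes)
```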

Scale: quartiles and IQR
- Q2 is the same as the median
- The first quartile (Q1) and third quartile (Q3) are the medians of the data sets that would be created if all of the values below and above Q2, respectively, were chosen
- The interquartile range (IQR) is Q3 - Q1

Quartiles example problem
Find the three quartiles and IQR of the following two datasets:
Dataset 1: 2 3 6 10 12 14 15 18
Q1 = 4.5, Median = 11, Q3 = 14.5, IQR = 10
Dataset 2: 2 5 9 11 13 15 19 21 24
Q1 = 7, Median = 13, Q3 = 20, IQR = 13
Note from this example that the 25% rule from the previous slide isn't precisely correct.
It is easiest to first locate the median; the lower and upper halves from which to find Q1 and Q3 should then be obvious.
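The "median of each half" rule can be sketched in Python as follows (illustration only; the course uses MATLAB, and built-in quantile functions in other tools may use different interpolation rules and give slightly different values). The helper name `quartiles` is an assumption of this sketch. When the number of observations is odd, the median itself is left out of both halves, as in the second dataset.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3, IQR) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values below the median position
    upper = data[half + n % 2:]  # skip the median itself when n is odd
    q1, q2, q3 = median(lower), median(data), median(upper)
    return q1, q2, q3, q3 - q1


dataset_1 = [2, 3, 6, 10, 12, 14, 15, 18]
dataset_2 = [2, 5, 9, 11, 13, 15, 19, 21, 24]

print("dataset 1:", quartiles(dataset_1))
print("dataset 2:", quartiles(dataset_2))
```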

Percentiles (aka quantiles)
Generally, the n-th percentile is a value such that n% of the observations fall at or below it:
- Q1 = 25th percentile
- Median (Q2) = 50th percentile
- Q3 = 75th percentile

Univariate data: histograms and bar plots
What's the difference between a histogram and a bar plot?
- Bar plot: used for categorical variables to show the frequency or proportion in each category. It translates the data from frequency tables into a pictorial representation.
- Histogram: used to visualize the distribution (shape, center, range, variation) of continuous variables. Bin size is important.

Effect of bin size on histogram
Simulated data: 1000 random numbers from the standard normal distribution N(0,1), with mean 0 and standard deviation 1.
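A rough way to see the effect of bin size without plotting is to count how the same sample is split up under different bin widths. The Python sketch below is an illustration only (the course uses MATLAB); the seed, the bin widths, and the helper name `bin_counts` are all assumptions chosen for this example.

```python
import math
import random
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall in each bin [k*w, (k+1)*w)."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return dict(sorted(counts.items()))


random.seed(0)  # fixed seed so the run is repeatable
sample = [random.gauss(0, 1) for _ in range(1000)]  # 1000 draws from N(0,1)

for width in (2.0, 0.5, 0.1):
    counts = bin_counts(sample, width)
    print(f"bin width {width}: {len(counts)} non-empty bins, "
          f"tallest bin holds {max(counts.values())} values")
```

Wide bins hide the shape of the distribution, while very narrow bins leave few observations per bin and make the histogram look noisy.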

More on histograms
What's the difference between a frequency histogram and a density histogram?

More on histograms
So, for our roughly Gaussian distribution from earlier, the density histogram looks like this:
[Figure: density histogram; x-axis: observation (1 to 5), y-axis: relative frequency (0 to 0.4); annotations: mean = 3.125, median = 3 and mean = 4.857, median = 4]
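One common convention, assumed here, is that a frequency histogram plots the raw count in each bin, while a density histogram divides each count by (number of observations × bin width) so that the total bar area equals 1. The Python sketch below illustrates this convention (illustration only; not the course's MATLAB). The data, bin edges, and the helper name `histogram` are hypothetical.

```python
import math


def histogram(values, edges):
    """Return (counts, densities) for bins defined by consecutive edges."""
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are [left, right); the last bin also includes its right edge
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    n = len(values)
    densities = [c / (n * (edges[i + 1] - edges[i]))
                 for i, c in enumerate(counts)]
    return counts, densities


# Hypothetical observations and bin edges (not from the lecture)
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
edges = [0.5, 2.5, 4.5, 6.5]

counts, densities = histogram(data, edges)
area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities))

print("frequencies:", counts)
print("densities:", densities)
print("total area of density histogram:", area)
```

With a bin width of 1, the density equals the relative frequency, which may be why the slide's y-axis is labeled that way.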

Stem and leaf plots

Box and whisker plots
An outlier is a score that lies more than 1.5 × IQR above the upper quartile or more than 1.5 × IQR below the lower quartile.

Example problem
Two different classes take a quiz and get the following scores.
Class 1: 2, 4, 6, 8, 10, 12, 14
Class 2: 2, 2, 3, 8, 8, 10, 23
What are the mean and median of each set? The same!
Will making a box and whisker plot of each set of data give us a better picture of their distributions? (Let's do the second one together.)

Box plot procedure
Steps to make our box plot:
1. Find the median, Q1, Q3, and IQR
2. Draw 3 horizontal lines, at Q1, the median, and Q3
3. Draw the corresponding vertical lines to make the boxes
4. Compute the lower inner fence (Q1 - 1.5*IQR) and the upper inner fence (Q3 + 1.5*IQR)
5. Draw a whisker downward from Q1 to the lower inner fence or the minimum, whichever comes first
6. Draw a whisker upward from Q3 to the upper inner fence or the maximum, whichever comes first
7. Compute the lower outer fence (Q1 - 3*IQR) and the upper outer fence (Q3 + 3*IQR)
8. Mild outliers fall between the inner and outer fences; mark with O
9. Extreme outliers fall outside the outer fences; mark with *
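The numerical part of the procedure above (everything except the drawing) can be sketched in Python for the two quiz classes. This is an illustration only, not the course's MATLAB tutorial; `box_plot_summary` is an illustrative helper name, and the quartiles use the median-of-halves rule from the quartiles slide. Whisker ends follow the slide's rule: the inner fence or the data extreme, whichever comes first.

```python
from statistics import mean, median


def box_plot_summary(values):
    """Compute quartiles, fences, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])
    q3 = median(data[half + n % 2:])  # skip the median itself when n is odd
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return {
        "mean": mean(data),
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "inner_fences": inner,
        "outer_fences": outer,
        # Whisker stops at the fence or the data extreme, whichever comes first
        "whiskers": (max(data[0], inner[0]), min(data[-1], inner[1])),
        "mild_outliers": [x for x in data
                          if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]],
        "extreme_outliers": [x for x in data if x < outer[0] or x > outer[1]],
    }


class_1 = [2, 4, 6, 8, 10, 12, 14]
class_2 = [2, 2, 3, 8, 8, 10, 23]

summary_1 = box_plot_summary(class_1)
summary_2 = box_plot_summary(class_2)

for name, summary in (("Class 1", summary_1), ("Class 2", summary_2)):
    print(name, summary)
```

Both classes share the same mean and median, but only the second class has a score beyond the upper inner fence.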

Now let's switch over and do some work in MATLAB!

Probability Distributions http://www-users.york.ac.uk/~pml1/bayes/cartoons/cartoon08.jpg

The Central Dogma of Statistics
The population, i.e., the probability distribution (this is the focus for today)

Probability distributions
We've discussed that data can be normally distributed (a.k.a. Gaussian or "bell-shaped"); in fact, many real-life variables are, including height and IQ:
http://sixminutes.dlugan.com/wp-content/uploads/2010/02/height-bell-curve.jpg
http://www.davinciinstitute.com/wp-content/uploads/2011/11/voter-iq-bell-curve.jpg
But what does this mean mathematically?

Probability distributions
A probability distribution, which can be either discrete or continuous, is a table (discrete) or mathematical function (continuous) of one or more variables that describes the likelihood that any given value (discrete) or set of values (continuous) will occur.
Because the entire population is characterized, the main usefulness is in calculating the probability that certain values (discrete) or a range of values (continuous) will occur.
First, let's examine a couple of discrete cases (we'll then move to the continuous case).

What is the probability distribution of rolling a die?
If all outcomes are equally likely (i.e., if the die is "fair"), then the probability distribution is:

x_i    P(x_i)
1      1/6
2      1/6
3      1/6
4      1/6
5      1/6
6      1/6

P(1) = ? Note the total probability is 1!
We use X (upper case) to denote an individual from the population. For example, P(X = 2) = 1/6.
If written as a function, we call it the probability mass function (pmf).
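The table above can be written as a small lookup structure. The Python sketch below is an illustration only (the course uses MATLAB); it stores the pmf of a fair die with exact fractions and checks that the probabilities sum to 1. The variable names and the "even roll" example are assumptions added for illustration.

```python
from fractions import Fraction

# pmf of a fair six-sided die: every face has probability 1/6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

total = sum(pmf.values())          # total probability
p_two = pmf[2]                     # P(X = 2)
p_even = pmf[2] + pmf[4] + pmf[6]  # P(X is even), adding disjoint outcomes

print("total probability:", total)
print("P(X = 2):", p_two)
print("P(X even):", p_even)
```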

What is the probability distribution of a random number generator?
Say you have a program (e.g., rand in MATLAB) that picks a real number between 0 and 1 (the "uniform distribution"):
$$f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$$
Since we still need our total area to equal 1, what must the value of f(x) be (i.e., at what y-axis value is the upper line in the graph)?
f(x) is the probability density function (pdf); it is the continuous analog of the pmf.
Here, we run into a problem: if x can be any real number, what must be the probability of observing a given value, i.e., what is P(X = x) for any continuous distribution? Unlike in the discrete case, P(X = x) = 0 in the continuous case.

What is the probability distribution of a random number generator (cont.)?
In the continuous case, we instead care about the probability of a randomly selected value X from the distribution being within a certain range of values.
$$f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x \le 1 \\ 1, & x > 1 \end{cases}$$
Look at f(x) above; if we want P(0.25 < X < 0.75), how can we evaluate this mathematically?
Integrating the pdf gives the cumulative distribution function (cdf), or F(x), which is evaluated at the desired limits: P(a < X < b) = F(b) - F(a).
Let's do an example: what is P(0.25 < X < 0.75)?
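The example can be worked two ways: by evaluating the cdf at the limits, and by integrating the pdf numerically over the same range. The Python sketch below is an illustration only (the course uses MATLAB); the function names and the midpoint-rule integrator are assumptions of this sketch, not part of the lecture.

```python
import math


def uniform_pdf(x):
    """pdf of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def uniform_cdf(x):
    """cdf of the uniform distribution on [0, 1]."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f from a to b with the midpoint rule."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h


# P(0.25 < X < 0.75) = F(0.75) - F(0.25)
p_from_cdf = uniform_cdf(0.75) - uniform_cdf(0.25)

# Same probability as the area under the pdf between the limits
p_from_pdf = integrate(uniform_pdf, 0.25, 0.75)

print("from cdf:", p_from_cdf)
print("from integrating pdf:", p_from_pdf)
```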

The mean and variance of continuous random variables
There are many different continuous probability distributions (here are a few examples we will see):
- Normal
- Uniform
- Exponential
- Parabolic
Every distribution has a unique:
- Expected value: E(X) = μ, a weighted average of all the possible values that this random variable can take on
- Variance: V(X) = σ^2, a measure of the spread, or the extent to which values in the distribution are dispersed
If we know the pdf of a given distribution, we can calculate its mean and variance!

Expected value of random variables
Let's re-visit our die problem: on average, what is the expected value of a roll, given the die goes from 1 to 6 (hint: it's not one of the integers)?
Mathematically, how would you fill in the parentheses below to arrive at the same answer?
$$E(X) = \sum_{i=1}^{6} (\quad)(\quad)$$
How can we express the same concept in the continuous case?
$$E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
Note that these limits will vary depending on the distribution, according to where f(x) is non-zero.
Let's try this out for our uniform distribution example and the generalized case!
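Both forms of the expected value can be checked numerically. The Python sketch below is an illustration only (the course uses MATLAB). It assumes the standard definition in which each value is weighted by its probability, uses exact fractions for the die, and approximates the integral for the uniform distribution on [0, 1] with a midpoint rule. The helper name `expected_value` is an assumption of this sketch.

```python
import math
from fractions import Fraction

# Discrete case: weight each face of a fair die by its probability
die_expectation = sum(x * Fraction(1, 6) for x in range(1, 7))


def expected_value(pdf, a, b, steps=100_000):
    """Approximate the integral of x * pdf(x) over [a, b] (midpoint rule)."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += x * pdf(x)
    return total * h


# Continuous case: uniform pdf f(x) = 1 on [0, 1], so the limits are 0 and 1
uniform_mean = expected_value(lambda x: 1.0, 0.0, 1.0)

print("E(X) for a fair die:", die_expectation)
print("E(X) for uniform(0, 1):", uniform_mean)
```

The integration limits are narrowed to [0, 1] because the uniform pdf is zero everywhere else.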
