STAT355 - Probability & Statistics




STAT355 - Probability & Statistics Instructor: Kofi Placid Adragni Fall 2011

Chap 1 - Overview and Descriptive Statistics
1.1 Populations, Samples, and Processes
1.2 Pictorial and Tabular Methods in Descriptive Statistics
1.3 Measures of Location
1.4 Measures of Variability

1.1 Populations, Samples,... The discipline of statistics provides methods for organizing and summarizing data and for drawing conclusions based on the information contained in the data. An investigation typically focuses on a well-defined collection of objects constituting a population of interest. When the desired information is available for all objects in the population, we have what is called a census. Often a census is impractical or infeasible; instead, a sample (a subset of the population) is selected in some prescribed manner.

1.1 Populations, Samples,... A variable is any characteristic whose value may change from one object to another in the population. Data results from making observations either on a single variable or simultaneously on two or more variables. A univariate data set consists of observations on a single variable. We have bivariate data when observations are made on each of two variables.

1.1 Populations, Samples,... Descriptive statistics help summarize and describe important features of the data. Some are graphical in nature: histograms, boxplots, scatter plots,... Others involve the calculation of numerical summary measures, such as means, standard deviations, and correlation coefficients. Software: R, S-Plus, Minitab, SAS, SPSS, JMP, Stata,...

1.1 Populations, Samples,... Scope of Modern Statistics
- molecular biology (analysis of microarray data, SNPs,...)
- ecology (describing quantitatively how individuals in various animal and plant populations are spatially distributed)
- materials engineering (studying properties of various treatments to retard corrosion)
- marketing (developing market surveys and strategies for marketing new products)
- public health (identifying sources of diseases and ways to treat them)
- civil engineering (assessing the effects of stress on structural elements and the impacts of traffic flows on communities)
- ...
Meanwhile, statisticians continue to develop new models for describing randomness and uncertainty, and new methodology for analyzing data.

1.1 Populations, Samples,... Data Collecting: Statistics deals not only with the organization and analysis of data once they have been collected but also with the development of techniques for collecting the data. Data that are not properly collected may be useless or misleading, so an appropriate sampling scheme must be used, for example: simple random sample, stratified random sample,...
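The two sampling schemes named on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions not in the slides: the population is a hypothetical list of 100 unit labels, the two strata ("low" and "high") are made up, and the stratified scheme shown draws the same number of units from every stratum, which is just one of several possible allocations.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_random_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of fixed size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

# Hypothetical population: 100 unit labels split into two made-up strata.
population = list(range(100))
strata = {"low": population[:50], "high": population[50:]}

srs = simple_random_sample(population, 10, seed=1)
stratified = stratified_random_sample(strata, 5, seed=1)
print(srs)
print(stratified)
```

The stratified draw guarantees that both groups are represented, whereas a simple random sample of the same total size may, by chance, over- or under-represent one of them.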

1.2 Pictorial and Tabular Methods in Descriptive Statistics Visual representation of data: Stem-and-Leaf Displays, Dotplots, Histograms, Boxplots, Frequency tables, Pie charts, Bar graphs, Scatter plots,...

1.2 Pictorial and Tabular Methods in Descriptive Statistics Stem-and-Leaf Displays
Example data:
0.0 -0.2 -1.1 -0.6 -2.3 0.5 -0.3 1.5 1.0 1.0
0.6 -1.1 -0.9 -1.2 0.3 1.4 0.4 -1.1 0.0 1.1
2.0 -0.2 0.3 -0.2 0.7 0.1 -0.8 0.3 0.3 0.4
Stem-and-leaf display (stem: ones digit, leaf: tenths digit; each stem is split in two):
  -2 | 3
  -1 |
  -1 | 2111
  -0 | 986
  -0 | 3222
   0 | 001333344
   0 | 567
   1 | 0014
   1 | 5
   2 | 0

1.2 Pictorial and Tabular Methods in Descriptive Statistics Histograms

Wednesday, Sept 7 To cover... 1.3 Measures of Location 1.4 Measures of Variability

1.3 Measures of Location Some measures of location are: Mean, Median, Quartiles, Percentiles, Trimmed Means. Data: Let X be the variable of interest. $x_1, x_2, \ldots, x_n$ are observations of X; n is the number of observations, also called the sample size.

1.3 Measures of Location Data example: Caustic stress corrosion cracking of iron and steel has been studied because of failures around rivets in steel boilers and failures of steam rotors. Let X be the crack length (µm) as a result of constant load stress corrosion tests on smooth bar tensile specimens for a fixed length of time.
$x_1 = 16.1$; $x_2 = 9.6$; $x_3 = 24.9$; $x_4 = 20.4$; $x_5 = 12.7$
$x_6 = 21.2$; $x_7 = 30.2$; $x_8 = 25.8$; $x_9 = 18.5$; $x_{10} = 10.3$
$x_{11} = 25.3$; $x_{12} = 14.0$; $x_{13} = 27.1$; $x_{14} = 45.0$; $x_{15} = 23.3$
$x_{16} = 24.2$; $x_{17} = 14.6$; $x_{18} = 8.9$; $x_{19} = 32.4$; $x_{20} = 11.8$; $x_{21} = 28.5$
The sample size is n = 21.

Mean The mean, or arithmetic average, of the set is the most familiar and useful measure of the center. Let $x_1, x_2, \ldots, x_n$ be a given set of numbers. The sample mean is denoted by $\bar{x}$. If the set is $y_1, \ldots, y_n$, the sample mean is $\bar{y}$.
Definition: The sample mean $\bar{x}$ of observations $x_1, x_2, \ldots, x_n$ is given by
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example: $x_1 = 16.1$; $x_2 = 9.6$; $x_3 = 24.9$; $x_4 = 20.4$; $x_5 = 12.7$. The mean is
$\bar{x} = (16.1 + 9.6 + 24.9 + 20.4 + 12.7)/5 = 83.7/5 = 16.74 \approx 16.7$ (1)
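As a quick check of the definition, the sample mean can be computed directly in Python. The sketch below uses the crack-length data from the previous slide; the function name `sample_mean` is just an illustrative choice. The five-value example works out to 83.7/5 = 16.74, which the slide reports as 16.7.

```python
def sample_mean(xs):
    """Arithmetic average: sum of the observations divided by their count."""
    return sum(xs) / len(xs)

# Crack lengths (micrometers) from the data example, n = 21.
crack_lengths = [16.1, 9.6, 24.9, 20.4, 12.7, 21.2, 30.2, 25.8, 18.5, 10.3,
                 25.3, 14.0, 27.1, 45.0, 23.3, 24.2, 14.6, 8.9, 32.4, 11.8, 28.5]

mean_first_five = sample_mean(crack_lengths[:5])  # the five values in the example
mean_all = sample_mean(crack_lengths)             # all 21 observations
print(round(mean_first_five, 2), round(mean_all, 2))
```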

Mean... The sample mean of $x_1, x_2, \ldots, x_n$ is $\bar{x}$. The population mean is often denoted by $\mu$. Let N be the total number of observations in the population. The population mean can be obtained as
$\mu = (\text{sum of the } N \text{ population values})/N$. (2)
There is more to this population mean! A general definition of $\mu$ that applies to both finite and (conceptually) infinite populations will be visited later. Just as $\bar{x}$ is an interesting and important measure of sample location, $\mu$ is an interesting and important (often the most important) characteristic of a population.

Median The sample median is the middle value once the observations are ordered from smallest to largest. Notation: Denote the observations by $x_1, \ldots, x_n$. The sample median is represented by $\tilde{x}$.
Definition: The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included so that every sample observation appears in the ordered list). Then,
$\tilde{x} = \begin{cases} \text{the } \left(\frac{n+1}{2}\right)\text{th ordered value} & \text{if } n \text{ is odd} \\ \text{the average of the } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2}+1\right)\text{th ordered values} & \text{if } n \text{ is even} \end{cases}$ (3)

Median Example: A sample of n = 12 recordings of Beethoven's Symphony #9 yields the following durations (min), listed in increasing order:
62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0
The sample median is the average of the n/2 = 6th and (n/2 + 1) = 7th values from the ordered list: $\tilde{x} = (66.4 + 67.4)/2 = 66.9$
Notes: If the largest observation, 79.0, were excluded from the sample, the resulting sample median for the n = 11 remaining observations would be the single middle value 66.4 (the (n + 1)/2 = 6th ordered value, i.e. the 6th value in from either end of the ordered list). The sample mean is $\bar{x} = 68.01$, a bit more than a full minute larger than the median.
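The odd/even rule in the definition translates directly into code. The following is a minimal sketch on the Beethoven durations; the helper name `sample_median` is an illustrative choice.

```python
def sample_median(xs):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of (n/2)th and (n/2+1)th

durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

median_all = sample_median(durations)              # n = 12, even case
median_without_max = sample_median(durations[:-1]) # n = 11, odd case
mean_all = sum(durations) / len(durations)
print(round(median_all, 2), median_without_max, round(mean_all, 2))
```

Dropping the largest duration moves the median only from 66.9 to 66.4, which illustrates how little the median reacts to a single extreme value.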

Median Remarks and Notation: The population median is denoted by $\tilde{\mu}$. The sample median is very insensitive to outliers. If the median salary for a sample of engineers were $\tilde{x} = 66{,}416$, we might use this as a basis for concluding that the median salary for all engineers exceeds 60,000. The population mean $\mu$ and median $\tilde{\mu}$ will not generally be identical. When they differ, in making inferences we must first decide which of the two population characteristics is of greater interest and then proceed accordingly.

Quartiles, Percentiles, and Trimmed Means Quartiles divide the data set (sample or population) into four equal parts. The first quartile Q1 separates the lower quarter from the upper three-quarters. The second quartile Q2 is the median. Observations above the third quartile Q3 constitute the upper quarter of the data set. Example: Beethoven's Symphony #9 data: Q1 = 64.80; Q2 = $\tilde{x}$ = 66.90; Q3 = 69.30
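Several conventions exist for computing quartiles, and the slide does not say which one was used. As an assumption, the sketch below uses linear interpolation at position (n - 1)p in the ordered sample (counting from zero), a common software default; on the Beethoven data it happens to reproduce the values quoted above. The function name `quantile_linear` is hypothetical.

```python
import math

def quantile_linear(xs, p):
    """Quantile by linear interpolation at zero-based position (n - 1) * p.

    Assumed convention; other definitions of quartiles can give slightly
    different values on the same data.
    """
    ordered = sorted(xs)
    pos = (len(ordered) - 1) * p
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

q1 = quantile_linear(durations, 0.25)
q2 = quantile_linear(durations, 0.50)
q3 = quantile_linear(durations, 0.75)
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

The same function gives any percentile, for instance `quantile_linear(durations, 0.99)` for the 99th.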

Quartiles, Percentiles, and Trimmed Means A data set (sample or population) can be even more finely divided using percentiles; the 99th percentile separates the highest 1% from the bottom 99%, and so on. The mean is quite sensitive to a single outlier, whereas the median is largely unaffected by outliers. A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$ with respect to robustness to outliers. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.

Mean, Median, Quartiles, Percentiles, and Trimmed Means Example: The production of Bidri is a traditional craft of India. Bidri wares (bowls, vessels,...) are cast from an alloy containing primarily zinc along with some copper. The following observations are on copper content (%) for a sample of Bidri artifacts:
2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3
3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1
Stem-and-leaf display (stem: ones digit, leaf: tenths digit):
   2 | 04566778
   3 | 012334466667
   4 | 4678
   5 | 3
   6 |
   7 |
   8 |
   9 |
  10 | 1
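A display like this one can be generated mechanically. The sketch below is one simple way to do it, assuming nonnegative data recorded to one decimal place, with the integer part as the stem and the tenths digit as the leaf; it does not handle negative values or split stems. The function name `stem_and_leaf` is an illustrative choice.

```python
from collections import defaultdict

def stem_and_leaf(xs):
    """Map each integer stem to its sorted tenths-digit leaves (as a string).

    Assumes nonnegative values with one decimal place. Stems with no
    observations between the smallest and largest stem map to "".
    """
    leaves = defaultdict(list)
    for x in xs:
        stem, leaf = divmod(round(x * 10), 10)
        leaves[stem].append(leaf)
    lo, hi = min(leaves), max(leaves)
    return {s: "".join(str(d) for d in sorted(leaves.get(s, [])))
            for s in range(lo, hi + 1)}

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

rows = stem_and_leaf(copper)
for stem, leaf_string in rows.items():
    print(f"{stem:>3} | {leaf_string}")
```

Keeping the empty stems 6 through 9 in the output is deliberate: the gap is what makes the value 10.1 stand out as an outlier.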

Mean, Median, Quartiles, Percentiles, and Trimmed Means A prominent feature is the single outlier at the upper end. The sample mean and median are 3.65 and 3.35, respectively. A trimmed mean with a trimming percentage of 100(2/26) = 7.7% results from eliminating the two smallest and two largest observations; this gives $\bar{x}_{tr(7.7)} = 3.42$. Trimming here eliminates the large outlier and so pulls the trimmed mean toward the median. A trimmed mean with a moderate trimming percentage (between 5% and 25%) will yield a measure of center that is neither as sensitive to outliers as the mean nor as insensitive as the median. If the desired trimming percentage is 100α% and nα is not an integer, the trimmed mean must be calculated by interpolation.
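The three measures of center for the Bidri data can be reproduced with a short script. This sketch only covers the case where a whole number k of observations is dropped from each end (here k = 2); the interpolation needed when nα is not an integer is not implemented. The function name `trimmed_mean` is an illustrative choice.

```python
def trimmed_mean(xs, k):
    """Mean after removing the k smallest and the k largest observations."""
    if k < 0 or 2 * k >= len(xs):
        raise ValueError("k must satisfy 0 <= 2k < n")
    ordered = sorted(xs)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

n = len(copper)
mean = sum(copper) / n
median = (sorted(copper)[n // 2 - 1] + sorted(copper)[n // 2]) / 2  # n is even
trimmed = trimmed_mean(copper, 2)  # trimming percentage 100 * 2 / 26, about 7.7%
print(round(mean, 2), round(median, 2), round(trimmed, 2))
```

The trimmed mean lands between the median and the mean, as the slide describes.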

Categorical Data and Sample Proportions When the data are categorical, a frequency distribution or relative frequency distribution provides an effective tabular summary of the data. Example: If a survey of individuals who own digital cameras is undertaken to study brand preference, then each individual in the sample would identify the brand of camera that he or she owned, from which we could count the number owning Canon, Sony, Kodak, and so on. Consider sampling a dichotomous population, one that consists of only two categories (such as voted or did not vote in the last election, does or does not own a digital camera, etc.). If we let x denote the number in the sample falling in category 1, then the number in category 2 is n - x. The relative frequency or sample proportion in category 1 is x/n, and the sample proportion in category 2 is 1 - x/n.

Categorical Data and Sample Proportions Let's denote a response that falls in category 1 by a 1 and a response that falls in category 2 by a 0. A sample of size n = 10 might then yield the responses 1, 1, 0, 1, 1, 1, 0, 0, 1, 1. The sample mean for this numerical sample is (since the number of 1s is x = 7)
$\frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 1}{10}$ (4)
$= \frac{7}{10} = \frac{x}{n}$. (5)
The sample proportion of observations in the category is the sample mean of the sequence of 1s and 0s. Thus a sample mean can be used to summarize the results of a categorical sample. Analogous to the sample proportion x/n of individuals or objects falling in a particular category, let p represent the proportion of those in the entire population falling in the category.
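The identity between the sample proportion and the mean of the 0/1 coding is easy to confirm numerically. A minimal sketch with the ten responses from the slide (variable names are illustrative):

```python
# 1 = response in category 1, 0 = response in category 2.
responses = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

n = len(responses)
count_category_1 = sum(responses)               # x, the number of 1s
proportion_category_1 = count_category_1 / n    # x / n
proportion_category_2 = (n - count_category_1) / n
mean_of_coding = sum(responses) / n             # sample mean of the 0/1 values

print(count_category_1, proportion_category_1, proportion_category_2)
```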

Categorical Data and Sample Proportions As with x/n, p is a quantity between 0 and 1, and while x/n is a sample characteristic, p is a characteristic of the population. The relationship between the two parallels the relationship between $\bar{x}$ and $\mu$ and between $\tilde{x}$ and $\tilde{\mu}$. In particular, we will subsequently use x/n to make inferences about p. Example: If a sample of 100 car owners reveals that 22 have owned their car at least 5 years, then we might use 22/100 = 0.22 as a point estimate of the proportion of all owners who have owned their car at least 5 years. With k categories (k > 2), we can use the k sample proportions to answer questions about the population proportions $p_1, \ldots, p_k$.

1.4 Measures of Variability Some measures of variability:
- Range (max - min)
- Interquartile range (Q3 - Q1)
- Variance or standard deviation
Different samples or populations may have identical measures of center yet differ from one another in other important ways.

Sample Variance Definition: The sample variance of $x_1, x_2, \ldots, x_n$, denoted by $s^2$, is given by
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (6)
The sample standard deviation, denoted by s, is the (positive) square root of the variance:
$s = \sqrt{s^2}$ (7)
Remarks: $s^2$ and s are both nonnegative. The unit for s is the same as the unit for each of the $x_i$'s.

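Sample Variance Computing remark:
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right]$ (8)
$= \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$. (9)
Example: Find the variance and standard deviation of
154 142 137 133 122 126 135 135 108 120 127 134 122
Step 1: Form and find $\sum_{i=1}^{n} x_i^2 = 154^2 + 142^2 + \cdots + 122^2 = 222581$
Step 2: With $\sum_{i=1}^{n} x_i = 1695$ (so $\bar{x} \approx 130.4$), calculate $n\bar{x}^2 = \left(\sum x_i\right)^2/n = 1695^2/13 \approx 221001.9$
Step 3: The variance is $s^2 = (222581 - 221001.9)/12 \approx 131.6$
The standard deviation is $s = \sqrt{131.6} \approx 11.5$.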
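Both forms of the variance can be checked against each other on the 13 values of the example. In this sketch the function names are illustrative; the shortcut version uses the sum of the observations rather than a rounded mean, since rounding the mean to 130.4 before squaring would noticeably change the result.

```python
import math

def variance_definition(xs):
    """Sample variance from the definition: sum of squared deviations / (n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Sample variance from the computing formula using sum(x^2) and (sum x)^2 / n."""
    n = len(xs)
    sum_sq = sum(x * x for x in xs)
    total = sum(xs)
    return (sum_sq - total ** 2 / n) / (n - 1)

data = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]

s2_def = variance_definition(data)
s2_short = variance_shortcut(data)
s = math.sqrt(s2_def)
print(sum(x * x for x in data), sum(data), round(s2_def, 1), round(s, 1))
```

The shortcut form needs only one pass over the data, but it subtracts two large, nearly equal numbers, so the definitional form is often the safer choice in floating-point arithmetic.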

Mean and Variance Proposition: Let $x_1, x_2, \ldots, x_n$ be a sample and c be any nonzero constant.
If $y_i = x_i + c$ for i = 1,..., n, then $\bar{y} = \bar{x} + c$ and $s_y^2 = s_x^2$.
If $y_i = c x_i$ for i = 1,..., n, then $\bar{y} = c\bar{x}$, $s_y^2 = c^2 s_x^2$, and $s_y = |c| s_x$,
where $s_x^2$ and $s_y^2$ are the sample variances of the x's and the y's, respectively.
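The proposition can be verified numerically with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor. In this sketch the data are the first five crack lengths from Section 1.3 and the constant c = -2.5 is a made-up value, chosen negative so that the absolute value in the standard deviation rule matters.

```python
import math
import statistics

x = [16.1, 9.6, 24.9, 20.4, 12.7]
c = -2.5  # hypothetical constant, negative on purpose

shifted = [xi + c for xi in x]  # y_i = x_i + c
scaled = [c * xi for xi in x]   # y_i = c * x_i

# Shifting moves the mean by c and leaves the variance unchanged.
shift_ok = (
    math.isclose(statistics.mean(shifted), statistics.mean(x) + c)
    and math.isclose(statistics.variance(shifted), statistics.variance(x))
)

# Scaling multiplies the mean by c, the variance by c^2, the std. dev. by |c|.
scale_ok = (
    math.isclose(statistics.mean(scaled), c * statistics.mean(x))
    and math.isclose(statistics.variance(scaled), c ** 2 * statistics.variance(x))
    and math.isclose(statistics.stdev(scaled), abs(c) * statistics.stdev(x))
)
print(shift_ok, scale_ok)
```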

Five-Number Summaries and Boxplots With $x_1, x_2, \ldots, x_n$, the five-number summary is given by
(minimum, first quartile Q1, median, third quartile Q3, maximum)
or, equivalently,
(smallest $x_i$, lower fourth, median, upper fourth, largest $x_i$).
A boxplot (also known as a box-and-whisker plot) is a way of graphically depicting groups of numerical data through their five-number summaries. Remark: A boxplot may also indicate which observations, if any, might be considered outliers. Using the Bidri data set, we have

Five-Number Summaries and Boxplots

Comparative Boxplots Suppose we have two sets of data:
x: 8.87 4.98 11.23 21.03 10.33 -4.03 9.70 7.67 11.64 1.73
   1.78 4.83 -1.63 13.52 4.12 5.69 13.91 7.56 17.15 8.84
   13.08 8.18 10.28 4.67 16.54 12.18 2.97 9.35 10.70 10.91
y: 8.94 5.61 6.44 15.38 6.60 5.81 9.33 10.93 7.69 4.98
   15.16 7.87 6.04 4.74 4.81 8.68 5.12 8.93 18.89 8.33
   4.10 11.77 8.37 6.50 3.90 11.98 8.02 5.89 6.35 8.43
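The numbers behind a comparative boxplot are the two five-number summaries. The sketch below computes them for x and y under one assumed convention: the lower and upper fourths are the medians of the lower and upper halves of the ordered data, with the middle observation included in both halves when n is odd. Other quartile definitions would give slightly different fourths. The function names are illustrative.

```python
def _median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    """(min, lower fourth, median, upper fourth, max).

    Assumed convention: fourths are medians of the lower and upper halves,
    sharing the middle value between both halves when n is odd.
    """
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2
    lower, upper = ordered[:half], ordered[n - half:]
    return (ordered[0], _median_sorted(lower), _median_sorted(ordered),
            _median_sorted(upper), ordered[-1])

x = [8.87, 4.98, 11.23, 21.03, 10.33, -4.03, 9.70, 7.67, 11.64, 1.73,
     1.78, 4.83, -1.63, 13.52, 4.12, 5.69, 13.91, 7.56, 17.15, 8.84,
     13.08, 8.18, 10.28, 4.67, 16.54, 12.18, 2.97, 9.35, 10.70, 10.91]
y = [8.94, 5.61, 6.44, 15.38, 6.60, 5.81, 9.33, 10.93, 7.69, 4.98,
     15.16, 7.87, 6.04, 4.74, 4.81, 8.68, 5.12, 8.93, 18.89, 8.33,
     4.10, 11.77, 8.37, 6.50, 3.90, 11.98, 8.02, 5.89, 6.35, 8.43]

summary_x = five_number_summary(x)
summary_y = five_number_summary(y)
print("x:", summary_x)
print("y:", summary_y)
```

Comparing the two tuples side by side shows what the boxplots show graphically: where each sample is centered and how widely its middle half and its extremes are spread.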

Exercise 78 Consider a sample $x_1, \ldots, x_n$ and suppose that the values of $\bar{x}$, $s_x^2$, and $s_x$ have been calculated.
a. Let $y_i = x_i - \bar{x}$ for i = 1,..., n. How do the values $s_y^2$ and $s_y$ for the $y_i$'s compare to $s_x^2$ and $s_x$? Explain or justify.
b. Let $z_i = (x_i - \bar{x})/s_x$ for i = 1,..., n. What are $s_z^2$ and $s_z$, the variance and standard deviation for the $z_i$'s?