# A Review of Statistical Outlier Methods

## Transcription

Nov 2, 2006. By: Steven Walfish. Pharmaceutical Technology.

Statistical outlier detection has become a popular topic as a result of the US Food and Drug Administration's out-of-specification (OOS) guidance and increasing emphasis on the OOS procedures of pharmaceutical companies. When a test fails to meet its specifications, the initial response is to conduct a laboratory investigation to seek an assignable cause. As part of that investigation, an analyst looks for an observation in the data that could be classified as an outlier.

The FDA guidance "Investigating Out of Specification (OOS) Test Results for Pharmaceutical Production" and the US Pharmacopeia are clear that a chemical result cannot be omitted with an outlier test, but that a bioassay can be omitted with an outlier test (1). The two areas specifically prohibited from outlier tests are content uniformity and dissolution testing.

An outlier is defined as an observation that "appears" to be inconsistent with other observations in the data set (2). An outlier has a low probability that it originates from the same statistical distribution as the other observations in the data set. On the other hand, an extreme value is an observation that might have a low probability of occurrence but cannot be statistically shown to originate from a different distribution than the rest of the data.

### Why study outliers

Outliers can provide useful information about the process. An outlier can be created by a shift in the location (mean) or in the scale (variability) of the process. Though an observation in a particular sample might be a candidate as an outlier, the process might have shifted. Sometimes, the spurious result is a gross recording error or a measurement error. Measurement systems should be shown to be capable for the process they measure.
Outliers also come from incorrect specifications that are based on the wrong distributional assumptions at the time the specifications are generated.

### How to handle outliers

Once an observation is identified by means of graphical or visual inspection as a potential outlier, root cause analysis should begin to determine whether an assignable cause can be found for the spurious result. If no root cause can be determined, and a retest can be justified, the potential outlier should be recorded for future evaluation as more data become available. Often, values that seem to be outliers are the right tail of a skewed distribution. When reporting results, it is prudent to report conclusions with and without the suspected outlier in the analysis. A statistical analysis alone, without an assignable cause, is not sufficient justification for discarding data points. Robust or nonparametric statistical methods are alternative methods for analysis. Robust statistical methods such as weighted least-squares regression minimize the effect of an outlier observation (3).

There are various approaches to outlier detection depending on the application and number of observations in the data set. Iglewicz and Hoaglin provide a comprehensive text about labeling, accommodation, and identification of outliers (4). Visual inspection alone cannot always identify an outlier and can lead to mislabeling an observation as an outlier. Using a specific function of the observations leads to a superior outlier labeling rule. Because classical estimators such as the mean are highly sensitive to outliers, statistical methods were developed to accommodate outliers and to reduce their impact on the analysis. Some of the more commonly used identification methods are discussed in this article.

### Box plot

A box plot is a graphical representation of the dispersion of the data. The graphic represents the lower quartile ($Q_1$) and upper quartile ($Q_3$) along with the median. The median is the 50th percentile of the data. The lower quartile is the 25th percentile, and the upper quartile is the 75th percentile. The upper and lower fences usually are set at a fixed multiple of the interquartile range ($Q_3 - Q_1$) beyond the quartiles. Figure 1 shows the upper and lower fences set at 1.5 times the interquartile range. Any observation outside these fences is considered a potential outlier. Even when data are not normally distributed, a box plot can be used because it depends on the median and not the mean of the data.

[Figure 1: box plot showing the upper and lower fences]

### Trimmed means

A trimmed mean is calculated by discarding a certain percentage of the lowest and the highest scores and then computing the mean of the remaining scores. The trimmed mean has the advantage of being relatively resistant to outliers. When outliers are present in the data, trimmed means are robust estimators of the population mean that are relatively insensitive to the outlying values. After viewing the box plot, a potential outlier might be identified. If the upper and lower 5% of the data are removed, the result is a 10% trimmed mean. An example of trimmed means is the recent change in the Olympic scoring system for ice skating, in which the highest and lowest scores are eliminated, and the mean of the remaining scores is used to assess skaters' scores. If a trimmed mean is presented, the untrimmed mean should be presented for comparison.

### Extreme studentized deviate

The extreme studentized deviate (ESD) test is quite good at identifying a single outlier in a normal sample. The maximum deviation from the mean,

$$R = \max_i \frac{|x_i - \bar{x}|}{s},$$

where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation, is calculated and compared with a tabled value (see Table I).
If the maximum deviation is greater than the tabled value, then the observation is removed, and the procedure is repeated. If no observation exceeds the tabled value, then we cannot claim there is a statistical outlier. A drawback of this method is that it requires the assumption of a normal data distribution. Usually, this assumption holds true as the sample size gets larger, though a formal test such as the Anderson-Darling method can be used to test the assumption (5). This approach can be generalized to investigate multiple outliers simultaneously.

Table I is an example of 10 observations (raw data). Based on Table II, the critical value for N = 10 at an α level of 0.05 is 2.29. Therefore, the data value 16.3 is an outlier because it corresponds to a studentized deviation of 2.49, which exceeds the 2.29 critical value.
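The three labeling tools described so far (box-plot fences, a trimmed mean, and the ESD statistic) can be sketched in a few lines of Python. This is a minimal illustration rather than the article's own calculation: the observations are hypothetical and are not the Table I values, the function names are invented for this sketch, and the quartile convention is an assumption because statistical packages compute quartiles in slightly different ways.

```python
import statistics

# Hypothetical data for illustration only; these are NOT the Table I values.
observations = [10.1, 10.3, 9.8, 10.0, 10.2, 9.9, 10.4, 10.1, 9.7, 12.5]


def boxplot_fences(data, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    # The "inclusive" quartile convention is an assumption; packages differ.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trimmed_mean(data, fraction_per_tail):
    """Mean of the data after dropping a fraction of points from each tail."""
    if not 0 <= fraction_per_tail < 0.5:
        raise ValueError("fraction_per_tail must be in [0, 0.5)")
    ordered = sorted(data)
    cut = int(len(ordered) * fraction_per_tail)  # points dropped per tail
    return statistics.mean(ordered[cut:len(ordered) - cut])


def esd_statistic(data):
    """Return the most extreme value and its studentized deviation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    suspect = max(data, key=lambda x: abs(x - mean))
    return suspect, abs(suspect - mean) / sd


lower, upper = boxplot_fences(observations)
flagged = [x for x in observations if x < lower or x > upper]
suspect, deviation = esd_statistic(observations)

print("fences:", lower, upper, "flagged:", flagged)
print("mean:", statistics.mean(observations))
print("trimmed mean (one point per tail):", trimmed_mean(observations, 0.10))
print("ESD suspect:", suspect, "studentized deviation:", round(deviation, 2))
# Compare `deviation` with the tabled critical value (2.29 for N = 10 at
# alpha = 0.05, per the text) before labeling the point a statistical outlier.
```

Printing the untrimmed mean next to the trimmed mean follows the article's advice to report both. The critical value is left as a manual comparison because it comes from a published table and not from a formula in this sketch.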

Table I: Example of extreme studentized deviate test.

Table II: Critical values for the extreme studentized deviate test (Reference 4).

### Dixon-type tests

Dixon-type tests are based on the ratio of the ranges. These tests are flexible enough to allow for specific observations to be tested. They also perform well with small sample sizes. Because they are based on order statistics, there is no need to assume normality of the data. Depending on the number of suspected outliers, different ratios are used to identify potential outliers. The first class of ratios, $r_{10}$, is used when the suspected outlier is the largest or smallest observation. The second set of ratios, $r_{11}$, is used when the potential observation is the second smallest or second largest. Situations like these arise because of masking. Masking occurs when several observations are close together, but the group of observations is still outlying from the rest of the data. Masking is a common phenomenon, especially for bimodal data (i.e., data from two distributions). There are additional sets of ratios depending on how many masked points are excluded (6).

The following equations are the $r_{10}$ and $r_{11}$ ratios, where $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$ denote the ordered observations.

Testing the largest observation as an outlier:

$$r_{10} = \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}$$

Testing the smallest observation as an outlier:

$$r_{10} = \frac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}}$$

Testing the largest observation as an outlier avoiding the smallest observation:

$$r_{11} = \frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(2)}}$$

Testing the smallest observation as an outlier avoiding the largest observation:

$$r_{11} = \frac{x_{(2)} - x_{(1)}}{x_{(n-1)} - x_{(1)}}$$
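The four ratios named above can be computed directly from the sorted sample. The sketch below assumes the standard definitions of Dixon's r10 and r11 ratios (gap to the nearest neighbor divided by a range that optionally skips the opposite extreme). The data are hypothetical, not the Table I values, and the function name `dixon_ratios` is invented for this illustration.

```python
def dixon_ratios(data):
    """Return Dixon's r10 and r11 ratios for both ends of the sample."""
    x = sorted(data)
    if len(x) < 4:
        raise ValueError("need at least 4 observations")
    if x[-1] == x[1] or x[-2] == x[0]:
        raise ValueError("a range in a denominator is zero")
    gap_high = x[-1] - x[-2]  # largest value minus its nearest neighbor
    gap_low = x[1] - x[0]     # nearest neighbor minus smallest value
    return {
        "r10_largest": gap_high / (x[-1] - x[0]),
        "r10_smallest": gap_low / (x[-1] - x[0]),
        # r11 leaves the opposite extreme out of the range
        "r11_largest": gap_high / (x[-1] - x[1]),
        "r11_smallest": gap_low / (x[-2] - x[0]),
    }


# Hypothetical data for illustration only; these are NOT the Table I values.
observations = [10.1, 10.3, 9.8, 10.0, 10.2, 9.9, 10.4, 10.1, 9.7, 12.5]

ratios = dixon_ratios(observations)
for name, value in ratios.items():
    print(name, round(value, 3))
# Each ratio is then compared with the tabled critical value for the sample
# size and alpha level; those values are not reproduced in this sketch.
```

Returning all four ratios at once keeps the choice of which end to test, and whether to skip the opposite extreme, with the analyst, which matches the article's point that specific observations can be targeted.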

If the distance from the potential outlier to its nearest neighbor is large enough, the observation is considered an outlier. Table III shows the critical values for the $r_{10}$ and $r_{11}$ ratios. Using the data set in Table I, the Dixon-type test can be used to determine whether 16.3 is a potential outlier. For $r_{10}$, the test statistic is ( )/( ), which is equal to [value missing] and is greater than the tabled value of [value missing]. Therefore, for a sample size of 10, 16.3 is a statistical outlier.

Table III: Critical value for Dixon tests (α = 0.05) (Reference 6).

### Outliers in regression

Regression analysis or least-squares estimation is a statistical technique to estimate a linear relationship between two variables. This technique is highly sensitive to outliers and influential observations. Outliers in regression can overstate the coefficient of determination ($R^2$), give erroneous values for the slope and intercept, and in many cases lead to false conclusions about the model. Outliers in regression usually are detected with graphical methods such as residual plots, including plots of deleted residuals. A common statistical test is Cook's distance measure, which provides an overall measure of the impact of an observation on the estimated regression coefficients (7). Just because the residual plot or Cook's distance measure identifies an observation as an outlier does not mean one should automatically eliminate the point. One should fit the regression equation with and without the suspect point, compare the coefficients of the two models, and review the mean-squared error and $R^2$ from each.

### Summary

Table IV: Comparison of methods.

Various methods for detecting outliers can be used at several points during an analysis. The first step in outlier detection is to plot the data using methods such as histograms, scatter plots, or a box plot. The extreme studentized deviate test is an excellent test for data sets that have more than 10 observations from a normally distributed sample.
For fewer than 10 observations, the Dixon test is a good method and does not require distributional assumptions. When performing regression analysis, always review residual plots to ensure no outliers are affecting the model coefficients and inflating the $R^2$ value. Table IV summarizes these methods and their ease of use.

Steven Walfish is the president of Statistical Outsourcing Services, 403 King Farm Blvd., Suite 201, Rockville, MD 20850.

Submitted: June 12. Accepted: Aug. 14.

Keywords: analytical testing, manufacturing, statistics

### References

1. USP 29 NF 24 (US Pharmacopeial Convention, Rockville, MD, 2006).
2. V. Barnett and T. Lewis, Outliers in Statistical Data (John Wiley & Sons, 2d ed., New York, NY, 1985).

3. P.J. Rousseeuw and A.M. Leroy, Robust Regression and Outlier Detection (John Wiley & Sons, New York, NY, 1987).
4. B. Iglewicz and D.C. Hoaglin, How to Detect and Handle Outliers (American Society for Quality Control, Milwaukee, WI, 1993).
5. M.A. Stephens, "Tests Based on EDF Statistics," in Goodness-of-Fit Techniques, R.B. D'Agostino and M.A. Stephens, Eds. (Marcel Dekker, New York, NY, 1986).
6. CRC Standard Probability and Statistics, W.H. Beyer, Ed. (CRC Press, Boca Raton, FL).
7. J. Neter et al., Applied Linear Statistical Models (Irwin, Chicago, IL, 1985).

### How Far is too Far? Statistical Outlier Detection

How Far is too Far? Statistical Outlier Detection Steven Walfish President, Statistical Outsourcing Services steven@statisticaloutsourcingservices.com 30-325-329 Outline What is an Outlier, and Why are

### We will use the following data sets to illustrate measures of center. DATA SET 1 The following are test scores from a class of 20 students:

MODE The mode of the sample is the value of the variable having the greatest frequency. Example: Obtain the mode for Data Set 1 77 For a grouped frequency distribution, the modal class is the class having

### Analytical Methods: A Statistical Perspective on the ICH Q2A and Q2B Guidelines for Validation of Analytical Methods

Page 1 of 6 Analytical Methods: A Statistical Perspective on the ICH Q2A and Q2B Guidelines for Validation of Analytical Methods Dec 1, 2006 By: Steven Walfish BioPharm International ABSTRACT Vagueness

### UNDERSTANDING MULTIPLE REGRESSION

UNDERSTANDING Multiple regression analysis (MRA) is any of several related statistical methods for evaluating the effects of more than one independent (or predictor) variable on a dependent (or outcome)

### Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA

Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA Introduction A common question faced by many actuaries when selecting loss development factors is whether to base the selected

### Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

### Applying Statistics Recommended by Regulatory Documents

Applying Statistics Recommended by Regulatory Documents Steven Walfish President, Statistical Outsourcing Services steven@statisticaloutsourcingservices.com 301-325 325-31293129 About the Speaker Mr. Steven

### Data Mining Part 2. Data Understanding and Preparation 2.1 Data Understanding Spring 2010

Data Mining Part 2. and Preparation 2.1 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Outline Introduction Measuring the Central Tendency Measuring the Dispersion of Data Graphic Displays References

### 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

### Geostatistics Exploratory Analysis

Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

### 2008 USPC Official 12/1/07-4/30/08 General Chapters: <1010> ANALYTICAL DA ANALYTICAL DATA INTERPRETATION AND TREATMENT INTRODUCTION

2008 USPC Official 12/1/07-4/30/08 General Chapters: ANALYTICAL DA... Page 1 of 29 1010 ANALYTICAL DATA INTERPRETATION AND TREATMENT INTRODUCTION This chapter provides information regarding acceptable

### Name: Date: Use the following to answer questions 2-3:

Name: Date: 1. A study is conducted on students taking a statistics class. Several variables are recorded in the survey. Identify each variable as categorical or quantitative. A) Type of car the student

### Numerical Summarization of Data OPRE 6301

Numerical Summarization of Data OPRE 6301 Motivation... In the previous session, we used graphical techniques to describe data. For example: While this histogram provides useful insight, other interesting

### Exercise 1.12 (Pg. 22-23)

Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

### Can Gas Prices be Predicted?

Can Gas Prices be Predicted? Chris Vaughan, Reynolds High School A Statistical Analysis of Annual Gas Prices from 1976-2005 Level/Course: This lesson can be used and modified for teaching High School Math,

### 1.5 NUMERICAL REPRESENTATION OF DATA (Sample Statistics)

1.5 NUMERICAL REPRESENTATION OF DATA (Sample Statistics) As well as displaying data graphically we will often wish to summarise it numerically particularly if we wish to compare two or more data sets.

### Descriptive Statistics. Understanding Data: Categorical Variables. Descriptive Statistics. Dataset: Shellfish Contamination

Descriptive Statistics Understanding Data: Dataset: Shellfish Contamination Location Year Species Species2 Method Metals Cadmium (mg kg - ) Chromium (mg kg - ) Copper (mg kg - ) Lead (mg kg - ) Mercury

### DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,

### Numerical Measures of Central Tendency

Numerical Measures of Central Tendency Often, it is useful to have special numbers which summarize characteristics of a data set These numbers are called descriptive statistics or summary statistics. A

### Chapter 2: Exploring Data with Graphs and Numerical Summaries. Graphical Measures- Graphs are used to describe the shape of a data set.

Page 1 of 16 Chapter 2: Exploring Data with Graphs and Numerical Summaries Graphical Measures- Graphs are used to describe the shape of a data set. Section 1: Types of Variables In general, variable can

### Chapter 3: Central Tendency

Chapter 3: Central Tendency Central Tendency In general terms, central tendency is a statistical measure that determines a single value that accurately describes the center of the distribution and represents

### MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

### Descriptive Statistics

Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

### Data Analysis: Describing Data - Descriptive Statistics

WHAT IT IS Return to Table of ontents Descriptive statistics include the numbers, tables, charts, and graphs used to describe, organize, summarize, and present raw data. Descriptive statistics are most

### Chapter 3 Descriptive Statistics: Numerical Measures. Learning objectives

Chapter 3 Descriptive Statistics: Numerical Measures Slide 1 Learning objectives 1. Single variable Part I (Basic) 1.1. How to calculate and use the measures of location 1.. How to calculate and use the

### F. Farrokhyar, MPhil, PhD, PDoc

Learning objectives Descriptive Statistics F. Farrokhyar, MPhil, PhD, PDoc To recognize different types of variables To learn how to appropriately explore your data How to display data using graphs How

### MEAN 34 + 31 + 37 + 44 + 38 + 34 + 42 + 34 + 43 + 41 = 378 MEDIAN

MEASURES OF CENTRAL TENDENCY MEASURES OF CENTRAL TENDENCY The measures of central tendency are numbers that locate the center of a set of data. The three most common measures of center are mean, median

### Statistics S1 Advanced Subsidiary

Paper Reference(s) 6683 Edexcel GCE Statistics S1 Advanced Subsidiary Thursday 9 June 2005 Morning Time: 1 hour 30 minutes Materials required for examination Mathematical Formulae (Lilac) Graph Paper (ASG2)

### Variables and Data A variable contains data about anything we measure. For example; age or gender of the participants or their score on a test.

The Analysis of Research Data The design of any project will determine what sort of statistical tests you should perform on your data and how successful the data analysis will be. For example if you decide

### COMMON CORE STATE STANDARDS FOR

COMMON CORE STATE STANDARDS FOR Mathematics (CCSSM) High School Statistics and Probability Mathematics High School Statistics and Probability Decisions or predictions are often based on data numbers in

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### ! x sum of the entries

3.1 Measures of Central Tendency (Page 1 of 16) 3.1 Measures of Central Tendency Mean, Median and Mode! x sum of the entries a. mean, x = = n number of entries Example 1 Find the mean of 26, 18, 12, 31,

### Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

### AP * Statistics Review. Descriptive Statistics

AP * Statistics Review Descriptive Statistics Teacher Packet Advanced Placement and AP are registered trademark of the College Entrance Examination Board. The College Board was not involved in the production

### business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

### Describing Data. We find the position of the central observation using the formula: position number =

HOSP 1207 (Business Stats) Learning Centre Describing Data This worksheet focuses on describing data through measuring its central tendency and variability. These measurements will give us an idea of what

### Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

### This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

### Chapter 3: Data Description Numerical Methods

Chapter 3: Data Description Numerical Methods Learning Objectives Upon successful completion of Chapter 3, you will be able to: Summarize data using measures of central tendency, such as the mean, median,

### DETECTION AND TREATMENT OF OUTLIERS IN DATA SETS

Iraqi Journal of Statistical Science (9) 2006 P.P. [58-74] DETECTION AND TREATMENT OF OUTLIERS IN DATA SETS Tara Ahmed H. Chawsheen * Ivan Subhi Latif ** ABSTRACT In this paper, we shall try to determine

### II. DISTRIBUTIONS distribution normal distribution. standard scores

Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

### DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 seema@iasri.res.in 1. Descriptive Statistics Statistics

### Descriptive statistics; Correlation and regression

Descriptive statistics; and regression Patrick Breheny September 16 Patrick Breheny STA 580: Biostatistics I 1/59 Tables and figures Descriptive statistics Histograms Numerical summaries Percentiles Human

### High School Algebra 1 Common Core Standards & Learning Targets

High School Algebra 1 Common Core Standards & Learning Targets Unit 1: Relationships between Quantities and Reasoning with Equations CCS Standards: Quantities N-Q.1. Use units as a way to understand problems

### Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Introduction to Statistics for Psychology and Quantitative Methods for Human Sciences Jonathan Marchini Course Information There is website devoted to the course at http://www.stats.ox.ac.uk/ marchini/phs.html

### 4. DESCRIPTIVE STATISTICS. Measures of Central Tendency (Location) Sample Mean

4. DESCRIPTIVE STATISTICS Descriptive Statistics is a body of techniques for summarizing and presenting the essential information in a data set. Eg: Here are daily high temperatures for Jan 6, 29 in U.S.

### Regression. In this class we will:

AMS 5 REGRESSION Regression The idea behind the calculation of the coefficient of correlation is that the scatter plot of the data corresponds to a cloud that follows a straight line. This idea can be

### Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

### By Hui Bian Office for Faculty Excellence

By Hui Bian Office for Faculty Excellence 1 Email: bianh@ecu.edu Phone: 328-5428 Location: 2307 Old Cafeteria Complex 2 When want to predict one variable from a combination of several variables. When want

### A frequency distribution is a table used to describe a data set. A frequency table lists intervals or ranges of data values called data classes

A frequency distribution is a table used to describe a data set. A frequency table lists intervals or ranges of data values called data classes together with the number of data values from the set that

### Box plots & t-tests. Example

Box plots & t-tests Box Plots Box plots are a graphical representation of your sample (easy to visualize descriptive statistics); they are also known as box-and-whisker diagrams. Any data that you can

Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm

### Lecture 2. Summarizing the Sample

Lecture 2 Summarizing the Sample WARNING: Today s lecture may bore some of you It s (sort of) not my fault I m required to teach you about what we re going to cover today. I ll try to make it as exciting

### Descriptive Statistics

Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

### Stats Review Chapters 3-4

Stats Review Chapters 3-4 Created by Teri Johnson Math Coordinator, Mary Stangler Center for Academic Success Examples are taken from Statistics 4 E by Michael Sullivan, III And the corresponding Test

### EXAM #1 (Example) Instructor: Ela Jackiewicz. Relax and good luck!

STP 231 EXAM #1 (Example) Instructor: Ela Jackiewicz Honor Statement: I have neither given nor received information regarding this exam, and I will not do so until all exams have been graded and returned.

### Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Center: Finding the Median When we think of a typical value, we usually look for the center of the distribution. For a unimodal, symmetric distribution, it s easy to find the center it s just the center

### FE670 Algorithmic Trading Strategies. Stevens Institute of Technology

FE670 Algorithmic Trading Strategies Lecture 3. Factor Models and Their Estimation Steve Yang Stevens Institute of Technology 09/19/2013 Outline 1 Factor Based Trading 2 Risks to Trading Strategies 3 Desirable

### Content DESCRIPTIVE STATISTICS. Data & Statistic. Statistics. Example: DATA VS. STATISTIC VS. STATISTICS

Content DESCRIPTIVE STATISTICS Dr Najib Majdi bin Yaacob MD, MPH, DrPH (Epidemiology) USM Unit of Biostatistics & Research Methodology School of Medical Sciences Universiti Sains Malaysia. Introduction

### ESTIMATING THE DISTRIBUTION OF DEMAND USING BOUNDED SALES DATA

ESTIMATING THE DISTRIBUTION OF DEMAND USING BOUNDED SALES DATA Michael R. Middleton, McLaren School of Business, University of San Francisco 0 Fulton Street, San Francisco, CA -00 -- middleton@usfca.edu

### Lecture 1: Review and Exploratory Data Analysis (EDA)

Lecture 1: Review and Exploratory Data Analysis (EDA) Sandy Eckel seckel@jhsph.edu Department of Biostatistics, The Johns Hopkins University, Baltimore USA 21 April 2008 1 / 40 Course Information I Course

### 10-3 Measures of Central Tendency and Variation

10-3 Measures of Central Tendency and Variation So far, we have discussed some graphical methods of data description. Now, we will investigate how statements of central tendency and variation can be used.

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity

### A.2 Measures of Central Tendency and Dispersion

Section A. Measures of Central Tendency and Dispersion A A. Measures of Central Tendency and Dispersion What you should learn How to find and interpret the mean, median, and mode of a set of data How to

### , then the form of the model is given by: which comprises a deterministic component involving the three regression coefficients (

Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we

### Step-by-Step Analytical Methods Validation and Protocol in the Quality System Compliance Industry

Step-by-Step Analytical Methods Validation and Protocol in the Quality System Compliance Industry BY GHULAM A. SHABIR Introduction Methods Validation: Establishing documented evidence that provides a high

### Describe what is meant by a placebo Contrast the double-blind procedure with the single-blind procedure Review the structure for organizing a memo

Readings: Ha and Ha Textbook - Chapters 1 8 Appendix D & E (online) Plous - Chapters 10, 11, 12 and 14 Chapter 10: The Representativeness Heuristic Chapter 11: The Availability Heuristic Chapter 12: Probability

### UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test March 2014

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Midterm Test March 2014 STAB22H3 Statistics I Duration: 1 hour and 45 minutes Last Name: First Name: Student number: Aids

### Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY 1. Introduction Besides arriving at an appropriate expression of an average or consensus value for observations of a population, it is important to

### Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)

### 3. Data Analysis, Statistics, and Probability

3. Data Analysis, Statistics, and Probability Data and probability sense provides students with tools to understand information and uncertainty. Students ask questions and gather and use data to answer

### The Big 50 Revision Guidelines for S1

The Big 50 Revision Guidelines for S1 If you can understand all of these you ll do very well 1. Know what is meant by a statistical model and the Modelling cycle of continuous refinement 2. Understand

### STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

### 2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

### Statistics. Measurement. Scales of Measurement 7/18/2012

Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variable is something that varies (eye color), a constant does

### Changes to UK NEQAS Leucocyte Immunophenotyping Chimerism Performance Monitoring Systems From April 2014. Uncontrolled Copy

Changes to UK NEQAS Leucocyte Immunophenotyping Chimerism Performance Monitoring Systems From April 2014 Contents 1. The need for change 2. Current systems 3. Proposed z-score system 4. Comparison of z-score

### Seminar paper Statistics

Seminar paper Statistics The seminar paper must contain: - the title page - the characterization of the data (origin, reason why you have chosen this analysis,...) - the list of the data (in the table)

### Outlier Detection in Multivariate Data

Applied Mathematical Sciences, Vol. 9, 2015, no. 47, 2317-2324 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2015.53213 Outlier Detection in Multivariate Data K. Senthamarai Kannan and K.

### GCSE HIGHER Statistics Key Facts

GCSE HIGHER Statistics Key Facts Collecting Data When writing questions for questionnaires, always ensure that: 1. the question is worded so that it will allow the recipient to give you the information

### Quantitative Data Analysis: Choosing a statistical test Prepared by the Office of Planning, Assessment, Research and Quality

Quantitative Data Analysis: Choosing a statistical test Prepared by the Office of Planning, Assessment, Research and Quality 1 To help choose which type of quantitative data analysis to use either before

### Multiple Regression: What Is It?

Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

### Local outlier detection in data forensics: data mining approach to flag unusual schools

Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential

### Final Exam Practice Problem Answers

Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal

### Module 3: Correlation and Covariance

Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašić University of Delaware Sarajevo Graduate School of Business Often our interest in data analysis

### MEASURES OF LOCATION AND SPREAD

Paper TU04 An Overview of Non-parametric Tests in SAS: When, Why, and How Paul A. Pappas and Venita DePuy Durham, North Carolina, USA ABSTRACT Most commonly used statistical procedures are based on the

### x Measures of Central Tendency for Ungrouped Data Chapter 3 Numerical Descriptive Measures Example 3-1 Example 3-1: Solution

Chapter 3 Numerical Descriptive Measures 3.1 Measures of Central Tendency for Ungrouped Data 3.2 Measures of Dispersion for Ungrouped Data 3.3 Mean, Variance, and Standard Deviation for Grouped Data 3.4

### Section 3.1 Measures of Central Tendency: Mode, Median, and Mean

Section 3.1 Measures of Central Tendency: Mode, Median, and Mean One number can be used to describe the entire sample or population. Such a number is called an average. There are many ways to compute averages,

### AP Statistics 2001 Solutions and Scoring Guidelines

AP Statistics 2001 Solutions and Scoring Guidelines The materials included in these files are intended for non-commercial use by AP teachers for course and exam preparation; permission for any other use

### Why factorial methods don't work well for mixtures. Case study illustrates how to apply mixture design

1 Find the Optimal Formulation for Mixtures Discover sweet spots where multiple product specifications can be met in a most desirable way. Mark J. Anderson and Patrick J. Whitcomb Stat-Ease, Inc. 2021

### Treatment and analysis of data Applied statistics Lecture 3: Sampling and descriptive statistics

Treatment and analysis of data Applied statistics Lecture 3: Sampling and descriptive statistics Topics covered: Parameters and statistics Sample mean and sample standard deviation Order statistics and

### Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

### Data Exploration Data Visualization

Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

### MTH 140 Statistics Videos

MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative

### Chapter 15 Multiple Choice Questions (The answers are provided after the last question.)

Chapter 15 Multiple Choice Questions (The answers are provided after the last question.) 1. What is the median of the following set of scores? 18, 6, 12, 10, 14? a. 10 b. 14 c. 18 d. 12 2. Approximately

### Common Tools for Displaying and Communicating Data for Process Improvement

Common Tools for Displaying and Communicating Data for Process Improvement Packet includes: Tool Use Page # Box and Whisker Plot Check Sheet Control Chart Histogram Pareto Diagram Run Chart Scatter Plot

### Simple linear regression

Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

### Mean = (sum of the values / the number of values) if probabilities are equal

Population Mean Mean = (sum of the values / the number of values) if probabilities are equal Compute the population mean Population/Sample mean: 1. Collect the data 2. Sum all the values in the population/sample.