Methods of Dealing with Values Below the Limit of Detection using SAS
Carry W. Croghan, US-EPA, Research Triangle Park, NC, and Peter P. Egeghy, US-EPA, Las Vegas, NV

ABSTRACT

Due to limitations of chemical analysis procedures, small concentrations cannot be precisely measured. These concentrations are said to be below the limit of detection (LOD). In statistical analyses, these values are often censored and substituted with a constant value, such as half the LOD, the LOD divided by the square root of 2, or zero. These replacement methods produce a mixture of two distributions: a spike of identical substituted values below the LOD and the true distribution above it. As a result, the descriptive statistics can be questionable, depending upon the percentage of values below the LOD. An alternative approach uses the characteristics of the distribution of the values above the LOD to estimate the values below it, either with an extrapolation technique or with maximum likelihood estimation. An example program is presented that calculates the mean, standard deviation, t-test, and relative difference in the means for the various methods on the same data and compares the results. The extrapolation and maximum likelihood estimation techniques have smaller error rates than all of the standard replacement techniques. Although more computationally intensive, these methods produce more reliable descriptive statistics.

INTRODUCTION

A common problem faced by researchers analyzing data in the environmental and occupational health fields is how to appropriately handle observations reported to have nondetectable levels of a contaminant (Hornung and Reed, 1990). Even with technical advances in field sampling, sample processing protocols, and laboratory instrumentation, the desire to identify and measure contaminants down to extraordinarily low levels means that there will always be a threshold below which a value cannot be accurately quantified (Lambert et al., 1991).
As a value approaches zero there is a point where any signal due to the contaminant cannot be differentiated from the background noise; that is, the precision of the instrument and method is not sufficient to discern the presence of the compound and quantify it to an acceptable level of accuracy. While one can differentiate between an instrument limit of detection and a method limit of detection, it is the method limit of detection that is relevant to this discussion. By strict definition, the limit of detection (LOD) is the level at which a measurement has a 95% probability of being different from zero (Taylor, 1987). There is no universal procedure for determining a method LOD, but it is broadly defined as the concentration corresponding to the mean blank response (that is, the mean response produced by blank samples) plus three standard deviations of the blank response. Various identification criteria (for example, retention time on a specific analytical column, proper ratio of ions measured by a mass spectrometer, etc.) must also be met to confirm the presence of the compound of interest. While it is almost always possible to calculate values below the method LOD, these values are not considered to be statistically different from zero, and are generally reported as "below detection limit" or "nondetect". Although statistically different from zero, values near the LOD are generally less accurate and precise (that is, less reliable) than values that are much larger than the LOD. The limit of quantitation (LOQ) is the term often used by laboratories to indicate the smallest amount that they consider to be reliably quantifiable. There is often considerable confusion concerning the distinction between the two, but the LOQ is generally defined as some multiple of the LOD, usually three times the LOD.
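The blank-based definition above (mean blank response plus three standard deviations of the blank response) is straightforward to compute. A minimal sketch in Python, using hypothetical blank responses (the function name and data are illustrative, not from the paper):

```python
import statistics

def method_lod(blank_responses, k=3.0):
    """Broad method-LOD definition: mean blank response plus
    k standard deviations of the blank response (k = 3 above)."""
    mean_blank = statistics.mean(blank_responses)
    sd_blank = statistics.stdev(blank_responses)  # sample standard deviation
    return mean_blank + k * sd_blank

# Hypothetical responses from repeated blank samples
blanks = [0.12, 0.08, 0.15, 0.10, 0.11, 0.09, 0.13]
lod = method_lod(blanks)
loq = 3 * lod  # LOQ taken as three times the LOD, per the definition above
```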
Estimates of concentrations above the LOD but below the LOQ are potentially problematic because of their higher relative uncertainty (compared to values above the LOQ), yet these values are rarely treated any differently from values above the LOQ. Unfortunately, below-LOD values are not as easy to use in data analysis as values between the LOD and the LOQ. First, there is the standard laboratory practice of reporting "nondetect" instead of numbers for values below the LOD (often referred to as "censoring"). Second, even if numbers were reported, they may be highly unreliable (though many researchers might prefer highly unreliable over nothing). Thus, a researcher faced with data containing nondetects must decide how to appropriately use them in statistical analyses. One might reason that since the actual values of these observations are extraordinarily small, they are unimportant. However, these observations can still have a large effect on the parameters of the distribution of the entire set of observations. Treating them incorrectly may introduce severe bias when estimating the mean and variance of the distribution (Lyles et al., 2001), which may consequently distort regression coefficients and their standard errors and reduce power in hypothesis tests (Hughes, 2000). Depending upon the severity of the censoring, the distributional statistics may be highly questionable. The problem of estimating the parameters of distributions with censored data has been extensively studied (Cohen, 1959; Gleit, 1985; Helsel and Gilliom, 1986; Helsel, 1990; Hornung and Reed, 1990; Lambert, 1991; Özkaynak et al., 1991; Finkelstein and Verma, 2001). The main approaches to handling left-censored data are: 1) simple replacement, 2) extrapolation, and 3) maximum likelihood estimation. The most common, and easiest, strategy is simple replacement, where censored values are replaced with zero, with some fraction of the detection limit (usually either 1/2 or 1/√2), or with the detection limit itself.
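The three replacement constants named above are simple to apply. A short Python sketch of simple replacement (the data values and LOD here are hypothetical):

```python
import math

def substitute_nondetects(values, lod, method):
    """Replace each below-LOD value with a constant: zero, LOD/2,
    or LOD/sqrt(2) -- the common simple-replacement choices."""
    constants = {"zero": 0.0, "half": lod / 2.0, "sqrt2": lod / math.sqrt(2.0)}
    repl = constants[method]
    return [v if v >= lod else repl for v in values]

data = [0.4, 1.2, 0.1, 3.5, 0.7]  # hypothetical measurements with LOD = 0.5
print(substitute_nondetects(data, 0.5, "zero"))  # [0.0, 1.2, 0.0, 3.5, 0.7]
print(substitute_nondetects(data, 0.5, "half"))  # [0.25, 1.2, 0.25, 3.5, 0.7]
```

Every censored observation receives the same constant, which is what produces the spike of identical values described in the abstract.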
Extrapolation strategies use regression or probability plotting techniques to calculate the mean and standard deviation from the regression line of the observed, above-LOD values on their rank scores. A third strategy, maximum likelihood estimation (MLE), is also based on the statistical properties of the noncensored portion of the data, but uses an iterative process to solve for the mean and variance. Although the latter two methods do not generate replacement data for the censored observations, they can be modified to do so if needed. The SAS language allows for both the statistical analyses and the data manipulations required to perform these techniques. Using SAS procedures and options for the statistical computations reduces the coding that other languages would require, and the flexibility in outputting the information generated by the procedures allows the additional manipulations to be made easily within a DATA step. The macro capability of SAS is pivotal for testing and comparing the different methods of handling censored data: macros make repetitive processing of procedures and DATA steps possible and allow parameters to be passed in. The flexibility and robustness of the SAS language make it a good choice for programming these types of data problems.
METHOD

To test the differences among the various techniques, distributions with known parameters are needed. Normally distributed datasets were therefore generated with known means and standard deviations using the RANNOR function in SAS, with either 500 or 1000 observations. The mean and standard deviation combinations were arbitrarily chosen to be 4 and 1.5, 10 and 5, and 50 and 5. Cut-off points were calculated at the 5th, 10th, 25th, 50th, and 75th percentiles of each distribution, determined using PROC MEANS; these cut-off points serve as surrogates for the limit of detection. These datasets were used to test each of the methods of dealing with censored values.

To test the simple replacement techniques, a series of datasets was generated with the values below the cut-off points replaced by the standard replacement values: zero, the cut-off value divided by two, or the cut-off value divided by the square root of two. The means of the new datasets were compared to those of the parent dataset using a Student t-test. As in Newman et al. (1989), the significance level was set to 0.1. The percentage of times that the p-value was less than 0.1 was calculated for each cut-off level, each replacement technique, and each combination of cut-off level and replacement technique.

The extrapolation method calculates the mean and standard deviation from the regression line of the noncensored values against the inverse cumulative normal distribution of a function of their rank scores. To perform this method, first rank-order all the values, including those that are censored. Given n samples of which k are censored, calculate the Blom estimate of the normal score for each of the n-k noncensored values:

   A_i = F^-1((i - 3/8) / (n + 1/4))

where F^-1 is the inverse cumulative normal distribution function, i is the rank of the value X_i, and n is the size of the sample.
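Independently of SAS, the whole extrapolation calculation can be sketched in Python (function and variable names are illustrative; ties are ignored here, unlike PROC RANK). The censored observations contribute only their count to the ranking, and the detects are regressed on their Blom scores:

```python
from statistics import NormalDist

def extrapolation_estimates(values, lod):
    """Estimate mean and SD by regressing the detected (above-LOD) values
    on their Blom normal scores A_i = F^-1((i - 3/8)/(n + 1/4))."""
    n = len(values)
    detects = sorted(v for v in values if v >= lod)
    k = n - len(detects)              # number of censored observations
    inv = NormalDist().inv_cdf        # F^-1, inverse cumulative normal
    # Detects occupy ranks k+1 .. n in the full ordering of all n values
    scores = [inv((i - 0.375) / (n + 0.25)) for i in range(k + 1, n + 1)]
    # Ordinary least squares of X on A: intercept -> mean, slope -> SD
    m = len(detects)
    mean_a = sum(scores) / m
    mean_x = sum(detects) / m
    sxx = sum((a - mean_a) ** 2 for a in scores)
    sxy = sum((a - mean_a) * (x - mean_x) for a, x in zip(scores, detects))
    slope = sxy / sxx
    intercept = mean_x - slope * mean_a
    return intercept, slope           # (estimated mean, estimated SD)
```

On simulated normal data censored at the 25th percentile, the intercept and slope recover the generating mean and standard deviation closely.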
In SAS this translates to:

   proc rank data=one out=two normal=blom;
      var x;
      ranks a;
   run;

Then regress the X values against A using a simple linear regression model, X_i = a + b*A_i + e_i. In SAS use the following code:

   proc reg data=two outest=stat2;
      model x = a;
   run;

The mean of the noncensored distribution is estimated by the intercept, and the standard deviation is estimated by the slope of the line.

The maximum likelihood estimates (MLE) of the mean and standard deviation are obtained by iteratively solving the following system of equations:

   µ = m - σ(k/(n-k)) · f(ε)/F(ε)
   σ² = [S² + (m - µ)²] / [1 + ε(k/(n-k)) · f(ε)/F(ε)]
   ε = (LOD - µ)/σ

where LOD is the limit of detection, f is the density function of the normal distribution, F is the cumulative distribution function of the normal distribution, m is the mean of all values above the LOD, and S is the population standard deviation of all values above the LOD. In SAS this is done using PROC LIFEREG:

   proc lifereg data=ttest1 outest=stat3;
      model (lower, y) = / d=normal;
   run;

where, in a preceding DATA step:

   if x > cuts then lower = x;
   else lower = .;
   y = (x > cuts)*x + (x <= cuts)*cuts;

(Y equals X when X is greater than the cut-off; otherwise Y is set to the cut-off value.)

The methods are compared using two error measurements. The first is the percentage of times the derived mean is significantly different from the true mean using a Student t-test at the 0.1 level. The second is the relative difference between the derived and true means, averaged over all the distribution parameters.

FINDINGS

If only a small percentage of the values has been censored, little bias is introduced by any of the replacement techniques (Table 1). At the 5th percentile cut-off, the derived mean is never significantly different from the true mean for any of the replacement techniques. At the 10th percentile, some significant differences are observed. The largest error rate (8%) is for replacement with zero with the parameter combination N=1000, mean=4, and standard deviation=1.5.
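The MLE system given in the METHOD section lends itself to fixed-point iteration. A Python sketch of that iteration (illustrative only: PROC LIFEREG maximizes the censored-normal likelihood directly rather than iterating these equations, and the damping factor and guard below are stabilization choices of this sketch, not part of the paper's method):

```python
from math import erf, exp, pi, sqrt

def _pdf(z):  # standard normal density f
    return exp(-z * z / 2.0) / sqrt(2.0 * pi)

def _cdf(z):  # standard normal cumulative distribution F
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def mle_estimates(values, lod, tol=1e-10, max_iter=2000):
    """Iteratively solve
        mu      = m - sigma*(k/(n-k))*f(eps)/F(eps)
        sigma^2 = [S^2 + (m - mu)^2] / [1 + eps*(k/(n-k))*f(eps)/F(eps)]
        eps     = (LOD - mu)/sigma
    where m and S are the mean and population SD of the detects."""
    detects = [v for v in values if v >= lod]
    n, k = len(values), len(values) - len(detects)
    m = sum(detects) / len(detects)
    s2 = sum((x - m) ** 2 for x in detects) / len(detects)  # population variance
    mu, sigma = m, sqrt(s2)                                 # starting values
    for _ in range(max_iter):
        eps = (lod - mu) / sigma
        ratio = (k / (n - k)) * _pdf(eps) / _cdf(eps)
        mu_new = m - sigma * ratio
        denom = 1.0 + eps * ratio
        if denom <= 0.0:           # guard against overshoot early in the iteration
            denom = 1e-6
        sigma_new = sqrt((s2 + (m - mu_new) ** 2) / denom)
        # Damped update for stability (same fixed point)
        mu_next = mu + 0.5 * (mu_new - mu)
        sigma_next = sigma + 0.5 * (sigma_new - sigma)
        if abs(mu_next - mu) < tol and abs(sigma_next - sigma) < tol:
            mu, sigma = mu_next, sigma_next
            break
        mu, sigma = mu_next, sigma_next
    return mu, sigma
```

At the true parameters the two equations balance exactly (the truncated-normal mean and variance identities), so the iteration settles near the generating mean and standard deviation on simulated censored data.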
Conversely, if the percentage of censored values is large, none of these replacement techniques is adequate. The percentage of times that the true mean is significantly different from the derived mean ranges up to 100% over all the methods when the 75th percentile is used as the cut-off. Even at the 50th percentile, the error rate is high for some combinations of sample size, mean, and standard deviation. For N=1000, mean=50, and standard deviation=5, the error rate is 100%, indicating that all of the derived means are significantly different from the true mean. A difference among the replacement techniques is also obvious at the 25th percentile cut-off level. Replacement with zero has the highest error rate, 33.33%; replacement with the limit of detection divided by the square root of 2 has an error rate of zero. The overall error rate is 15%.
Table 1: Percentage of Student t-test p-values less than 0.10 for the replacement methods

There is a difference in the overall error rate for the different methods of replacement. Replacement with the limit of detection divided by the square root of two has the smallest overall error rate (21%), while replacement with zero has almost twice that overall error rate (41%). The error rate is affected by the choice of mean and standard deviation: the mean of 10 and standard deviation of 5 produce the smallest overall error rate, 14%. Further investigation into the relationship between the distribution parameters and the error rate is needed to assess the cause of this effect and how to interpret it.

The extrapolation method (Table 2) is a marked improvement over the simple replacement techniques. The largest error rate is 18%, at the 75th percentile cut-off level. This method is not as sensitive to the mean and standard deviation of the original distribution. The overall error rate of 1.24% is markedly different from the rate of 20.90% for the best of the replacement techniques, LOD/√2.

Table 2: Percentage of Student t-test p-values less than 0.10 for the extrapolation method

The method of maximum likelihood estimation (Table 3) has the best overall error rate (0.7%). Only at the 75th percentile cut-off level does the error rate differ from zero. There is, however, an effect of the choice of mean and standard deviation on the error rate, as with the replacement techniques. The highest error rate observed is 11%.
Table 3: Percentage of Student t-test p-values less than 0.10 for the maximum likelihood estimate

Bias can also be measured with the percent relative difference between the derived and true means. Using this measure, the effect of the choice of parameters is removed. There is little
difference between the extrapolation and the MLE techniques (Table 4). The replacement techniques performed worse than the computational methods, with replacement by zero having the largest relative difference (Figure 1). The LOD/√2 replacement has the smallest relative difference of all the replacement methods for these data.

Table 4: Percent relative difference between derived and true means

Figure 1: Percent relative difference between the derived and true means

In addition, the extrapolation technique is more intuitive and, thus, more transparent. The extrapolation technique therefore has some advantages over the maximum likelihood technique.

REFERENCES

Cohen AC, Jr. Simplified estimators for the normal distribution when samples are singly censored or truncated. Technometrics 1, 1959.

Finkelstein MM, Verma DK. Exposure estimation in the presence of nondetectable values: another look. AIHAJ 62, 2001.

Gleit A. Estimation for small normal data sets with detection limits. Environ Sci Technol 19, 1985.

Helsel DR, Gilliom RJ. Estimation of distributional parameters for censored trace level water quality data: 2. Verification and applications. Water Resources J 22(2), 1986.

Helsel DR. Less than obvious: statistical treatment of data below the detection limit. Environ Sci Technol 24(12), 1990.

Hornung RW, Reed LD. Estimation of average concentration in the presence of nondetectable values. Appl Occup Environ Hyg 5(1):46-51, 1990.

Lambert D, Peterson B, Terpenning I. Nondetects, detection limits, and the probability of detection. J Am Stat Assoc 86(414), 1991.

Lyles RH, Fan D, Chuachoowong R. Correlation coefficient estimation involving a left censored laboratory assay variable. Statist Med 20, 2001.

Newman MC, Dixon PM, Looney BB, Pinder JE. Estimating mean and variance for environmental samples with below detection limit observations.
Water Resources Bulletin 25(4), 1989.

Özkaynak H, Xue J, Butler DA, Haroun LA, MacDonell MM, Fingleton DJ. Addressing data heterogeneity: lessons learned from a multimedia risk assessment. Presented at the 84th Annual Meeting of the Air and Waste Management Association, Vancouver, BC, 1991.

Taylor JK. Quality Assurance of Chemical Measurements. Chelsea, MI: Lewis Publishers, 1987.

CONCLUSION

Normally distributed datasets were generated with three different mean and standard deviation combinations. These data were censored from 5 to 75 percent to test some common techniques for handling left-censored data. Simple substitution techniques are adequate if only a small percentage of the values has been censored; the error rate and the relative difference in the means both increase once the percentage of censored values reaches 25%. Replacement with LOD/√2 appears to be the best choice of replacement value, although the reason for the large difference between the LOD/2 and LOD/√2 estimates is not readily apparent. The extrapolation and maximum likelihood techniques produce estimates with smaller bias and error rates than the replacement techniques, and their error rates are less affected by the distribution parameters. Overall, the maximum likelihood technique appears to have worked best. Since the maximum likelihood technique is an iterative method, there are situations where there may be no solution; the extrapolation technique, by contrast, has a closed-form solution.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Author Name: Carry W. Croghan
Company: US-EPA
Address: HEASD (MD E205-1)
City, State, ZIP: RTP, NC
Work Phone: (919)
Fax: (919)
[email protected]

This paper has been reviewed in accordance with the United States Environmental Protection Agency's peer and administrative review policies and approved for presentation and publication.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
ATTACHED CODE

***************************;
proc means data=ttest1 noprint;
   by group;
   var below;
   output out=stat1 n=n sum=k;
run;
***************************;
proc rank data=two out=temp2n normal=blom;
   var x;
   ranks prob;
run;
*****************************************;
proc reg data=temp3 outest=stat2 noprint;
   by group;
   model x = prob;
run;
*****************************************;
*** Since there are only a mean and std and not data,
    the t-test is calculated within a DATA step. ***;
data testit;
   merge stat2 stat1 (keep=group n k) end=eof;
   replace = substr(group, 1, 1);
   level   = substr(group, 2);
   ** Retain the values from the first record. This is the parent
      distribution to which all others are compared. **;
   retain mean1 s1 n1 ss1;
   if _n_ = 1 then do;
      mean1 = intercept;
      s1    = prob**2;
      n1    = n;
      ss1   = s1*(n1 - 1);
   end;
   else do;
      mean2 = intercept;
      s2    = prob**2;
      n2    = n;
      ss2   = s2*(n2 - 1);
      sp    = (ss1 + ss2) / (n1 + n2 - 2);  * pooled variance;
      *** Student t ***;
      t = (mean1 - mean2) / sqrt(sp/n1 + sp/n2);
      p = 2*(1 - probt(abs(t), n1 + n2 - 2));  * two-sided p-value;
      ** Relative difference in the means **;
      bias = (mean2 - mean1) / mean1;
   end;
run;
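For checking results outside SAS, the summary-statistics t-test in the DATA step above can be reproduced in a few lines of Python. This is a sketch of the same arithmetic: the pooled quantity sp is a variance, so the denominator uses sp/n, and the p-value would come from the t distribution (PROBT in SAS):

```python
from math import sqrt

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample Student t statistic from summary statistics,
    mirroring the DATA step: pooled variance from the two sums of squares."""
    ss1 = sd1 ** 2 * (n1 - 1)
    ss2 = sd2 ** 2 * (n2 - 1)
    sp = (ss1 + ss2) / (n1 + n2 - 2)       # pooled variance
    t = (mean1 - mean2) / sqrt(sp / n1 + sp / n2)
    return t, n1 + n2 - 2                  # p-value via the t distribution with this df

t, df = pooled_t(10.0, 1.5, 1000, 9.9, 1.5, 1000)
bias = (9.9 - 10.0) / 10.0                 # relative difference in the means
```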
ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2
University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages
Analytical Test Method Validation Report Template
Analytical Test Method Validation Report Template 1. Purpose The purpose of this Validation Summary Report is to summarize the finding of the validation of test method Determination of, following Validation
Rockefeller College University at Albany
Rockefeller College University at Albany PAD 705 Handout: Hypothesis Testing on Multiple Parameters In many cases we may wish to know whether two or more variables are jointly significant in a regression.
Method Validation/Verification. CAP/CLIA regulated methods at Texas Department of State Health Services Laboratory
Method Validation/Verification CAP/CLIA regulated methods at Texas Department of State Health Services Laboratory References Westgard J. O.: Basic Method Validation, Westgard Quality Corporation Sarewitz
Least-Squares Intersection of Lines
Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a
Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest
Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t
1.5 Oneway Analysis of Variance
Statistics: Rosie Cornish. 200. 1.5 Oneway Analysis of Variance 1 Introduction Oneway analysis of variance (ANOVA) is used to compare several means. This method is often used in scientific or medical experiments
Lecture 14: GLM Estimation and Logistic Regression
Lecture 14: GLM Estimation and Logistic Regression Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South
DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.
DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,
Economics of Strategy (ECON 4550) Maymester 2015 Applications of Regression Analysis
Economics of Strategy (ECON 4550) Maymester 015 Applications of Regression Analysis Reading: ACME Clinic (ECON 4550 Coursepak, Page 47) and Big Suzy s Snack Cakes (ECON 4550 Coursepak, Page 51) Definitions
Estimation of σ 2, the variance of ɛ
Estimation of σ 2, the variance of ɛ The variance of the errors σ 2 indicates how much observations deviate from the fitted surface. If σ 2 is small, parameters β 0, β 1,..., β k will be reliably estimated
Module 5: Statistical Analysis
Module 5: Statistical Analysis To answer more complex questions using your data, or in statistical terms, to test your hypothesis, you need to use more advanced statistical tests. This module reviews the
QbD Approach to Assay Development and Method Validation
QbD Approach to Assay Development and Method Validation Thomas A. Little Ph.D. 11/105/2014 President Thomas A. Little Consulting 12401 N Wildflower Lane Highland, UT 84003 1-925-285-1847 [email protected]
Lin s Concordance Correlation Coefficient
NSS Statistical Software NSS.com hapter 30 Lin s oncordance orrelation oefficient Introduction This procedure calculates Lin s concordance correlation coefficient ( ) from a set of bivariate data. The
Chapter 6: Multivariate Cointegration Analysis
Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration
Guidance for Industry
Guidance for Industry Q2B Validation of Analytical Procedures: Methodology November 1996 ICH Guidance for Industry Q2B Validation of Analytical Procedures: Methodology Additional copies are available from:
17. SIMPLE LINEAR REGRESSION II
17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.
Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.
Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged
Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc.
Paper 264-26 Model Fitting in PROC GENMOD Jean G. Orelien, Analytical Sciences, Inc. Abstract: There are several procedures in the SAS System for statistical modeling. Most statisticians who use the SAS
Ridge Regression. Patrick Breheny. September 1. Ridge regression Selection of λ Ridge regression in R/SAS
Ridge Regression Patrick Breheny September 1 Patrick Breheny BST 764: Applied Statistical Modeling 1/22 Ridge regression: Definition Definition and solution Properties As mentioned in the previous lecture,
2. Linear regression with multiple regressors
2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions
Discussion Section 4 ECON 139/239 2010 Summer Term II
Discussion Section 4 ECON 139/239 2010 Summer Term II 1. Let s use the CollegeDistance.csv data again. (a) An education advocacy group argues that, on average, a person s educational attainment would increase
Introduction to proc glm
Lab 7: Proc GLM and one-way ANOVA STT 422: Summer, 2004 Vince Melfi SAS has several procedures for analysis of variance models, including proc anova, proc glm, proc varcomp, and proc mixed. We mainly will
Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:
Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours
CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13
COMMON DESCRIPTIVE STATISTICS / 13 CHAPTER THREE COMMON DESCRIPTIVE STATISTICS The analysis of data begins with descriptive statistics such as the mean, median, mode, range, standard deviation, variance,
Concepts of Experimental Design
Design Institute for Six Sigma A SAS White Paper Table of Contents Introduction...1 Basic Concepts... 1 Designing an Experiment... 2 Write Down Research Problem and Questions... 2 Define Population...
Exercise 1.12 (Pg. 22-23)
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1. Page 1 of 11. EduPristine CMA - Part I
Index Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques... 1 EduPristine CMA - Part I Page 1 of 11 Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting
Geostatistics Exploratory Analysis
Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras [email protected]
MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)
MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part
Nonlinear Regression Functions. SW Ch 8 1/54/
Nonlinear Regression Functions SW Ch 8 1/54/ The TestScore STR relation looks linear (maybe) SW Ch 8 2/54/ But the TestScore Income relation looks nonlinear... SW Ch 8 3/54/ Nonlinear Regression General
Problem of Missing Data
VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;
Logs Transformation in a Regression Equation
Fall, 2001 1 Logs as the Predictor Logs Transformation in a Regression Equation The interpretation of the slope and intercept in a regression change when the predictor (X) is put on a log scale. In this
Investment Statistics: Definitions & Formulas
Investment Statistics: Definitions & Formulas The following are brief descriptions and formulas for the various statistics and calculations available within the ease Analytics system. Unless stated otherwise,
Chapter 1 Introduction. 1.1 Introduction
Chapter 1 Introduction 1.1 Introduction 1 1.2 What Is a Monte Carlo Study? 2 1.2.1 Simulating the Rolling of Two Dice 2 1.3 Why Is Monte Carlo Simulation Often Necessary? 4 1.4 What Are Some Typical Situations
LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE
LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE MAT 119 STATISTICS AND ELEMENTARY ALGEBRA 5 Lecture Hours, 2 Lab Hours, 3 Credits Pre-
Determination of g using a spring
INTRODUCTION UNIVERSITY OF SURREY DEPARTMENT OF PHYSICS Level 1 Laboratory: Introduction Experiment Determination of g using a spring This experiment is designed to get you confident in using the quantitative
Multiple Regression: What Is It?
Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in
Standard Deviation Estimator
CSS.com Chapter 905 Standard Deviation Estimator Introduction Even though it is not of primary interest, an estimate of the standard deviation (SD) is needed when calculating the power or sample size of
Chapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
Assay Development and Method Validation Essentials
Assay Development and Method Validation Essentials Thomas A. Little Ph.D. 10/13/2012 President Thomas A. Little Consulting 12401 N Wildflower Lane Highland, UT 84003 1-925-285-1847 [email protected]
Variables Control Charts
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. Variables
Paper PO06. Randomization in Clinical Trial Studies
Paper PO06 Randomization in Clinical Trial Studies David Shen, WCI, Inc. Zaizai Lu, AstraZeneca Pharmaceuticals ABSTRACT Randomization is of central importance in clinical trials. It prevents selection
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS
MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance
Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition
Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Online Learning Centre Technology Step-by-Step - Excel Microsoft Excel is a spreadsheet software application
A Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
