Proficiency Testing with FAPAS: Understanding PT Statistics
Ken Mathieson, Senior Proficiency Analyst
Statistics?!
Where do you want to go?
- If you don't know where you are going, you won't get there!
- The International Harmonised Protocol for the Proficiency Testing of Analytical Chemistry Laboratories [1]
- The point of a PT is to give its participants a performance assessment.
- Against what standard will such a performance assessment be made?
Fitness-for-Purpose
- Fitness-for-purpose is at the heart of the statistical model used by FAPAS.
- Definition: a simple expression that takes lots of words, much arm waving and at least one diagram to explain but, once grasped, is a very succinct way of conveying the concept of fitness-for-purpose.
Making Data Usable
- All analyses are variable: you never get the same answer twice.
- The end use of the data should dictate the limits of acceptable variability.
- In simple terms, the more effort you put into the analysis, the lower the uncertainty of the final answer.
- In contrast, the greater the effort, the greater the cost (time = money!).
Fitness-for-Purpose Illustrated
[Figure: uncertainty and time against effort expended.]
Fitness-for-Purpose Quantified
- Fitness-for-purpose represents reasonable uncertainty: a tolerance on a result that is small enough to make the data meaningful and useful.
- To put it another way, it represents a range or spread of possible results.
- Statistically, it is a standard deviation, σp: the standard deviation for proficiency assessment.
Standard Deviation for Proficiency Assessment
- FAPAS sets σp using data external to the observed performance, i.e. σp is prescriptive, NOT descriptive.
- Not all PT schemes do this.
[Figure: two example plots; only the axis ticks (898 to 902 and 850 to 950) survive.]
Sources of σp
- The inter-laboratory variance (reproducibility) from a method validation study is an indicator of best practice: sd_R, RSD_R, R.
- Predictive models, e.g. the modified Horwitz equation [2].
- Expert judgement of what is fit-for-purpose.
Original Horwitz Equation
$\sigma = 0.02\,c^{0.8495}$
[Figure: RSD (%) against concentration for the original Horwitz equation, from 1 ppb upwards.]
Modification Below 120 ppb
$\sigma = 0.22\,c$
[Figure: RSD (%) against concentration, 1 ppb to 200 ppb, comparing Horwitz (<120 ppb) with Horwitz (original).]
Modification Above 13.8%
$\sigma = 0.01\,c^{0.5}$
[Figure: RSD (%) against concentration, 1% to 50%, comparing Horwitz (original) with Horwitz (>13.8%).]
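
The three Horwitz regions above can be combined into one function. The following is a minimal sketch, assuming that c and σ are both expressed as dimensionless mass fractions (so 120 ppb is 1.2e-7 and 13.8% is 0.138); the function name `horwitz_sigma` is illustrative and not part of any FAPAS software.

```python
def horwitz_sigma(c: float) -> float:
    """Predicted standard deviation from the modified Horwitz model.

    Assumption: c and the returned sigma are mass fractions
    (1 ppm = 1e-6, 1 % = 0.01).
    """
    if c <= 0:
        raise ValueError("concentration must be positive")
    if c < 1.2e-7:            # below 120 ppb: constant 22 % RSD
        return 0.22 * c
    if c <= 0.138:            # 120 ppb to 13.8 %: original Horwitz equation
        return 0.02 * c ** 0.8495
    return 0.01 * c ** 0.5    # above 13.8 %


def horwitz_rsd_percent(c: float) -> float:
    """Relative standard deviation in percent at mass fraction c."""
    return 100.0 * horwitz_sigma(c) / c
```

For example, `horwitz_rsd_percent(1e-6)` gives about 16% at 1 ppm, and the three branches join almost continuously at the two boundaries.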
Are You Sitting Comfortably? Then I'll begin
Homogeneity Testing
- All test materials are heterogeneous.
- What we want is sufficient homogeneity.
- In other words, the differences between individual test portions must not be large enough to materially affect the outcome of the PT.
- Otherwise the results would simply reflect the many levels in the test portions and not the accuracy of the participating lab.
Testing for sufficient homogeneity - 1
- The statistical test is based around ANOVA, typically using the results from 10 samples analysed in duplicate.
- For the statistics to be meaningful, the results have to be obtained under specified conditions (random selection, random analytical order, same time, etc.), i.e. under repeatability conditions.
Testing for sufficient homogeneity - 2
- It is NOT just a comparison of the variance within and between pairs.
- Experience has shown that this type of simple test is limited:
  - over-sensitive when repeatability is very good
  - under-sensitive when repeatability is poor
Testing for sufficient homogeneity - 3
[Figure: two plots of duplicate results (rep 1, rep 2) by sample number, results spanning about 9 to 11.]
Testing for sufficient homogeneity - 4
- FAPAS uses a more sophisticated protocol: Fearn and Thompson [3].
- It seeks to reject material that displays heterogeneity above a set limit; the limit is derived from fitness-for-purpose.
- A fully worked example is given in Prof. Thompson's published paper.
Testing for sufficient homogeneity - 5
- Assume homogeneity: this is reasonable, since lots of time and effort goes into making test materials.
- Scrutinise the results for any obvious problems, e.g. trends or possible outliers.
- Use Cochran's test to formally check the variance of the worst pair.
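
As an illustration of the Cochran check on duplicate results, here is a minimal sketch. For duplicates the statistic reduces to the largest squared difference within a pair divided by the sum of all squared differences. The function name and the data layout (a list of `(rep1, rep2)` tuples) are assumptions made for this example; the critical value depends on the number of samples and the chosen confidence level, so it must be looked up in a published table and is deliberately not hard-coded here.

```python
def cochran_statistic(pairs):
    """Cochran's C for duplicate results.

    pairs: list of (rep1, rep2) tuples, one per sample (assumed layout).
    Returns (C, index_of_worst_pair).
    """
    squared_diffs = [(a - b) ** 2 for a, b in pairs]
    total = sum(squared_diffs)
    if total == 0:
        raise ValueError("all duplicates are identical; C is undefined")
    worst = max(range(len(pairs)), key=lambda i: squared_diffs[i])
    return squared_diffs[worst] / total, worst


def worst_pair_is_outlier(pairs, critical_value):
    """True if C exceeds the tabulated critical value supplied by the caller."""
    c_stat, _ = cochran_statistic(pairs)
    return c_stat > critical_value
```

A pair flagged this way is a candidate analytical outlier, to be investigated before the ANOVA is carried out.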
Testing for sufficient homogeneity - 6
[Figure: two plots of duplicate results (rep 1, rep 2) by sample number, one spanning about 13 to 18 and the other about 150 to 250.]
Testing for sufficient homogeneity - 7
- Carry out ANOVA to obtain the analytical variance, $s_{an}^2$, and the sampling variance, $s_{sam}^2$.
- Calculate the allowable sampling variance, $\sigma_{all}^2 = (0.3\,\sigma_p)^2$.
- Calculate the critical value, $c = F_1\,\sigma_{all}^2 + F_2\,s_{an}^2$ ($F_1$ and $F_2$ from a given table).
- If $s_{sam}^2 > c$, the test indicates a lack of sufficient homogeneity.
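
The calculation can be sketched for duplicate results as follows. This is a minimal illustration, not FAPAS's own implementation: it assumes the usual one-way ANOVA shortcuts for duplicates (analytical variance from the within-pair differences, sampling variance from the variance of the pair sums) and takes the allowable sampling standard deviation as 0.3σp, so the allowable variance is (0.3σp)². The factors `f1` and `f2` must be read from the published table [3] for the number of samples used and are passed in by the caller; the function name and the returned dictionary are illustrative.

```python
from statistics import variance


def sufficient_homogeneity(pairs, sigma_p, f1, f2):
    """Homogeneity check on duplicate results.

    pairs:   list of (rep1, rep2) tuples, one per sample (assumed layout)
    sigma_p: standard deviation for proficiency assessment
    f1, f2:  tabulated factors for this number of samples (caller supplies)
    """
    m = len(pairs)
    if m < 2:
        raise ValueError("need at least two samples")
    # Analytical (within-pair) variance from the duplicate differences.
    s_an2 = sum((a - b) ** 2 for a, b in pairs) / (2 * m)
    # Variance of the pair sums; half of it is the between-sample mean square.
    v_sums = variance([a + b for a, b in pairs])
    # Sampling (between-sample) variance, floored at zero.
    s_sam2 = max(0.0, (v_sums / 2 - s_an2) / 2)
    sigma_all2 = (0.3 * sigma_p) ** 2
    critical = f1 * sigma_all2 + f2 * s_an2
    return {
        "s_an2": s_an2,
        "s_sam2": s_sam2,
        "critical": critical,
        "sufficient": s_sam2 <= critical,
    }
```

Because the limit is built from σp, the same set of duplicates can pass for one analyte and purpose and fail for another with a tighter σp.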
Testing for sufficient homogeneity - 8
[Figure: duplicate results (rep 1, rep 2) by sample number, results spanning about 5 to 6.8.]
The Assigned Value
- Note: the assigned value, not the true value; the true value is an ideal we'll never know.
- The use of the word "assigned" indicates we are setting the value.
- The assigned value should be the best estimate of the true value.
Deriving the Assigned Value
- FAPAS usually derives the assigned value from the consensus of submitted results; other options are a certified reference value or a formulation value.
- The most appropriate measure of central tendency is used: robust mean [4, 5], median or mode [6], but not necessarily in that order.
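
To show what a robust mean does, here is a minimal sketch of a Huber-type iteration in the style of Algorithm A in ISO 13528 [5]: start from the median and the scaled median absolute deviation, pull extreme values in to 1.5 robust standard deviations from the current estimate, and repeat until the estimates settle. Whether FAPAS uses exactly this variant is an assumption; the function name, tolerance and iteration cap are illustrative choices.

```python
from statistics import mean, median, stdev


def robust_mean_sd(values, tol=1e-6, max_iter=100):
    """Huber-type robust mean and standard deviation (Algorithm A style)."""
    values = list(values)
    if len(values) < 2:
        raise ValueError("need at least two results")
    x = median(values)
    s = 1.483 * median([abs(v - x) for v in values])
    for _ in range(max_iter):
        delta = 1.5 * s
        # Winsorise: clamp each result into [x - delta, x + delta].
        clamped = [min(max(v, x - delta), x + delta) for v in values]
        new_x = mean(clamped)
        new_s = 1.134 * stdev(clamped)
        converged = abs(new_x - x) <= tol and abs(new_s - s) <= tol
        x, s = new_x, new_s
        if converged:
            break
    return x, s
```

Outliers are down-weighted rather than rejected, so a single extreme result shifts the robust mean far less than it shifts the simple mean.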
Simple vs Robust Mean
Descriptive statistics, variable: ass. value
Anderson-Darling normality test: A-squared = 2.450, P-value = 0.000
Mean = 8.11113, StDev = 3.06012, Variance = 9.36431, Skewness = 2.08479, Kurtosis = 7.53429, N = 61
Robust mean = 7.82879
Minimum = 3.4700, 1st quartile = 6.5900, Median = 7.8100, 3rd quartile = 9.0000, Maximum = 21.0000
95% confidence interval for mu: 7.3274 to 8.8949
95% confidence interval for sigma: 2.5972 to 3.7255
95% confidence interval for median: 7.4405 to 8.2757
[Figure: histogram of the results (axis 3 to 19) with confidence-interval plots for mu and the median.]
Limitations of a Robust Mean / Median
Descriptive statistics, variable: afm1
Anderson-Darling normality test: A-squared = 5.339, P-value = 0.000
Mean = 0.248691, StDev = 0.314717, Variance = 9.90E-02, Skewness = 2.78245, Kurtosis = 8.74819, N = 46
Robust mean = 0.184686
Minimum = 0.01700, 1st quartile = 0.08100, Median = 0.11900, 3rd quartile = 0.29500, Maximum = 1.60000
95% confidence interval for mu: 0.15523 to 0.34215
95% confidence interval for sigma: 0.26104 to 0.39639
95% confidence interval for median: 0.08998 to 0.17190
[Figure: histogram of the results (axis 0.0 to 1.6) with confidence-interval plots for mu and the median.]
Bump-hunting
[Figure: adaptive kernel density plot for afm1, density against analytical result (0.0 to 1.0); mode = 0.087012.]
Identifying Poor Methodology
[Figure: adaptive kernel density plot for chloride, density against analytical result (500 to 1500).]
Poor or Just Different Performance?
[Figure: adaptive kernel density plot for "peanut prote", density against analytical result (0 to 20).]
z-scores (at last!)
- This is a score that compares a participant's result to the assigned value (our best estimate of the true value): x - X
- It then standardises the difference against a measure of acceptable analytical variation: (x - X)/sd
More formally
$$z = \frac{x - \hat{X}}{\sigma_p}$$
where:
- $x$ = the participant's result
- $\hat{X}$ = the assigned value
- $\sigma_p$ = the standard deviation for proficiency assessment
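
As a direct transcription of the formula, a minimal sketch (the function and argument names are illustrative):

```python
def z_score(x: float, assigned: float, sigma_p: float) -> float:
    """z = (participant's result - assigned value) / sigma_p."""
    if sigma_p <= 0:
        raise ValueError("sigma_p must be positive")
    return (x - assigned) / sigma_p
```

For example, a result of 10.5 against an assigned value of 10.0 with σp = 0.25 gives `z_score(10.5, 10.0, 0.25) == 2.0`.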
Non-Normal Distributions - 1
- z-scores rely on the results being normally distributed.
- Microbiological results are known to be non-normally distributed (Poisson distribution).
- A log-transformation is applied prior to calculating z-scores.
Non-Normal Distributions - 2
- GeMMA PT results are invariably skewed, with a long tail to the high end.
- A review [7] of two GM schemes, commissioned by the UK Food Standards Agency, confirmed:
  - the non-normal distributions
  - log-transformation prior to calculating z-scores as the most appropriate way to treat the results
Non-Normal Distributions - 3
More formally
$$z = \frac{\log_{10} x - \log_{10} \hat{X}}{\sigma_p}$$
where:
- $x$ = the participant's result
- $\hat{X}$ = the assigned value
- $\sigma_p$ = the standard deviation for proficiency assessment, expressed in log10 units
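
The log-transformed score can be sketched in the same way. This assumes, as the slide states, that σp is already expressed in log10 units; the function and argument names are illustrative.

```python
import math


def log_z_score(x: float, assigned: float, sigma_p_log10: float) -> float:
    """z-score on log10-transformed values; sigma_p is in log10 units."""
    if x <= 0 or assigned <= 0:
        raise ValueError("log10 needs strictly positive values")
    if sigma_p_log10 <= 0:
        raise ValueError("sigma_p must be positive")
    return (math.log10(x) - math.log10(assigned)) / sigma_p_log10
```

On this scale a result at twice the assigned value and a result at half the assigned value receive scores of equal size and opposite sign, which suits distributions with a long tail to the high end.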
Understanding z-scores
- z-scores embody the concept of fitness for purpose.
- If the level of the determinand and/or the allowable variation around this level are inappropriate for your work, your z-scores have no worth.
- E.g. oil content of soya beans: assaying this to determine commercial value is not the same as checking the oil content for nutritional purposes.
Interpreting z-scores - 1
- z-scores look simple, but they are statistics and, as with any statistic, interpretation requires experience.
- Such experience gives you the edge with your managers, competitors, customers and accreditation assessors.
Interpreting z-scores - 2
- Superficially, z-scores can be interpreted as:
  - |z| <= 2: satisfactory
  - 2 < |z| <= 3: questionable
  - |z| > 3: unsatisfactory
- However, there is more to it! You must consider the probabilities: about 1 in 20 perfectly good results will score |z| > 2, simply because they come from the edge of the distribution!
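
The bands and the "1 in 20" figure can be checked with a short sketch. It assumes the bands apply to the absolute value of z and that results from a well-behaved laboratory follow a standard normal distribution; the function names are illustrative.

```python
import math


def classify(z: float) -> str:
    """Conventional interpretation bands, applied to |z|."""
    size = abs(z)
    if size <= 2:
        return "satisfactory"
    if size <= 3:
        return "questionable"
    return "unsatisfactory"


def prob_abs_z_exceeds(limit: float) -> float:
    """P(|Z| > limit) for a standard normal Z (two-sided tail)."""
    return math.erfc(limit / math.sqrt(2))
```

`prob_abs_z_exceeds(2)` is about 0.0455, roughly 1 in 20, while `prob_abs_z_exceeds(3)` is about 0.0027, roughly 1 in 370. A single questionable score is therefore weak evidence on its own; a run of them is not.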
Interpreting z-scores - 3
[Figure: distribution with the axis marked at the mean and at ±1, ±2, ±3 and ±4 standard deviations.]
Your z-score
What is fit for YOUR purpose?
References
[1] M. Thompson, S. Ellison and R. Wood, The International Harmonised Protocol for the Proficiency Testing of (Chemical) Analytical Laboratories, Pure Appl. Chem., 2006, 78(1), 145-196. http://www.iupac.org/publications/pac/2006/pdf/7801x0145.pdf
[2] M. Thompson, Recent trends in inter-laboratory precision at ppb and sub-ppb concentrations in relation to fitness for purpose criteria in proficiency testing, Analyst, 2000, 125, 385-386.
[3] T. Fearn and M. Thompson, A New Test for Sufficient Homogeneity, Analyst, 2001, 126, 1414-1417.
[4] Analytical Methods Committee, Robust Statistics: How Not to Reject Outliers, Part 1. Basic Concepts, Analyst, 1989, 114, 1693-1697.
[5] ISO 13528:2005, Statistical methods for use in proficiency testing by interlaboratory comparisons, Annex C.
[6] P.J. Lowthian and M. Thompson, Bump-hunting for the proficiency tester: searching for multimodality, Analyst, 2002, 127, 1359-1364.
[7] M. Thompson et al., Scoring in GMO Proficiency Tests based on log-transformed results, J. AOAC Int., 2006, 89(1), 232-239.