Multivariate Data Analysis In Practice 5th Edition

An Introduction to Multivariate Data Analysis and Experimental Design

Kim H. Esbensen
Ålborg University, Esbjerg

with contributions from
Dominique Guyot
Frank Westad
Lars P. Houmøller

CAMO Software AS, Nedre Vollgate 8, N-0158 Oslo, NORWAY. Tel: (47) 223 963 00, Fax: (47) 223 963 22
CAMO Software Inc., One Woodbridge Center, Suite 319, Woodbridge, NJ 07095, USA. Tel: (732) 726 9200, Fax: (973) 556 1229
www.camo.com
CAMO Software India Pvt. Ltd., 14 & 15, Krishna Reddy Colony, Domlur Layout, Bangalore - 560 071, INDIA. Tel: (91) 80 4125 4242, Fax: (91) 80 4125 4181

This book was produced using Doc-To-Help together with Microsoft Word. Visio and Excel were used to make some of the illustrations. The screen captures were taken with Paint Shop Pro.

Trademark Acknowledgments
Doc-To-Help is a trademark of WexTech Systems, Inc. Microsoft is a registered trademark and Windows 95, Windows NT, Excel and Word are trademarks of the Microsoft Corporation. Paint Shop Pro is a trademark of JASC, Inc. Visio is a trademark of the Shapeware Corporation.

Information in this book is subject to change without notice. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of CAMO Process AS.

ISBN 82-993330-3-2
1994-2002 CAMO Process AS. All rights reserved.
5th edition. Reprint December 2004

Preface

October 2001

Learning to do multivariate data analysis is in many ways like learning to drive a car: you are not let loose on the road without mandatory training, theoretical and practical, as required by current concern for traffic safety. As a minimum you need to know how a car functions and you need to know the traffic code. On the other hand, everybody would agree that the real practical learning begins only after you have obtained your driver's license. This is when your personal experience really starts to accumulate. There is a strong interaction between the theory absorbed and the practice gained in this secondary, personal training period.

Please substitute "multivariate data analysis" for "driving a car" in all of the above. In this context, too, you are not let out on the data analytical road without mandatory training, theoretical and practical. The analogy is actually very apt!

This book presents a basic theoretical foundation for bilinear (projection-based) multivariate data modeling and gives a conceptual framework for starting to do your own data modeling on the data sets provided. There are some 25 data sets included in this training package. By doing all the exercises included, you're off to a flying start!

Driving your newly acquired multivariate data analysis car is very much an evolutionary process: this introductory textbook is filled with illustrative examples, many practical exercises and a full set of self-examination real-world data analysis problems (with corresponding data sets). If, after all of this, you are able to work confidently on your own applications, you'll have reached the goal set for this book.

This is the 5th revised edition of this book. The first three editions were mainly reprints, the only major change being the inclusion of a completely revised chapter on Introduction to experimental design, which first appeared in the 3rd edition (CAMO).

The 4th revised edition (published March 2000), however, saw many major extensions and improvements:

Text completely rewritten by the senior author, based on five years of extensive use in teaching at both university and dedicated course levels. More than 5,500 copies in use.
30% new theory and text material added, reflecting extensive student response; full integration of PCA, PLS1 and PLS2 NIPALS algorithms and explanations.
Text revised with an augmented self-learning objective throughout.
Four new master data sets added (with extended self-exercise potential):
1. Master violin data (PCA/PLS)
2. Norwegian car dealerships (PCA/PLS)
3. Vintages (PCA/PLS)
4. Acoustic chemometric calibration (PCR/PLS)
Additional chapter on experimental design: new features include mixture designs and D-optimal designs.
New chapter on the powerful, novel Martens Uncertainty Test.
Comprehensive glossary of terms.

This 5th edition also includes essential additional revisions and improvements: Lars P. Houmøller, Ålborg University Esbjerg, has carried out a complete work-through of all demonstrations and exercises. Many of these had not been updated with respect to several of the intervening UNSCRAMBLER software versions. We are happy to have finally eliminated this most frustrating nuisance.

About the authors

Kim H. Esbensen, Ph.D., has more than 20 years of experience in multivariate data analysis and applied chemometrics. He was professor in chemometrics at the Norwegian Telemark Institute of Technology (HIT/TF), Institute of Process Technology (PT), 1995-2001, where he was also head of the Chemometrics Department, Tel-Tek, Telemark Industrial R&D Center, Porsgrunn. Between these institutions he founded ACRG, the Applied Chemometrics Research Group, HIT/TF-Tel-Tek, which among other things hosted SSC6, the 6th Scandinavian Symposium on Chemometrics, in August 1999, as well as numerous other international courses, workshops and meetings. On July 1st, 2001 he moved to a position as research professor in Applied Chemometrics at Ålborg University, Esbjerg, Denmark (AUE), where he is currently leading ACACSRG, the Applied Chemometrics, Analytical Chemistry and Sampling Research Group. As the name implies, applied chemometrics activities continue in Esbjerg while new activities are added, most notably through close collaboration with assoc. prof. Lars P. Houmøller, who independently built up the area of analytical chemistry/chemometrics at AUE before Prof. Esbensen's arrival. Most recently the discipline of sampling (proper sampling) has been added, in recognition of the immense importance of sampling in any data analytical discipline, including chemometrics.

Kim H. Esbensen has published more than 60 papers and technical reports on a wide range of chemical, geochemical, industrial, technological, remote sensing, image analytic and acoustic chemometric applications. Together with Paul Geladi he has been instrumental in co-developing the concept of Multivariate Image Analysis (MIA); with ACRG he pioneered the development of the novel area of acoustic chemometrics. His M.Sc. is from the University of Aarhus, Denmark in 1978 (geology, geochemistry), while a Ph.D. was conferred on him by the Technical University of Denmark (DTH) in 1981 within the areas of metallurgy, meteoritics and multivariate data analysis. He then did post-doctoral work for two years with the Research Group for Chemometrics at the University of Umeå 1980-1981, after which he worked in a Swedish geochemical exploration company, Terra Swede, for two more years. Moving to Norway, this was followed by eight years as data analytical research scientist at the Norwegian Computing Center (NCC), Oslo, after which he became a senior research scientist at SINTEF, the Norwegian Foundation for Industrial and Technological Research, for four additional years. In between these two assignments he was a visiting guest professor at Norsk Hydro's Research Center in Bergen, Norway. He also holds a position as Chercheur associé (now Chercheur affilié) du Centre de Recherche en Géomatique, Université Laval, Quebec. He is a member of the editorial board of Journal of Chemometrics, Wiley Publishers, and is a member of ICS, AGU and several other geological, data analytical and statistical associations.

Dominique Guyot, educated in Statistics, Economics and Biomathematics (ENSAE and Université de Paris 7, France), has 15 years of experience in the field of chemometrics. She gained industrial experience from her work in the pharmaceutical and cosmetic industries before joining CAMO, where she worked from 1995 until 2000. With CAMO, Dominique worked as a Senior Consultant, and was particularly involved in food applications. She put together a practical strategy for efficient product development, based on experimental design and multivariate data analysis. This strategy was implemented in the Guideline + software package, complemented by an integrated training course focusing on multivariate methods for food product developers. Dominique is now studying music and singing at the Conservatoire of Trondheim, Norway.

Frank Westad has an M.Sc. in physical chemistry from the University of Trondheim, Norway. He has 13 years of experience in applied multivariate data analysis, and he completed a Ph.D. in multivariate regression in 2000. Frank has given numerous courses in experimental design and multivariate analysis for companies in Europe and in the U.S.A. His main research fields include variable selection, shift modelling and image analysis.

Lars P. Houmøller has an M.Sc. in chemistry and physics from the University of Aarhus, Denmark. He has 12 years of experience in analytical chemistry and has worked 5-7 years with chemometrics. His teaching experience includes chemometrics, analytical chemistry, spectroscopy, physical chemistry, general and technical chemistry, organic and inorganic chemistry, unit operations and fluid dynamics. His research field covers NIR spectroscopic applications over a very broad industrial spectrum. He also has experience from working in the Danish food production industry.

E-mail interaction with the authors:
Kim Esbensen: kes@aue.auc.dk
Dominique Guyot: dominique.guyot@camo.no
Frank Westad: fwestad@online.no
Lars P. Houmøller: lph@aue.auc.dk

About this book

Since 1986, when CAMO ASA first commercialized and started marketing THE UNSCRAMBLER, many customers have asked for basic, easy-to-understand literature on chemometrics. In 1993 a group of data analysts at different competence levels was invited to a one-day seminar at CAMO, Trondheim, to discuss their experience from both learning and teaching chemometrics. The result was a blueprint outline for what came to be this introductory book: the specifications called for a comprehensive training package, involving basic, practical, easy-to-read, largely non-mathematical theory, with plenty of hands-on examples and exercises on real-world data sets. CAMO contracted SINTEF to write this book (first three editions), and the parties agreed to cooperate on the completion of the entire training package.

In the intervening years, some 4,500 copies of this book were published, and it was used for introductory basic training in some 15 universities and in several hundred industrial companies; reactions were many and largely constructive. We learned a lot from these criticisms; we thank all who contributed!

By 1999 the time was ripe for a complete revision of the entire package. This was undertaken by the senior author in the summer of 1999 with significant assistance from his then Ph.D. student Jun Huang (now with CAMO, Norway); from Frank Westad (Matforsk), who wrote chapter 14 (Martens Uncertainty Test); from Dominique Guyot (CAMO), who wrote the entirely new chapter 17 (Complex Experimental Design Problems); and with further invaluable editorial and managerial contributions from Michael Byström (CAMO) and Valérie Lengard (CAMO). A most sincere thank you goes to Peter Hindmarch (CAMO, UK) for very effective linguistic streamlining of the 4th edition!

The authors and CAMO also take this opportunity to acknowledge Suzanne Schönkopf's (CAMO) contribution to the editions previous to the 4th one.

The present edition of this book still bears the fruit of her very important past efforts.

The publication of the 4th edition, in March 2000, was unfortunately somewhat marred by a less than complete revision of the exercises and illustrative UNSCRAMBLER runs in the book, which was not considered fatal at the time. This soon proved to be a serious mistake; disappointment and frustration from several generations of students, who wanted to follow all the exercises closely, followed rapidly. A Danish university teacher, who had himself experienced this frustration close up when using the book for his own teaching, assoc. prof. Lars P. Houmøller at the University of Ålborg, Esbjerg, voluntarily took it upon himself to carry out a complete work-through of this essential didactic aspect of the book. His very valuable demo and exercise revisions, as well as a very thorough text consistency check, have now been included in toto in the 5th edition.

Today, this book is a collaborative effort between the senior author and CAMO Process AS; the tie with SINTEF is now defunct.

There is little academic glamour in writing an introductory level textbook, as the senior author has well experienced, but that was never the goal anyway. On the other hand, the introductory level is definitely where the largest audience and potential market exist, as CAMO has well experienced. The senior author has used the book for six consecutive years teaching introductory chemometrics, largely to engineering (M.Sc.) students, as well as for extensive course work in industrial and foreign university environments. The response from some 500 students in total has made this author happy, while some 5,500 sales have made CAMO equally satisfied. Thus all is well with the training package!

We hope that this revised 5th edition will continue to meet the challenging demands of the market, hopefully now in an improved form. Writing for precisely this introductory audience/market constitutes the highest scientific and didactic challenge, and is thus (still) irresistible!

Acknowledgements

The authors wish to thank the following persons, institutions and companies for their very valuable help in the preparation of this training package:

Hans Blom, Østlandskonsult AS, Fredrikstad, Norway
Frode Brakstad, Norsk Hydro F-Center, Porsgrunn, Norway
Rolf Carlson, Department of Chemistry, University of Tromsø, Norway
Chevron Research & Technology Co, Richmond, CA, USA
Lennart Eriksson, Dept. of Organic Chemistry, University of Umeå, Sweden (now with Umetrics, Inc.)
Professor Magni Martens, The Royal Veterinary & Agricultural University, Denmark
Geological Survey of Greenland, Denmark
IKU, Institute for Petroleum Research, Trondheim, Norway
Norwegian Food Research Institute (MATFORSK), Ås, Norway
Norwegian Society of Process Control
Norwegian Chemometrics Society
International Chemometrics Society
UOP Guided Wave, CA, USA
Pierre Gy, Cannes, France (for a gentleman's introduction to the finest French wines)
Zander & Ingerström, Oslo, Norway
Tomas Öberg Konsult AB, Karlskoga, Sweden
KAPITAL (weekly Norwegian economic magazine), no. 14/1994, p. 50-55
Hlif Sigurjonsdottir, Reykjavik, Iceland (owner of G. Sgarabotto violin no. 9)
Birgitta Spur, LSO, Reykjavik, Iceland (permission to use the Sgarabotto oeuvre data)
Sensorteknikk A/S, Bærum, Oslo (Bjørn Hope: sensor technology entrepreneur extraordinaire; Evy: for innumerable occasions: warm company, coffee and waffles, waffles, waffles)
Thorbjørn T. Lied, Maths Halstensen, Tore Gravermoen, Rune Mathisen and others (for enormous help in developing acoustic chemometrics)
Anonymous wine importer, Odense, Denmark
Helpful wine assessors (partly anonymous), Manson, WA, USA

Finally, the author(s) and CAMO wish to thank all THE UNSCRAMBLER users during the last seven years for their close relationships with us, which have given us so much added experience in teaching multivariate data analysis. And thanks for all the constructive criticism of the earlier editions of this book.

Last, but certainly not least, a warm thank you to all the students at HIT/TF, at Ålborg University, Esbjerg, and many, many others who have been associated with the teaching of the authors, nearly all of whom have been very constructive in their ongoing criticism of the entire teaching system embedded in this training package. We even learned from the occasional not-so-friendly criticisms.

Communication

After a formative period of seven years, the training package has come of age. By now we are actually beginning to be rather satisfied with it! And yet: the author(s) and CAMO always welcome all critical responses to the present text. They are seriously needed if this work is to keep improving.

Contents

1. Introduction to Multivariate Data Analysis - Overview
  1.1 Indirect Observations and Correlation
  1.2 Hidden Data Structures
  1.3 Multivariate Data Analysis vs. Multivariate Statistics
  1.4 Main Objectives of Multivariate Data Analytical Techniques
  1.5 Multivariate Techniques as Projections
2. Getting Started - with Descriptive Statistics
  2.1 Purpose
  2.2 Data Set 1: Quality of Green Peas
  2.3 Data Set 2: Economic Characteristics of Car Dealerships in Norway
3. Principal Component Analysis (PCA) - Introduction
  3.1 Representing the Data as a Matrix
  3.2 The Variable Space - Plotting Objects in p Dimensions
  3.3 Plotting Objects in Variable Space
    3.3.1 Exercise - Plotting Raw Data (People)
  3.4 The First Principal Component
  3.5 Extension to Higher-Order Principal Components
  3.6 Principal Component Models - Scores and Loadings
    3.6.1 Model Center
    3.6.2 Loadings - Relations Between X and PCs
    3.6.3 Scores - Coordinates in PC Space
    3.6.4 Object Residuals
  3.7 Objectives of PCA
  3.8 Score Plot - Map of Samples
  3.9 Loading Plot - Map of Variables

  3.10 Exercise: Plotting and Interpreting a PCA-Model (People)
  3.11 PC-Models
    3.11.1 The PC Model: X = TPᵀ + E = Structure + Noise
    3.11.2 Residuals - The E-Matrix
    3.11.3 How Many PCs to Use?
    3.11.4 Variable Residuals
    3.11.5 More about Variances - Modeling Error Variance
  3.12 Exercise - Interpreting a PCA Model (Peas)
  3.13 Exercise - PCA Modeling (Car Dealerships)
  3.14 PCA Modeling - The NIPALS Algorithm
4. Principal Component Analysis (PCA) - In Practice
  4.1 Scaling or Weighting
  4.2 Outliers
    4.2.1 Scaling, Transformation and Normalization are Highly Problem Dependent Issues
  4.3 PCA Step by Step
    4.3.1 The Unscrambler and PCA
  4.4 Summary of PCA
    4.4.1 Interpretation of PCA-Models
    4.4.2 Interpretation of Score Plots - Look for Patterns
    4.4.3 Summary - Interpretation of Score Plots
    4.4.4 Summary - Interpretation of Loading Plots
  4.5 PCA - What Can Go Wrong?
  4.6 Exercise - Detecting Outliers (Troodos)
5. PCA Exercises - Real-World Application Examples
  5.1 Exercise - Find Clusters (Iris Species Discrimination)
  5.2 Exercise - PCA for Experimental Design (Lewis Acids)
  5.3 Exercise - Mud Samples
  5.4 Exercise - Scaling (Troodos)
6. Multivariate Calibration (PCR/PLS)
  6.1 Multivariate Modeling (X, Y): The Calibration Stage
  6.2 Multivariate Modeling (X, Y): The Prediction Stage
  6.3 Calibration Set Requirements (Training Data Set)
  6.4 Introduction to Validation
  6.5 Number of Components (Model Dimensionality)
  6.6 Univariate Regression (y|x) and MLR

    6.6.1 Univariate Regression (y|x)
    6.6.2 Multiple Linear Regression, MLR
  6.7 Collinearity
  6.8 PCR - Principal Component Regression
    6.8.1 Exercise - Interpretation of Jam (PCR)
    6.8.2 Weaknesses of PCR
  6.9 PLS-Regression (PLS-R)
    6.9.1 PLS - A Powerful Alternative to PCR
    6.9.2 PLS (X, Y): Initial Comparison with PCA(X), PCA(Y)
    6.9.3 PLS2 NIPALS Algorithm
    6.9.4 Interpretation of PLS Models
    6.9.5 The PLS1 NIPALS Algorithm
    6.9.6 Exercise - Interpretation of PLS1 (Jam)
    6.9.7 Exercise - Interpretation of PLS2 (Jam)
  6.10 When to Use Which Method?
    6.10.1 Exercise - Compare PCR and PLS1 (Jam)
  6.11 Summary
7. Validation: Mandatory Performance Testing
  7.1 The Concept of Test Set Validation
    7.1.1 Calculating the Calibration Variance (Modeling Error)
    7.1.2 Calculating the Validation Variance (Prediction Error)
    7.1.3 Studying the Calibration and Validation Variances
  7.2 Requirements for the Test Set
  7.3 Cross Validation
  7.4 Leverage Corrected Validation
8. How to Perform PCR and PLS-R
  8.1 PLS and PCR - Step by Step
  8.2 Optimal Number of Components in Modeling
  8.3 Information in Later PCs
  8.4 Exercises on PLS and PCR: the Heart-of-the-Matter!
    8.4.1 Exercise - PLS2 (Peas)
    8.4.2 Exercise - PLS1 or PLS2? (Peas)
    8.4.3 Exercise - Is PCR Better than PLS? (Peas)
9. Multivariate Data Analysis in Practice: Miscellaneous Issues
  9.1 Data Constraints

    9.1.1 Data Matrix Dimensions
    9.1.2 Missing Data
  9.2 Data Collection
    9.2.1 Use Historical Data
    9.2.2 Monitoring Data from an On-Going Process
    9.2.3 Data Generated by Planned Experiments
    9.2.4 Perform Experiments or Collect Data - Always by Careful Reflection
    9.2.5 The Random Design - A Powerful Alternative
  9.3 Selecting from Abundant Data
    9.3.1 Selecting a Calibration Data Set from Abundant Training Data
    9.3.2 Selecting a Validation Data Set
  9.4 Error Sources
  9.5 Replicates - A Means to Quantify Errors
  9.6 Estimates of Experimental and Measurement Errors
    9.6.1 Error in Y (Reference Method): Reproducibility
    9.6.2 Stability over Consecutive Measurements: Repeatability
  9.7 Handling Replicates in Multivariate Modeling
  9.8 Validation in Practice
    9.8.1 Test Set
    9.8.2 Cross Validation
    9.8.3 Leverage Correction
    9.8.4 The Multivariate Model Validation Alternatives
  9.9 How Good is the Model: RMSEP and Other Measures
    9.9.1 Residuals
    9.9.2 Residual Variances (Calibration, Prediction)
    9.9.3 Correction for Degrees of Freedom
    9.9.4 RMSEP and RMSEC - Average, Representative Errors in Original Units
    9.9.5 RMSEP, SEP and Bias
    9.9.6 Comparison Between Prediction Error and Measurement Error
    9.9.7 Compare RMSEP for Different Models
    9.9.8 Compare Results with Other Methods
    9.9.9 Other Measures of Errors
  9.10 Prediction of New Data
    9.10.1 Getting Reliable Prediction Results
    9.10.2 How Does Prediction Work?
    9.10.3 Prediction Used as Validation

    9.10.4 Uncertainty at Prediction
    9.10.5 Study Prediction Objects and Training Objects in the Same Plot
  9.11 Coding Category Variables: PLS-DISCRIM
  9.12 Scaling or Weighting Variables
  9.13 Using the B- and the Bw-Coefficients
  9.14 Calibration of Spectroscopic Data
    9.14.1 Spectroscopic Data: Calibration Options
    9.14.2 Interpretation of Spectroscopic Calibration Models
    9.14.3 Choosing Wavelengths
10. PLS (PCR) Exercises: Real-World Application Examples - I
  10.1 Exercise - Prediction of Gasoline Octane Number
  10.2 Exercise - Water Quality
  10.3 Exercise - Freezing Point of Jet Fuel
  10.4 Exercise - Paper
11. PLS (PCR) Multivariate Calibration - In Practice
  11.1 Outliers and Subgroups
    11.1.1 Scores
    11.1.2 X-Y Relation Outlier Plots (T vs. U Scores)
    11.1.3 Residuals
    11.1.4 Dangerous Outliers or Interesting Extremes?
  11.2 Systematic Errors
    11.2.1 Y-Residuals Plotted Against Objects
    11.2.2 Residuals Plotted Against Predicted Values
    11.2.3 Normal Probability Plot of Residuals
  11.3 Transformations
    11.3.1 Logarithmic Transformations
    11.3.2 Spectroscopic Transformations
    11.3.3 Multiplicative Scatter Correction
    11.3.4 Differentiation
    11.3.5 Averaging
    11.3.6 Normalization
  11.4 Non-Linearities
    11.4.1 How to Handle Non-Linearities?
    11.4.2 Deleting Variables
  11.5 Procedure for Refining Models

  11.6 Precise Measurements vs. Noisy Measurements
  11.7 How to Interpret the Residual Variance Plot
  11.8 Summary: The Unscrambler Plots Revealing Problems
12. PLS (PCR) Exercises: Real-World Applications - II
  12.1 Exercise - Log-Transformation (Dioxin)
  12.2 Exercise - Multiplicative Scatter Correction (Alcohol)
  12.3 Exercise - Dirty Data (Geologic Data with Severe Uncertainties)
  12.4 Exercise - Spectroscopy Calibration (Wheat)
  12.5 Exercise - QSAR (Cytotoxicity)
13. Master Data Sets: Interim Examination
  13.1 Sgarabotto Master Violin Data Set
  13.2 Norwegian Car Dealerships - Revisited
  13.3 Vintages
  13.4 Acoustic Chemometrics (a.c.)
14. Uncertainty Estimates, Significance and Stability (Martens Uncertainty Test)
  14.1 Uncertainty Estimates in Regression Coefficients, b
  14.2 Rotation of Perturbed Models
  14.3 Variable Selection
  14.4 Model Stability
    14.4.1 Introduction
    14.4.2 An Example Using the Paper Data
  14.5 Exercise - Paper - Uncertainty Test and Model Stability
15. SIMCA: An Introduction to Classification
  15.1 SIMCA - Fields of Use
  15.2 How to Make SIMCA Class-Models?
    15.2.1 Basic SIMCA Steps: A Standard Flow-Sheet
  15.3 How Do We Classify New Samples?
  15.4 Classification Results
    15.4.1 Statistical Significance Level and its Use: An Introduction
  15.5 Graphical Interpretation of Classification Results
    15.5.1 The Coomans Plot
    15.5.2 The Si vs. Hi Plot (Distance vs. Leverage)

    15.5.3 Si/S0 vs. Hi
    15.5.4 Model Distance
    15.5.5 Variable Discrimination Power
    15.5.6 Modeling Power
  15.6 SIMCA-Exercise - IRIS Classification
16. Introduction to Experimental Design
  16.1 Experimental Design
  16.2 Screening Designs
    16.2.1 Full Factorial Designs
    16.2.2 Fractional Factorial Designs
    16.2.3 Plackett-Burman Designs
  16.3 Analyzing a Screening Design
    16.3.1 Significant Effects
    16.3.2 Using F-Test and P-Values to Determine Significant Effects
    16.3.3 Exercise - Willgerodt-Kindler Reaction
  16.4 Optimization Designs
    16.4.1 Central Composite Designs
    16.4.2 Box-Behnken Designs
  16.5 Analyzing an Optimization Design
    16.5.1 Exercise - Optimization of Enamine Synthesis
  16.6 Practical Aspects of Making an Experimental Design
  16.7 Extending a Design
  16.8 Validation of Designed Data Sets
  16.9 Problems in Designed Data Sets
    16.9.1 Detect and Interpret Effects
    16.9.2 How to Separate Confounded Effects?
    16.9.3 Blocking and Repeated Response Measurements
    16.9.4 Fold-Over Designs
    16.9.5 What Do We Do if We Cannot Keep to the Planned Variable Settings?
    16.9.6 A Random Design
    16.9.7 Modeling Uncoded Data
  16.10 Exercise - Designed Data with Non-Stipulated Values (Lacotid)
  16.11 Experimental Design Procedure in The Unscrambler
17. Complex Experimental Design Problems

  17.1 Introduction to Complex Experimental Design Problems
    17.1.1 Constraints Between the Levels of Several Design Variables
    17.1.2 A Special Case: Mixture Situations
    17.1.3 Alternative Solutions
  17.2 The Mixture Situation
    17.2.1 An Example of Mixture Design
    17.2.2 Screening Designs for Mixtures
    17.2.3 Optimization Designs for Mixtures
    17.2.4 Designs that Cover a Mixture Region Evenly
  17.3 How To Deal With Constraints
    17.3.1 Introduction to the D-Optimal Principle
    17.3.2 Non-Mixture D-Optimal Designs
    17.3.3 Mixture D-Optimal Designs
    17.3.4 Advanced Topics
  17.4 How To Analyze Results From Constrained Experiments
    17.4.1 Use of PLS Regression For Constrained Designs
    17.4.2 Relevant Regression Models
    17.4.3 The Mixture Response Surface Plot
  17.5 Exercise - Build a Mixture Design - Wines
18. Comparison of Methods for Multivariate Data Analysis - And their Validation
  18.1 Comparison of Selected Multivariate Methods
    18.1.1 Principal Component Analysis (PCA)
    18.1.2 Factor Analysis (FA)
    18.1.3 Cluster Analysis (CA)
    18.1.4 Linear Discriminant Analysis (LDA)
    18.1.5 Comparison: Projection Dimensionality in Multivariate Data Analysis
    18.1.6 Multiple Linear Regression (MLR)
    18.1.7 Principal Component Regression (PCR)
    18.1.8 Partial Least Squares Regression (PLS-R)
    18.1.9 Increasing Projection Dimensionality in Regression Modeling
  18.2 Choosing Multivariate Methods Is Not Optional!
    18.2.1 Problem Formulation
  18.3 Unsupervised Methods
  18.4 Supervised Methods

  18.5 A Final Discussion about Validation
    18.5.1 Test Set Validation
    18.5.2 Cross Validation
    18.5.3 Leverage Corrected Validation
    18.5.4 Selecting a Validation Approach in Practice
  18.6 Summary of Basic Rules for Success
  18.7 From Here You Are on Your Own. Good Luck!
19. Literature
20. Appendix: Algorithms
  20.1 PCA
  20.2 PCR
  20.3 PLS1
  20.4 PLS2
21. Appendix: Software Installation and User Interface
  21.1 Welcome to The Unscrambler
  21.2 How to Install and Configure The Unscrambler
  21.3 Problems You Can Solve with The Unscrambler
  21.4 The Unscrambler Workplace
    21.4.2 The Editor
    21.4.3 The Viewer
    21.4.4 Dockable Views
    21.4.5 Dialogs
    21.4.6 The Help System
    21.4.7 Tooltips
  21.5 Using The Unscrambler Efficiently
    21.5.1 Analyses
    21.5.2 Some Tips to Make Your Work Easier
Glossary of Terms
Index