DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA




C. K. Lowe-Ma, A. E. Chen, D. Scholl
Physical & Environmental Sciences, Research and Advanced Engineering, Ford Motor Company, Dearborn, Michigan, USA

C. J. Gilmore, R. J. Thatcher
Chemistry Department, University of Glasgow, Glasgow, Scotland, UK

W. Sverdlik
Department of Computer Science, Eastern Michigan University, Ypsilanti, Michigan, USA

Abstract

High-Throughput Materials Discovery uses automation and parallelism to synthesize and evaluate large numbers of specimens while reducing the time and costs associated with finding and optimizing novel materials. As optimal performance may not be uniformly distributed throughout parameter space, efficient tools for analyzing data and evaluating large areas of compositional or parameter space are needed. Data mining tools enable moving from the statistics of limited experimental designs to more descriptive and predictive relationships. Clustering a set of 47 samples for which both X-ray powder diffraction data and X-ray fluorescence-based elemental composition data were available showed that elemental composition correlated strongly with phase composition in this particular set of samples. Also, the clustering of the X-ray data was found to coincide exactly with a different sample characteristic, "type". Decision tree classification of a larger data set of 86 samples showed that "type" could be defined with very few errors from relatively few splits of the XRF-based compositions. Although composition exhibited strong clustering, measures of performance in these same samples exhibited only very weak clustering. However, performance of the materials could be predicted from linear regression using different slices of the data. Neural nets were attempted for improved predictability of performance beyond linear regression. As expected from the linear regression results, single-output linear-based multi-layer perceptrons yielded acceptable predictive capability, but were found to yield notably degraded predictive results if "type" was excluded from the models. The strong dependence of performance on "type" for these samples was an unexpected outcome of the data analysis.

Introduction

High-Throughput Materials Discovery makes use of automated instrumentation and parallelism to synthesize and test large numbers of specimens (Figure 1). The foundation of this approach is that more can be learned from experiments on a widely diverse set of specimens than from complex, detailed measurements on simple systems or from measurements of a limited number of samples. Automated instrumentation and large numbers of specimens imply that large amounts of data will be generated, creating a strong need for efficient methods of evaluating data from large areas of compositional or parameter space. Although standard experimental design (DOE) statistical tools can provide a basis for selecting parameters and interpreting results, DOE tools are inherently constrained to the parameter space examined. We would like to take knowledge gleaned from a wide diversity of specimens,

This document was presented at the Denver X-ray Conference (DXC) on Applications of X-ray Analysis. Sponsored by the International Centre for Diffraction Data (ICDD). This document is provided by ICDD in cooperation with the authors and presenters of the DXC for the express purpose of educating the scientific community. All copyrights for the document are retained by ICDD. Usage is restricted for the purposes of education and scientific research. DXC Website www.dxcicdd.com ICDD Website - www.icdd.com

describe our knowledge about these specimens, and develop predictions about regions in parameter space where further studies would be warranted. Describing data and developing predictions fall in the realm of data mining. Instead of the inward deductive data focus of DOE and statistical analysis tools, data mining emphasizes learning from examples and extrapolating to more general descriptive or predictive models through the use of a variety of artificial intelligence, pattern recognition, and machine learning algorithms. Effective data mining is all about how to formulate questions that are meaningful or sensible and how to prepare data to correctly answer those questions. Unfortunately, no general recipes exist for designing good questions or for preparing data, especially scientific data, although some useful general references are available [1,2]. Types of standard data mining algorithms that might be used to answer questions are listed in Table 1. In this paper, clustering, regression, decision tree classification, and neural nets were used to examine relationships in a dataset that contained both quantitative X-ray fluorescence compositions and X-ray powder diffraction data.

[Figure 1. Ford Motor Company implementation of High-Throughput Materials Discovery. Panel labels: Design Experiment (DoE Tools); Robotic Synthesis; Parallel Screening; Database; Data Reduction and Data Mining.]

Results and Discussion

As previously mentioned, one of the biggest challenges in data mining is data preparation. Although many vendors offer very capable software for handling X-ray powder diffraction data, we developed a fully automated empirical algorithm for background subtraction using Python. The algorithm (Equation 1) uses a 6-parameter fit with complex non-linear weighting but requires only a single input parameter from the user specifying an estimate of where background is relative to the last few points at the high-angle end of the scan.
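As a rough illustration of simplex-based background fitting, the following Python sketch fits a deliberately simplified two-parameter stand-in, a1 + a2/x, to a synthetic scan using a small hand-written Nelder-Mead routine. The model, the unweighted least-squares objective, the function names, and the synthetic data are all assumptions made for illustration; the six-parameter function and non-linear weighting used by the authors are not reproduced here.

```python
def background(x, a):
    """Illustrative stand-in background: constant plus one inverse-power term."""
    return a[0] + a[1] / x

def sum_sq(a, xs, ys):
    """Unweighted sum of squared residuals (the paper's weighting is not reproduced)."""
    return sum((background(x, a) - y) ** 2 for x, y in zip(xs, ys))

def nelder_mead(f, start, step=1.0, tol=1e-12, max_iter=5000):
    """Minimal Nelder-Mead simplex: reflection, expansion, inside contraction, shrink."""
    n = len(start)
    simplex = [list(start)]
    for i in range(n):
        p = list(start)
        p[i] += step
        simplex.append(p)
    values = [f(p) for p in simplex]
    for _ in range(max_iter):
        # Order vertices from best (lowest f) to worst.
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if abs(values[-1] - values[0]) < tol:
            break
        worst = simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_r = f(reflected)
        if f_r < values[0]:
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_e = f(expanded)
            if f_e < f_r:
                simplex[-1], values[-1] = expanded, f_e
            else:
                simplex[-1], values[-1] = reflected, f_r
        elif f_r < values[-2]:
            simplex[-1], values[-1] = reflected, f_r
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_c = f(contracted)
            if f_c < values[-1]:
                simplex[-1], values[-1] = contracted, f_c
            else:
                # Shrink every vertex halfway toward the best vertex.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (v - b) for b, v in zip(best, p)] for p in simplex[1:]
                ]
                values = [values[0]] + [f(p) for p in simplex[1:]]
    best_index = min(range(n + 1), key=lambda i: values[i])
    return simplex[best_index], values[best_index]

# Hypothetical scan grid from 5 to 90 in steps of 0.5, with a noise-free background.
xs = [5.0 + 0.5 * i for i in range(171)]
ys = [background(x, (20.0, 400.0)) for x in xs]
fit, residual = nelder_mead(lambda a: sum_sq(a, xs, ys), [0.0, 0.0])
```

In practice a library optimizer would normally replace the hand-written routine; it is spelled out here only so that the sketch has no dependencies.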
The algorithm fits both the low-angle scatter arising from powder surface roughness and the flat background expected at higher angles from off-axis-cut zero-background quartz substrates (Figure 2). Minimization is achieved using a Nelder-Mead simplex.

$$y = a_1 + \frac{a_2}{x} + \frac{a_3}{x^2} + \frac{a_4}{x^3} + a_5\, e^{[\ldots]} \qquad \text{(Eqn. 1)}$$

(The exponent of the final term, which involves a_6 and x, is not legible in the source.)

The advantage of using this algorithm for background subtraction is that all diffraction scans are treated the same and a very large number of data files can be handled very efficiently by listing the filenames in a batch run file. Following background subtraction, the X-ray powder

diffraction data can then be further processed. For the analyses described below, the X-ray powder diffraction data were subsequently processed using PolySNAP [3,4].

Table 1. Types of Data Mining Algorithms

Classification models: version space hypotheses; decision trees and lists; instance-based classifiers; perceptron neural nets; genetic algorithms; Bayesian inference.
Regression (numerical data): linear and multiple regression; regression and model trees; adaptive neural nets, multilayer nets; genetic algorithms.
Descriptive data visualization: statistical exploratory data analysis; hierarchical clustering; K-means clustering; Expectation Maximization clustering.
Other: market basket analyses, a priori algorithms; textual analyses; image analysis and segmentation.

[Figure 2. Two X-ray powder diffraction scans showing the effectiveness of the new algorithm in fitting a background. The red line is the fitted background, y in Equation 1.]

The X-ray powder diffraction data were obtained with either a PAD-V or an X2 Scintag powder diffractometer equipped with a copper-target X-ray tube. Data were collected with continuous scans and electronic integration over 0.03° 2θ. The X-ray fluorescence data were obtained with a Philips PW2400 with a chromium tube using UniQuant5, with sensitivities optimized using additional in-house calibration standards and with background channels customized to better handle the chemistries of these samples. The resulting output of oxide weight percentages was converted to moles of each element. The data were prepared such that relationships between phase composition, elemental composition, and performance could be examined.
Merging data from different characterization techniques yielded two data sets: (1) a set of 47 samples with X-ray powder diffraction (XRD) data, elemental compositions from X-ray fluorescence (XRF), surface area, and four measures of performance; (2) a related data set containing 86 samples with XRF data, surface area, a parameter for history (sample aging), and four measures of performance but without XRD data.
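The conversion of oxide weight percentages to moles of each element mentioned above is simple stoichiometry, and can be sketched as follows. The oxides listed (Al2O3, SiO2) are hypothetical placeholders, since the actual chemistries of the samples are not given; the function name and the 100 g basis are likewise assumptions for illustration.

```python
# Standard atomic weights in g/mol, rounded to three decimals.
ATOMIC_MASS = {"O": 15.999, "Al": 26.982, "Si": 28.086}

# Hypothetical oxide table: formula -> (cation, cations per formula, oxygens per formula).
OXIDES = {"Al2O3": ("Al", 2, 3), "SiO2": ("Si", 1, 2)}

def oxide_wt_to_element_moles(wt_percent, basis_grams=100.0):
    """Convert oxide weight percentages to moles of each cation for a given sample mass."""
    moles = {}
    for oxide, wt in wt_percent.items():
        cation, n_cation, n_oxygen = OXIDES[oxide]
        molar_mass = n_cation * ATOMIC_MASS[cation] + n_oxygen * ATOMIC_MASS["O"]
        oxide_moles = basis_grams * wt / 100.0 / molar_mass
        # Each mole of oxide contributes n_cation moles of the element.
        moles[cation] = moles.get(cation, 0.0) + n_cation * oxide_moles
    return moles

# Example: a hypothetical 50/50 wt% mixture on a 100 g basis.
example = oxide_wt_to_element_moles({"Al2O3": 50.0, "SiO2": 50.0})
```

Working in moles rather than weight percent puts light and heavy elements on a comparable footing before clustering or regression.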

Data sets (1) and (2) were initially examined for natural groupings in the data with clustering. STATISTICA [5] was used for hierarchical clustering of the XRF, surface area, and performance data. Similarity clustering of the XRF, surface area, and performance data in various combinations with the XRD data was accomplished with the three-way multidimensional scaling of PolySNAP [6,7]. For more predictive models, regression and decision tree classification were accomplished with the open-source software WEKA [6]. Neural nets were developed using STATISTICA Neural Nets.

[Figure 3. (a) The clustering of the XRPD data in data set (1) by multi-dimensional scaling in PolySNAP. (b) The clustering of the XRF data in data set (1), also by multi-dimensional scaling. Although difficult to see in these images, the cluster membership is exactly the same for both types of data.]

[Figure 4. From PolySNAP using data set (1), similarity clustering of a subset of the elemental data from XRF (a) without surface area and (b) with surface area included.]
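Similarity clustering of full patterns starts from a pairwise distance matrix. A minimal sketch, assuming that similarity is measured by the Pearson correlation r between two patterns sampled on the same grid and mapped to a distance d = 0.5(1 - r), is given below. This mapping, the function names, and the toy patterns are assumptions chosen for illustration; PolySNAP's own matching procedure is described in refs. 3 and 4 and is not reproduced here.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant intensity lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def distance_matrix(patterns):
    """Symmetric matrix of d = 0.5 * (1 - r), so d ranges from 0 (same shape) to 1."""
    n = len(patterns)
    return [
        [0.0 if i == j else 0.5 * (1.0 - pearson(patterns[i], patterns[j]))
         for j in range(n)]
        for i in range(n)
    ]

# Toy patterns: the second is the first scaled by 2, the third has a different shape.
patterns = [
    [0.0, 1.0, 5.0, 1.0, 0.0],
    [0.0, 2.0, 10.0, 2.0, 0.0],
    [5.0, 1.0, 0.0, 1.0, 5.0],
]
d = distance_matrix(patterns)
```

Because correlation ignores overall scale, two scans of the same phase mixture collected with different intensities end up at nearly zero distance, which is the property that makes such a matrix a reasonable input to multidimensional scaling or hierarchical clustering.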

As illustrated in Figure 3, cluster membership is found to be the same for both types of X-ray data, XRD and XRF. Therefore, the phase composition has a strong relationship to elemental composition. Different variations in specimen composition are related to the presence of different phases. Examination of the cluster membership shows that the members accurately reflect a descriptor, "sample type", that was derived from other information unrelated to any chemical or characterization measurements, e.g., sample type reflects the source from which the chemicals originated.

[Figure 5. Similarity clustering from PolySNAP that results from adding XRD data to surface area and a subset of XRF elemental data (data set 1).]

Our knowledge of the samples tells us that not all of the elemental composition should be related to sample type. Manually selecting a subset of the XRF data enables probing relationships beyond the influence of sample type. However, the subset of XRF data exhibits relatively weak clustering (Figure 4a). Including surface area with the XRF data changes the clustering membership (Figure 4b) but does not strengthen the relationships. Hierarchical clustering using complete-linkage Euclidean distances for the same subset of XRF data but from the larger data set (2) of 86 samples still yields poor clustering with very small linkage distances. However, inclusion of surface area in the hierarchical clustering of the larger data set does yield more numerically significant linkage distances and more distinct clusters. Not surprisingly, because the XRD data contain information so strongly related to sample type (Figure 3a), the addition of XRD clustering to the XRF subset-surface area clustering imposes a more definite structure on the overall clustering (Figure 5). Nevertheless, surface area and the XRF subset of data influence the cluster membership compared with Figure 3a.
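The role of linkage distances in judging cluster strength can be seen in a small sketch of agglomerative clustering with complete linkage and Euclidean distance. The implementation below is a plain-Python illustration with made-up two-dimensional points; it is not the STATISTICA procedure, and no variable scaling is applied.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def complete_linkage(points):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                dist = max(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical samples forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
merges = complete_linkage(points)
```

Here the last merge occurs at a distance roughly ten times larger than the earlier ones, which is the signature of distinct clusters. When all merges occur at similarly small distances, as reported for the XRF subset alone, the grouping is weak.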
Examination of the clustering relationships for the four measures of performance indicates that the performance data alone show no strong tendency to cluster. The larger data set (2) of 86 samples but without XRD data was used to test for efficacy in predicting performance. For the prediction model building, selecting amongst the possible

twenty-one primary variables, thirteen derived variables, and four response variables was accomplished either by the independent feature selection heuristic of STATISTICA Data Miner or by using, in each technique, the embedded algorithms that selectively add or subtract parameters. Rather surprisingly, linear regression models for all four measures of performance could be found with correlation coefficients ranging from 0.84 to 0.96 [7]. Different combinations of XRF elements, surface area, and history parameter yield statistically comparable models, although all models included sample type. Decision tree classification shows that sample type can be defined with very few errors from relatively few splits of the XRF-based compositions, which is consistent with the clustering observed using PolySNAP (Figure 3b). To examine the influence of sample type on the regression models, sample type and the elements defining sample type were excluded, while the history parameter and various combinations of surface area with the remaining XRF elements were retained. Even with these variables retained, the correlation coefficients for the linear regression models dropped significantly, to between 0.77 and 0.82. This leads us to conclude that the measures of performance that were tested do depend to some extent on aging history of the specimens, surface area, and other aspects of composition besides sample type, but that for these particular materials, sample type is a significant factor related to the performance of the materials. Predictive models developed using neural nets show the same trend; predictions are notably degraded without the inclusion of sample type. The predictive capabilities of neural net models are further degraded if multiple predictions are attempted. This may suggest that the parameters remaining after removing sample type are only weakly related to performance and may be insufficiently independent to successfully predict material performance.
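The effect of dropping a categorical descriptor from a linear model can be illustrated with a small ordinary least-squares sketch that compares the correlation between observed and predicted values with and without a 0/1 "type" indicator. All data below are synthetic and the variable names are hypothetical; this is not the WEKA procedure used for the reported results, and the numbers it produces say nothing about the actual samples.

```python
import math

def fit_linear(rows, y):
    """Ordinary least squares via the normal equations; an intercept is added."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented system [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * t for x, t in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting (assumes a non-singular system).
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coef, rows):
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]

def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

# Synthetic example: performance depends on surface area and on a 0/1 type flag.
surface = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
sample_type = [0, 1, 0, 1, 1, 0, 1, 0]
noise = [0.2, -0.1, 0.0, 0.1, -0.2, 0.1, 0.0, -0.1]
performance = [1.0 + 2.0 * s + 5.0 * t + e for s, t, e in zip(surface, sample_type, noise)]

rows_with = list(zip(surface, sample_type))
rows_without = [(s,) for s in surface]
r_with = correlation(performance, predict(fit_linear(rows_with, performance), rows_with))
r_without = correlation(performance, predict(fit_linear(rows_without, performance), rows_without))
```

For nested least-squares models the correlation can never increase when a variable is removed, so the informative quantity is the size of the drop rather than its direction.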
Conclusions

XRD phase composition and XRF elemental composition were found to yield the same clustering and, hence, both types of X-ray data have a strong relationship to each other in the specimens examined. Cluster membership of the X-ray data was found to be indicative of an unrelated descriptor, "sample type". Models developed for these data sets needed the inclusion of sample type to be effective in predicting performance. Although the dependence on sample type is, perhaps, not surprising in retrospect, models independent of sample type would be more useful. Hence, the next step for extending our data mining is to find other descriptors that improve prediction of performance without requiring the inclusion of sample type in the model. Dimensionality reduction of spectral-type X-ray data may yield other descriptors useful for modeling performance. Improved predictive models would guide us to other regions in parameter space in which to search for new or optimized materials.

References

1. Data Mining, Ian Witten and Eibe Frank (2000); Machine Learning, Tom Mitchell (1997); Data Mining: Concepts and Techniques, J. Han and M. Kamber (2001); Data Mining: Concepts, Models, Methods, and Algorithms, Mehmed Kantardzic (2003).
2. Data Preparation for Data Mining, Dorian Pyle (1999).
3. PolySNAP, Bruker AXS; also G. Barr, W. Dong, C.J. Gilmore (2004). PolySNAP: a computer program for analysing high-throughput powder diffraction data, J. Appl. Cryst. 37, 658.
4. C.J. Gilmore, G. Barr, J. Paisley (2004). High-throughput powder diffraction. I. A new approach to qualitative and quantitative powder diffraction pattern analysis using full pattern profiles, J. Appl. Cryst. 37, 231; G. Barr, W. Dong, C.J. Gilmore (2004). High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis, J. Appl. Cryst. 37, 243.
5. StatSoft, Inc. (2005). STATISTICA 7.1 or STATISTICA Data Miner, version 7. www.statsoft.com.
6. Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco; see also Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/.
7. Using WEKA's Greedy algorithm for linear regression models with the outlier (sample P31) removed.