DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA

C. K. Lowe-Ma, A. E. Chen, D. Scholl
Physical & Environmental Sciences, Research and Advanced Engineering, Ford Motor Company, Dearborn, Michigan, USA

C. J. Gilmore, R. J. Thatcher
Chemistry Department, University of Glasgow, Glasgow, Scotland, UK

W. Sverdlik
Department of Computer Science, Eastern Michigan University, Ypsilanti, Michigan, USA

Abstract

High-Throughput Materials Discovery uses automation and parallelism to synthesize and evaluate large numbers of specimens while reducing time and costs associated with finding and optimizing novel materials. As optimal performance may not be uniformly distributed throughout parameter space, efficient tools for analyzing data and evaluating large areas of compositional or parameter space are needed. Data mining tools enable moving from the statistics of limited experimental designs to more descriptive and predictive relationships. Clustering a set of 47 samples for which both X-ray powder diffraction data and X-ray fluorescence-based elemental composition data were available showed that elemental composition correlated strongly with phase composition in this particular set of samples. Also, the clustering of the X-ray data was found to be exactly coincident with a different sample characteristic "type". Decision tree classification of a larger data set of 86 samples showed that "type" could be defined with very few errors from relatively few splits of the XRF-based compositions. Although composition exhibited strong clustering, measures of performance in these same samples exhibited only very weak clustering. However, performance of the materials could be predicted from linear regression using different slices of the data. Neural nets were attempted for improved predictability of performance beyond linear regression.
As expected from the linear regression results, single-output linear-based multi-layer perceptrons yielded acceptable predictive capability, but were found to yield notably degraded predictive results if "type" was excluded from the models. The strong dependence of performance on "type" for these samples was an unexpected outcome of the data analysis.

Introduction

High-Throughput Materials Discovery makes use of automated instrumentation and parallelism to synthesize and test large numbers of specimens (Figure 1). The foundation of this approach is that more can be learned from experiments on a widely diverse set of specimens than from complex, detailed measurements on simple systems or from measurements of a limited number of samples. Automated instrumentation and large numbers of specimens imply that large amounts of data will be generated, creating a strong need for efficient methods of evaluating data from large areas of compositional or parameter space. Although standard design-of-experiments (DOE) statistical tools can provide a basis for selecting parameters and interpreting results, DOE tools are inherently constrained to the parameter space examined. We would like to take knowledge gleaned from a wide diversity of specimens,

This document was presented at the Denver X-ray Conference (DXC) on Applications of X-ray Analysis. Sponsored by the International Centre for Diffraction Data (ICDD). This document is provided by ICDD in cooperation with the authors and presenters of the DXC for the express purpose of educating the scientific community. All copyrights for the document are retained by ICDD. Usage is restricted for the purposes of education and scientific research. DXC Website www.dxcicdd.com ICDD Website - www.icdd.com

describe our knowledge about these specimens, and develop predictions about regions in parameter space where further studies would be warranted. Describing data and developing predictions fall in the realm of data mining. Instead of the inward, deductive data focus of DOE and statistical analysis tools, data mining emphasizes learning from examples and extrapolating to more general descriptive or predictive models through the use of a variety of artificial intelligence, pattern recognition, and machine learning algorithms. Effective data mining is all about how to formulate questions that are meaningful or sensible and how to prepare data to correctly answer those questions. Unfortunately, no general recipes exist for designing good questions or for preparing data, especially scientific data, although some useful general references are available [1,2]. Types of standard data mining algorithms that might be used to answer questions are listed in Table 1. In this paper, clustering, regression, decision tree classification, and neural nets were used to examine relationships in a dataset that contained both quantitative X-ray fluorescence compositions and X-ray powder diffraction data.

Figure 1. Ford Motor Company implementation of High-Throughput Materials Discovery. [Diagram labels: Design Experiment (DoE Tools); Data Reduction and Data Mining; Database; Robotic Synthesis; Parallel Screening.]

Results and Discussion

As previously mentioned, one of the biggest challenges in data mining is data preparation. Although many vendors offer very capable software for handling X-ray powder diffraction data, we developed a fully automated empirical algorithm for background subtraction using Python. The algorithm (Equation 1) uses a 6-parameter fit with complex non-linear weighting but requires only a single input parameter from the user specifying an estimate of where background is relative to the last few points at the high-angle end of the scan.
The algorithm fits both the low-angle scatter arising from powder surface roughness and the flat background expected at higher angles from off-axis-cut zero-background quartz substrates (Figure 2). Minimization is achieved using a Nelder-Mead simplex.

$$y = a_1 + a_2\,\frac{1}{x} + a_3\,\frac{1}{x^{2}} + a_4\,\frac{1}{x^{3}} + a_5\,e^{a_6 x^{2}} \qquad \text{(Eqn. 1)}$$

The advantage of using this algorithm for background subtraction is that all diffraction scans are treated the same and a very large number of data files can be handled very efficiently by listing the filenames in a batch run file. Following background subtraction, the X-ray powder

diffraction data can then be further processed. For the analyses described below, the X-ray powder diffraction data were subsequently processed using PolySNAP [3,4].

Table 1. Types of Data Mining Algorithms

Classification Models       Regression (numerical data)             Descriptive Data Visualization         Other
--------------------------  --------------------------------------  -------------------------------------  -------------------------------------------
Version space hypotheses    Linear and multiple regression          Statistical exploratory data analysis  Market basket analyses, a priori algorithms
Decision trees and lists    Regression and model trees              Hierarchical clustering                Textual analyses
Instance-based classifiers  Adaptive neural nets, multilayer nets   K-means clustering                     Image analysis and segmentation
Perceptron neural nets      Genetic algorithms                      Expectation Maximization clustering    Genetic algorithms
Bayesian inference

Figure 2. Two X-ray powder diffraction scans showing the effectiveness of the new algorithm in fitting a background. The red line is the fitted background, y in Equation 1.

The X-ray powder diffraction data were obtained with either a PAD-V or an X2 Scintag powder diffractometer equipped with a copper-target X-ray tube. Data were collected with continuous scans and electronic integration over 0.03° 2θ. The X-ray fluorescence data were obtained with a Philips PW2400 with a chromium tube using UniQuant5, with sensitivities optimized using additional in-house calibration standards and with background channels customized to better handle the chemistries of these samples. The resulting output of oxide weight percentages was converted to moles of each element. The data were prepared such that relationships between phase composition, elemental composition, and performance could be examined.
Merging data from different characterization techniques yielded two data sets: (1) a set of 47 samples with X-ray powder diffraction (XRD) data, elemental compositions from X-ray fluorescence (XRF), surface area, and four measures of performance; (2) a related data set containing 86 samples with XRF data, surface area, a parameter for history (sample aging), and four measures of performance but without XRD data.
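
The conversion of oxide weight percentages to moles of each element mentioned above is plain stoichiometry. The following minimal sketch shows one way it could be done; the oxides, the function names, and the example composition are hypothetical and are not the chemistries or code used in this work.

```python
# Illustrative conversion of oxide weight percentages to moles of each element.
# The oxides below are examples only (assumption), not the samples in the paper.
# molar_mass is in g/mol; cations is the number of cation atoms per formula unit.
OXIDES = {
    "Al2O3": {"element": "Al", "molar_mass": 101.96, "cations": 2},
    "SiO2": {"element": "Si", "molar_mass": 60.08, "cations": 1},
    "CeO2": {"element": "Ce", "molar_mass": 172.11, "cations": 1},
}


def oxide_wt_to_moles(wt_percent, basis_grams=100.0):
    """Return moles of each cation element per `basis_grams` of sample."""
    moles = {}
    for oxide, wt in wt_percent.items():
        info = OXIDES[oxide]
        grams_oxide = basis_grams * wt / 100.0
        mol_element = info["cations"] * grams_oxide / info["molar_mass"]
        moles[info["element"]] = moles.get(info["element"], 0.0) + mol_element
    return moles


def mole_fractions(moles):
    """Normalize element moles so that the fractions sum to 1."""
    total = sum(moles.values())
    return {element: n / total for element, n in moles.items()}


# Hypothetical composition in weight percent of each oxide.
example = {"Al2O3": 50.0, "SiO2": 30.0, "CeO2": 20.0}
moles = oxide_wt_to_moles(example)
fractions = mole_fractions(moles)
for element in sorted(moles):
    print(f"{element}: {moles[element]:.4f} mol per 100 g, fraction {fractions[element]:.3f}")
```

Working in moles rather than weight percent puts heavy and light elements on a comparable footing before clustering or regression.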

Data sets (1) and (2) were initially examined for natural groupings in the data with clustering. STATISTICA [5] was used for hierarchical clustering of the XRF, surface area, and performance data. Similarity clustering of the XRF, surface area, and performance data in various combinations with the XRD data was accomplished with the three-way multidimensional scaling of PolySNAP [3,4]. For more predictive models, regression and decision tree classification were accomplished with the open-source software WEKA [6]. Neural nets were developed using STATISTICA Neural Nets.

Figure 3. (a) The clustering of the XRPD data in data set (1) by multi-dimensional scaling in PolySNAP. (b) The clustering of the XRF data in data set (1), also by multi-dimensional scaling. Although difficult to see in these images, the cluster membership is exactly the same for both types of data.

Figure 4. From PolySNAP using data set (1), similarity clustering of a subset of the elemental data from XRF (a) without surface area and (b) with surface area included.
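
Similarity clustering of whole diffraction patterns needs a pairwise measure of how alike two scans are. The following minimal sketch assumes a Pearson correlation between background-subtracted patterns measured on a common 2θ grid, converted to a distance d = 0.5(1 - r). The function names and toy intensities are hypothetical, and the sketch illustrates the general idea rather than PolySNAP's actual implementation.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("patterns must have the same length (at least 2 points)")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("a constant pattern has no defined correlation")
    return cov / math.sqrt(var_a * var_b)


def distance_matrix(patterns):
    """Symmetric matrix of d = 0.5 * (1 - r): 0 for identical shapes, 1 for r = -1."""
    n = len(patterns)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (1.0 - pearson(patterns[i], patterns[j]))
            dist[i][j] = dist[j][i] = d
    return dist


# Toy intensities on a shared 2-theta grid (made-up values).
scan_a = [0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 2.0, 0.0]
scan_b = [2.0 * v for v in scan_a]                   # same peaks, different scale
scan_c = [0.0, 0.0, 0.0, 2.0, 6.0, 2.0, 0.0, 0.0]    # peaks at other positions

dist = distance_matrix([scan_a, scan_b, scan_c])
for row in dist:
    print(["%.3f" % d for d in row])
```

Because the correlation ignores overall intensity scale, scans of the same phase mixture collected with different counting statistics end up close together, while scans with peaks at different positions end up far apart. A matrix of this kind is the usual input to multidimensional scaling or hierarchical clustering.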

As illustrated in Figure 3, cluster membership is found to be the same for both types of X-ray data, XRD and XRF. Therefore, the phase composition has a strong relationship to elemental composition: different variations in specimen composition are related to the presence of different phases. Examination of the cluster membership shows that the members accurately reflect a descriptor, sample type, that was derived from other information unrelated to any chemical or characterization measurements; e.g., sample type reflects the source from which the chemicals originated.

Figure 5. Similarity clustering from PolySNAP that results from adding XRD data to surface area and a subset of XRF elemental data (data set 1).

Our knowledge of the samples tells us that not all of the elemental composition should be related to sample type. Manually selecting a subset of the XRF data enables probing relationships beyond the influence of sample type. However, the subset of XRF data exhibits relatively weak clustering (Figure 4a). Including surface area with the XRF data changes the cluster membership (Figure 4b) but does not strengthen the relationships. Hierarchical clustering using complete linkage with Euclidean distances for the same subset of XRF data, but from the larger data set (2) of 86 samples, still yields poor clustering with very small linkage distances. However, inclusion of surface area in the hierarchical clustering of the larger data set does yield more numerically significant linkage distances and more distinct clusters. Not surprisingly, because the XRD data contain information so strongly related to sample type (Figure 3a), the addition of XRD clustering to the XRF subset-surface area clustering imposes a more definite structure on the overall clustering (Figure 5). Nevertheless, surface area and the XRF subset of data influence the cluster membership compared with Figure 3a.
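
For readers unfamiliar with complete-linkage hierarchical clustering, the following minimal sketch shows how linkage distances arise. It is a plain-Python illustration with made-up two-feature samples, not the STATISTICA procedure used here; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(points):
    """Agglomerative clustering with complete linkage.

    Returns the merge history as (members_a, members_b, linkage_distance).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the LARGEST pairwise distance.
                d = max(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Made-up samples: each row holds two scaled features (e.g. two element amounts).
samples = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
merges = complete_linkage(samples)
for members_a, members_b, distance in merges:
    print(members_a, "+", members_b, "at linkage distance %.3f" % distance)
```

In a merge history like this, a large jump between the early linkage distances and the final ones indicates distinct clusters; when all linkage distances are small and similar, as reported above for the XRF subset alone, the grouping is weak. Features on very different numerical scales (such as surface area versus mole fractions) would normally be standardized first.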
Examination of the clustering relationships for the four measures of performance indicates that the performance data alone show no strong tendency to cluster. The larger data set (2) of 86 samples but without XRD data was used to test for efficacy in predicting performance. For the prediction model building, selecting amongst the possible

twenty-one primary variables, thirteen derived variables, and four response variables was accomplished either by the independent feature selection heuristic of STATISTICA Data Miner or by using, in each technique, the embedded algorithms that selectively add or subtract parameters. Rather surprisingly, linear regression models for all four measures of performance could be found with correlation coefficients ranging from 0.84 to 0.96 [7]. Different combinations of XRF elements, surface area, and history parameter yield statistically comparable models, although all models included sample type. Decision tree classification shows that sample type can be defined with very few errors from relatively few splits of the XRF-based compositions, which is consistent with the clustering observed using PolySNAP (Figure 3b).

To examine the influence of sample type on the regression models, sample type and the elements defining sample type were excluded, while the history parameter and various combinations of surface area with the remaining XRF elements were included. With sample type excluded, the correlation coefficients for the linear regression models dropped significantly, to between 0.77 and 0.82. This leads us to conclude that the measures of performance that were tested do depend to some extent on aging history of the specimens, surface area, and other aspects of composition besides sample type, but that for these particular materials, sample type is a significant factor related to the performance of the materials.

Predictive models developed using neural nets show the same trend: predictions are notably degraded without the inclusion of sample type. The predictive capabilities of neural net models are further degraded if multiple predictions are attempted. This may suggest that the parameters remaining after removing sample type are only weakly related to performance and may be insufficiently independent to successfully predict material performance.
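
The effect of dropping sample type from a linear model can be illustrated with a small synthetic example. The sketch below fits ordinary least squares with and without a 0/1 type indicator and compares the correlation between fitted and observed values. All numbers are invented for illustration, the function names are hypothetical, and the code is not the WEKA procedure used in this work.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(predictors, y):
    """Least-squares fit with an intercept; returns (coefficients, fitted values)."""
    x = [[1.0] + list(row) for row in predictors]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in x]
    return coef, fitted


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic data (assumption): performance depends weakly on surface area
# and strongly on a two-level sample type.
surface_area = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
sample_type = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
performance = [1.6, 1.9, 2.6, 2.9, 4.4, 5.1, 5.4, 6.1]

coef_with, fit_with = fit_ols(list(zip(surface_area, sample_type)), performance)
coef_without, fit_without = fit_ols([(s,) for s in surface_area], performance)

r_with = pearson(fit_with, performance)
r_without = pearson(fit_without, performance)
print("r with type: %.3f, r without type: %.3f" % (r_with, r_without))
```

In this toy case the drop in correlation is far larger than the one reported above, because the synthetic performance values were built to depend mainly on type; the sketch only shows the mechanics of the comparison.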
Conclusions

XRD phase composition and XRF elemental composition were found to yield the same clustering and, hence, both types of X-ray data have a strong relationship to each other in the specimens examined. Cluster membership of the X-ray data was found to be indicative of an otherwise unrelated descriptor, sample type. Models developed for these data sets needed the inclusion of sample type to be effective in predicting performance. Although the dependence on sample type is, perhaps, not surprising in retrospect, models independent of sample type would be more useful. Hence, the next step for extending our data mining is to find other descriptors that improve prediction of performance without requiring the inclusion of sample type in the model. Dimensionality reduction of spectral-type X-ray data may yield other descriptors useful for modeling performance. Improved predictive models would guide us to other regions in parameter space in which to search for new or optimized materials.

References

[1] Data Mining, Ian Witten and Eibe Frank (2000); Machine Learning, Tom Mitchell (1997); Data Mining: Concepts and Techniques, J. Han and M. Kamber (2001); Data Mining: Concepts, Models, Methods, and Algorithms, Mehmed Kantardzic (2003).
[2] Data Preparation for Data Mining, Dorian Pyle (1999).
[3] PolySNAP, Bruker AXS; also G. Barr, W. Dong, C.J. Gilmore (2004). PolySNAP: a computer program for analysing high-throughput powder diffraction data, J. Appl. Cryst. 37, 658.
[4] C.J. Gilmore, G. Barr, J. Paisley (2004). High-throughput powder diffraction. I. A new approach to qualitative and quantitative powder diffraction pattern analysis using full pattern profiles, J. Appl. Cryst. 37, 231; G. Barr, W. Dong, C.J. Gilmore (2004). High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis, J. Appl. Cryst. 37, 243.
[5] StatSoft, Inc. (2005). STATISTICA 7.1 or STATISTICA Data Miner, version 7. www.statsoft.com.
[6] Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco; see also Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/ml/weka/.
[7] Using WEKA's Greedy algorithm for linear regression models with the outlier (sample P31) removed.