DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA


C. K. Lowe-Ma, A. E. Chen, D. Scholl
Physical & Environmental Sciences, Research and Advanced Engineering, Ford Motor Company, Dearborn, Michigan, USA

C. J. Gilmore, R. J. Thatcher
Chemistry Department, University of Glasgow, Glasgow, Scotland, UK

W. Sverdlik
Department of Computer Science, Eastern Michigan University, Ypsilanti, Michigan, USA

Abstract

High-Throughput Materials Discovery uses automation and parallelism to synthesize and evaluate large numbers of specimens while reducing time and costs associated with finding and optimizing novel materials. As optimal performance may not be uniformly distributed throughout parameter space, efficient tools for analyzing data and evaluating large areas of compositional or parameter space are needed. Data mining tools enable moving from the statistics of limited experimental designs to more descriptive and predictive relationships. Clustering a set of 47 samples for which both X-ray powder diffraction data and X-ray fluorescence-based elemental composition data were available showed that elemental composition correlated strongly with phase composition in this particular set of samples. Also, the clustering of the X-ray data was found to be exactly coincident with a different sample characteristic, "type". Decision tree classification of a larger data set of 86 samples showed that "type" could be defined with very few errors from relatively few splits of the XRF-based compositions. Although composition exhibited strong clustering, measures of performance in these same samples exhibited only very weak clustering. However, performance of the materials could be predicted from linear regression using different slices of the data. Neural nets were attempted for improved predictability of performance beyond linear regression.
As expected from the linear regression results, single-output linear-based multi-layer perceptrons yielded acceptable predictive capability, but yielded notably degraded predictive results if "type" was excluded from the models. The strong dependence of performance on "type" for these samples was an unexpected outcome of the data analysis.

This document was presented at the Denver X-ray Conference (DXC) on Applications of X-ray Analysis. Sponsored by the International Centre for Diffraction Data (ICDD). This document is provided by ICDD in cooperation with the authors and presenters of the DXC for the express purpose of educating the scientific community. All copyrights for the document are retained by ICDD. Usage is restricted for the purposes of education and scientific research.

Introduction

High-Throughput Materials Discovery makes use of automated instrumentation and parallelism to synthesize and test large numbers of specimens (Figure 1). The foundation of this approach is that more can be learned from experiments on a widely diverse set of specimens than from complex, detailed measurements on simple systems or from measurements of a limited number of samples. Automated instrumentation and large numbers of specimens mean that large amounts of data will be generated, which creates a strong need for efficient methods of evaluating data from large areas of compositional or parameter space. Although standard experimental design (DOE) statistical tools can provide a basis for selecting parameters and interpreting results, DOE tools are inherently constrained to the parameter space examined. We would like to take knowledge gleaned from a wide diversity of specimens, describe our knowledge about these specimens, and develop predictions about regions in parameter space where further studies would be warranted.

Describing data and developing predictions falls in the realm of data mining. Instead of the inward deductive data focus of DOE and statistical analysis tools, data mining emphasizes learning from examples and extrapolating to more general descriptive or predictive models through the use of a variety of artificial intelligence, pattern recognition, and machine learning algorithms. Effective data mining is all about how to formulate questions that are meaningful or sensible and how to prepare data to correctly answer those questions. Unfortunately, no general recipes exist for designing good questions or for preparing data, especially scientific data, although some useful general references are available [1,2]. Types of standard data mining algorithms that might be used to answer questions are listed in Table 1. In this paper, clustering, regression, decision tree classification, and neural nets were used to examine relationships in a dataset that contained both quantitative X-ray fluorescence compositions and X-ray powder diffraction data.

Figure 1. Ford Motor Company implementation of High-Throughput Materials Discovery. Diagram labels: Design Experiment (DoE Tools); Robotic Synthesis; Parallel Screening; Database; Data Reduction and Data Mining.

Results and Discussion

As previously mentioned, one of the biggest challenges in data mining is data preparation. Although many vendors offer very capable software for handling X-ray powder diffraction data, we developed a fully automated empirical algorithm for background subtraction using Python. The algorithm (Equation 1) uses a 6-parameter fit with complex non-linear weighting but requires only a single input parameter from the user specifying an estimate of where background is relative to the last few points at the high-angle end of the scan.
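As a rough illustration of how a background fit of this kind can be set up, the sketch below fits a smooth curve to a scan with a small Nelder-Mead simplex routine written in plain Python. It is not the authors' code: the three-term model (a constant plus two inverse-power terms), the unweighted least-squares objective, the starting steps, and all function names are assumptions made for the example, and the synthetic scan contains no diffraction peaks. The single user input is mimicked by the hypothetical `high_angle_factor`, which scales the mean of the last few points to give the starting value for the flat background.

```python
def background(x, a):
    """Assumed illustrative model: flat term plus two inverse-power terms."""
    return a[0] + a[1] / x + a[2] / (x * x)


def sum_sq(a, xs, ys):
    """Unweighted sum of squared residuals (the paper's weighting is not reproduced)."""
    return sum((y - background(x, a)) ** 2 for x, y in zip(xs, ys))


def nelder_mead(f, start, steps, max_iter=2000, tol=1e-10):
    """Small Nelder-Mead simplex minimizer; returns (best_point, best_value)."""
    n = len(start)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for i in range(n):
        p = list(start)
        p[i] += steps[i]
        simplex.append(p)
    values = [f(p) for p in simplex]
    for _ in range(max_iter):
        # Order vertices from best (lowest value) to worst.
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]
        if values[-1] - values[0] < tol:
            break
        worst = simplex[-1]
        # Centroid of every vertex except the worst one.
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_ref = f(reflected)
        if f_ref < values[0]:
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_exp = f(expanded)
            if f_exp < f_ref:
                simplex[-1], values[-1] = expanded, f_exp
            else:
                simplex[-1], values[-1] = reflected, f_ref
        elif f_ref < values[-2]:
            simplex[-1], values[-1] = reflected, f_ref
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_con = f(contracted)
            if f_con < values[-1]:
                simplex[-1], values[-1] = contracted, f_con
            else:
                # Shrink every vertex halfway toward the best one.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (v - b) for b, v in zip(best, p)] for p in simplex[1:]
                ]
                values = [values[0]] + [f(p) for p in simplex[1:]]
    i_best = min(range(n + 1), key=lambda i: values[i])
    return simplex[i_best], values[i_best]


# Synthetic, peak-free scan: angles from 5.0 to 90.0 and counts from a known curve.
xs = [5.0 + 0.5 * k for k in range(171)]
ys = [background(x, (20.0, 300.0, 1500.0)) for x in xs]

high_angle_factor = 1.0                             # hypothetical user input
flat_guess = high_angle_factor * sum(ys[-5:]) / 5   # from the last few points
start = [flat_guess, 0.0, 0.0]

sse_start = sum_sq(start, xs, ys)
params, sse_fit = nelder_mead(lambda a: sum_sq(a, xs, ys), start, [1.0, 50.0, 200.0])
print("fitted parameters:", params)
print("sum of squares: start %.3g, fit %.3g" % (sse_start, sse_fit))
```

A real scan would also need a weighting scheme that keeps diffraction peaks from pulling the curve upward. That part of the algorithm is not described in enough detail here to reproduce, so the sketch leaves it out.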
The algorithm fits both the low-angle scatter arising from powder surface roughness and the flat background expected at higher angles from off-axis-cut zero-background quartz substrates (Figure 2). Minimization is achieved using a Nelder-Mead simplex.

$$y = a_1 + \frac{a_2}{x} + \frac{a_3}{x^{2}} + \frac{a_4}{x^{3}} + a_5\, e^{(\ldots)} \qquad \text{(Eqn. 1)}$$

Here y is the fitted background and a_1 through a_6 are the six fitted parameters. The exponent of the final term involves x and a_6, but its exact form cannot be recovered from this transcription.

The advantage of using this algorithm for background subtraction is that all diffraction scans are treated the same and a very large number of data files can be handled very efficiently by listing the filenames in a batch run file. Following background subtraction, the X-ray powder diffraction data can then be further processed. For the analyses described below, the X-ray powder diffraction data were subsequently processed using PolySNAP [3,4].

Table 1. Types of Data Mining Algorithms

Classification models: version space hypotheses; decision trees and lists; instance-based classifiers; perceptron neural nets; Bayesian inference.
Regression (numerical data): linear and multiple regression; regression and model trees; adaptive neural nets, multilayer nets; genetic algorithms.
Descriptive data visualization: statistical exploratory data analysis; hierarchical clustering; K-means clustering; Expectation Maximization clustering.
Other: market basket analyses, a priori algorithms; textual analyses; image analysis and segmentation; genetic algorithms.

Figure 2. Two X-ray powder diffraction scans showing the effectiveness of the new algorithm in fitting a background. The red line is the fitted background, y in Equation 1.

The X-ray powder diffraction data were obtained with either a PAD-V or an X2 Scintag powder diffractometer equipped with a copper-target X-ray tube. Data were collected with continuous scans and electronic integration over θ. The X-ray fluorescence data were obtained with a Philips PW2400 with a chromium tube using UniQuant5, with sensitivities optimized using additional in-house calibration standards and with background channels customized to better handle the chemistries of these samples. The resulting output of oxide weight percentages was converted to moles of each element. The data were prepared such that relationships between phase composition, elemental composition, and performance could be examined.
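The conversion from oxide weight percentages to moles of each element is simple arithmetic, and a minimal sketch is given below. The oxides listed, the example composition, and the function names are hypothetical, since the paper does not name the elements involved. Moles are expressed per 100 g of sample, and oxygen is tallied alongside the cations as an assumption of this example.

```python
# Standard atomic masses in g/mol (rounded).
ATOMIC_MASS = {"O": 15.999, "Al": 26.982, "Ce": 140.116, "Zr": 91.224}

# Hypothetical oxides: formula -> (cation, cations per formula unit, oxygens per formula unit).
OXIDES = {
    "Al2O3": ("Al", 2, 3),
    "CeO2": ("Ce", 1, 2),
    "ZrO2": ("Zr", 1, 2),
}


def oxide_molar_mass(formula):
    """Molar mass of an oxide from its cation and oxygen counts."""
    cation, n_cat, n_oxy = OXIDES[formula]
    return n_cat * ATOMIC_MASS[cation] + n_oxy * ATOMIC_MASS["O"]


def moles_per_100g(oxide_wt_percent):
    """Convert {oxide: wt%} to {element: moles per 100 g of sample}."""
    moles = {}
    for formula, wt in oxide_wt_percent.items():
        cation, n_cat, n_oxy = OXIDES[formula]
        n_oxide = wt / oxide_molar_mass(formula)   # moles of oxide in 100 g
        moles[cation] = moles.get(cation, 0.0) + n_cat * n_oxide
        moles["O"] = moles.get("O", 0.0) + n_oxy * n_oxide
    return moles


sample = {"Al2O3": 80.0, "CeO2": 12.0, "ZrO2": 8.0}   # made-up composition
for element, n in sorted(moles_per_100g(sample).items()):
    print(f"{element}: {n:.4f} mol per 100 g")
```

Working in moles rather than weight percent puts light and heavy elements on a comparable footing before clustering or regression.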
Merging data from different characterization techniques yielded two data sets: (1) a set of 47 samples with X-ray powder diffraction (XRD) data, elemental compositions from X-ray fluorescence (XRF), surface area, and four measures of performance; (2) a related data set containing 86 samples with XRF data, surface area, a parameter for history (sample aging), and four measures of performance but without XRD data.

Data sets (1) and (2) were initially examined for natural groupings in the data with clustering. STATISTICA [5] was used for hierarchical clustering of the XRF, surface area, and performance data. Similarity clustering of the XRF, surface area, and performance data in various combinations with the XRD data was accomplished with the three-way multidimensional scaling of PolySNAP [6,7]. For more predictive models, regression and decision tree classification were accomplished with the open-source software WEKA [6]. Neural nets were developed using STATISTICA Neural Nets.

Figure 3. (a) The clustering of the XRPD data in data set (1) by multi-dimensional scaling in PolySNAP. (b) The clustering of the XRF data in data set (1) also by multi-dimensional scaling. Although difficult to see in these images, the cluster membership is exactly the same for both types of data.

Figure 4. From PolySNAP using data set (1), similarity clustering of a subset of the elemental data from XRF (a) without surface area and (b) with surface area included.
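Whether two clusterings share exactly the same membership, as stated for Figure 3, can be checked without regard to how each program happens to number or name its clusters. The sketch below compares the two partitions as sets of sample groups. The sample names and cluster labels are invented for illustration, and this is not a PolySNAP feature.

```python
def partition(labels):
    """Turn {sample: cluster_label} into a set of frozensets of sample names."""
    groups = {}
    for sample, label in labels.items():
        groups.setdefault(label, set()).add(sample)
    return {frozenset(members) for members in groups.values()}


def same_membership(labels_a, labels_b):
    """True when both clusterings group the samples identically, whatever the label names."""
    return partition(labels_a) == partition(labels_b)


# Invented example: cluster labels from two different kinds of data.
xrd_clusters = {"S1": 0, "S2": 0, "S3": 1, "S4": 2}
xrf_clusters = {"S1": "c", "S2": "c", "S3": "a", "S4": "b"}
print(same_membership(xrd_clusters, xrf_clusters))
```

Comparing partitions rather than raw labels matters because two clustering runs rarely assign the same label to the same group.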

As illustrated in Figure 3, cluster membership is found to be the same for both types of X-ray data, XRD and XRF. Therefore, the phase composition has a strong relationship to elemental composition: different variations in specimen composition are related to the presence of different phases. Examination of the cluster membership shows that the members accurately reflect a descriptor, "sample type", that was derived from other information unrelated to any chemical or characterization measurements; for example, sample type reflects the source from which the chemicals originated.

Figure 5. Similarity clustering from PolySNAP that results from adding XRD data to surface area and a subset of XRF elemental data (data set 1).

Our knowledge of the samples tells us that not all of the elemental composition should be related to sample type. Manually selecting a subset of the XRF data enables probing relationships beyond the influence of sample type. However, the subset of XRF data exhibits relatively weak clustering (Figure 4a). Including surface area with the XRF data changes the clustering membership (Figure 4b) but does not strengthen the relationships. Hierarchical clustering using complete linkage with Euclidean distances for the same subset of XRF data, but from the larger data set (2) of 86 samples, still yields poor clustering with very small linkage distances. However, inclusion of surface area in the hierarchical clustering of the larger data set does yield more numerically significant linkage distances and more distinct clusters. Not surprisingly, because the XRD data contain information so strongly related to sample type (Figure 3a), the addition of XRD clustering to the XRF subset-surface area clustering imposes a more definite structure on the overall clustering (Figure 5). Nevertheless, surface area and the XRF subset of data influence the cluster membership compared with Figure 3a.
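To make the notion of linkage distance concrete, the following is a minimal pure-Python sketch of agglomerative clustering with complete linkage and Euclidean distances. It is a generic textbook procedure, not the STATISTICA implementation, and the four points are invented. The distance recorded at each merge is the quantity whose size indicates how distinct the clusters are.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering; returns (clusters as index lists, merge distances)."""
    clusters = [[i] for i in range(len(points))]
    merge_distances = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merge_distances.append(d)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merge_distances


# Invented two-variable data: two tight pairs far apart from each other.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
clusters, distances = complete_linkage(points, n_clusters=2)
print(clusters, distances)
```

Because Euclidean distance depends on the scale of each variable, adding a quantity such as surface area can change the linkage distances considerably unless the variables are scaled first. Whether and how the original analysis scaled its inputs is not stated.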
Examination of the clustering relationships for the four measures of performance indicates that the performance data alone show no strong tendency to cluster.

The larger data set (2) of 86 samples, which lacks XRD data, was used to test for efficacy in predicting performance. For the prediction model building, selection among the possible twenty-one primary variables, thirteen derived variables, and four response variables was accomplished either by the independent feature selection heuristic of STATISTICA Data Miner or by using, in each technique, the embedded algorithms that selectively add or subtract parameters. Rather surprisingly, linear regression models for all four measures of performance could be found with correlation coefficients of 0.84 or higher. Different combinations of XRF elements, surface area, and history parameter yield statistically comparable models, although all models included sample type. Decision tree classification shows that sample type can be defined with very few errors from relatively few splits of the XRF-based compositions, which is consistent with the clustering observed using PolySNAP (Figure 3b).

To examine the influence of sample type on the regression models, sample type and the elements defining sample type were excluded, but the history parameter and various combinations of surface area with the remaining XRF elements were included. Nevertheless, the correlation coefficients for the linear regression models dropped significantly, to a range beginning at 0.77. This leads us to conclude that the measures of performance that were tested do depend to some extent on aging history of the specimens, surface area, and other aspects of composition besides sample type, but that for these particular materials, sample type is a significant factor related to the performance of the materials.

Predictive models developed using neural nets show the same trend: predictions are notably degraded without the inclusion of sample type. The predictive capabilities of neural net models are further degraded if multiple predictions are attempted. This may suggest that the parameters remaining after removing sample type are only weakly related to performance and are insufficiently independent to successfully predict material performance.
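The idea that sample type can be recovered from a few splits of composition can be illustrated with the simplest possible tree, a single threshold on one variable. The sketch below searches for the threshold that misclassifies the fewest samples. It is a generic decision stump rather than the WEKA algorithm used in the study, and the composition values and type labels are invented.

```python
from collections import Counter


def best_split(values, labels):
    """Return (threshold, errors) for the single split with the fewest misclassifications."""
    pairs = sorted(zip(values, labels))
    best = (None, len(pairs) + 1)
    for k in range(1, len(pairs)):
        low, high = pairs[k - 1][0], pairs[k][0]
        if low == high:
            continue  # no threshold can separate identical values
        left = Counter(label for _, label in pairs[:k])
        right = Counter(label for _, label in pairs[k:])
        # Each side predicts its majority label; everything else is an error.
        errors = (k - max(left.values())) + (len(pairs) - k - max(right.values()))
        if errors < best[1]:
            best = ((low + high) / 2, errors)
    return best


# Invented data: moles of one element and the sample type of each specimen.
element_moles = [10.0, 12.0, 14.0, 30.0, 32.0, 34.0]
sample_type = ["A", "A", "A", "B", "B", "B"]
threshold, errors = best_split(element_moles, sample_type)
print(threshold, errors)
```

A full decision tree repeats this search recursively on each side of the split and across all variables, which is how a small number of splits can separate several sample types.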
Conclusions

XRD phase composition and XRF elemental composition were found to yield the same clustering and, hence, both types of X-ray data have a strong relationship to each other in the specimens examined. Cluster membership of the X-ray data was found to be indicative of an unrelated descriptor, sample type. Models developed for these data sets needed the inclusion of sample type to be effective in predicting performance. Although the dependence on sample type is, perhaps, not surprising in retrospect, models independent of sample type would be more useful. Hence, the next step for extending our data mining is to find other descriptors that improve prediction of performance without requiring the inclusion of sample type in the model. Dimensionality reduction of spectral-type X-ray data may yield other descriptors useful for modeling performance. Improved predictive models would guide us to other regions in parameter space in which to search for new or optimized materials.

References

1. Data Mining, Ian Witten and Eibe Frank (2000); Machine Learning, Tom Mitchell (1997); Data Mining: Concepts and Techniques, J. Han and M. Kamber (2001); Data Mining: Concepts, Models, Methods, and Algorithms, Mehmed Kantardzic (2003).
2. Data Preparation for Data Mining, Dorian Pyle (1999).
3. PolySNAP, Bruker AXS; also G. Barr, W. Dong, C.J. Gilmore (2004). PolySNAP: a computer program for analysing high-throughput powder diffraction data, J. Appl. Cryst. 37.
4. C.J. Gilmore, G. Barr, J. Paisley (2004). High-throughput powder diffraction. I. A new approach to qualitative and quantitative powder diffraction pattern analysis using full pattern profiles, J. Appl. Cryst. 37, 231; G. Barr, W. Dong, C.J. Gilmore (2004). High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis, J. Appl. Cryst. 37.
5. StatSoft, Inc. (2005). STATISTICA 7.1 or STATISTICA Data Miner.
6. Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco; see also Weka 3: Data Mining Software in Java.
7. Using WEKA's Greedy algorithm for linear regression models with the outlier (sample P31) removed.


KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

Quality Control of National Genetic Evaluation Results Using Data-Mining Techniques; A Progress Report

Quality Control of National Genetic Evaluation Results Using Data-Mining Techniques; A Progress Report Quality Control of National Genetic Evaluation Results Using Data-Mining Techniques; A Progress Report G. Banos 1, P.A. Mitkas 2, Z. Abas 3, A.L. Symeonidis 2, G. Milis 2 and U. Emanuelson 4 1 Faculty

More information

Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Essential Components of an Integrated Data Mining Tool for the Oil & Gas Industry, With an Example Application in the DJ Basin.

Essential Components of an Integrated Data Mining Tool for the Oil & Gas Industry, With an Example Application in the DJ Basin. Essential Components of an Integrated Data Mining Tool for the Oil & Gas Industry, With an Example Application in the DJ Basin. Petroleum & Natural Gas Engineering West Virginia University SPE Annual Technical

More information

The Predictive Data Mining Revolution in Scorecards:

The Predictive Data Mining Revolution in Scorecards: January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

More information

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries Aida Mustapha *1, Farhana M. Fadzil #2 * Faculty of Computer Science and Information Technology, Universiti Tun Hussein

More information

Random forest algorithm in big data environment

Random forest algorithm in big data environment Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest

More information

MSCA 31000 Introduction to Statistical Concepts

MSCA 31000 Introduction to Statistical Concepts MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced

More information

MS1b Statistical Data Mining

MS1b Statistical Data Mining MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to

More information

Big Data: Rethinking Text Visualization

Big Data: Rethinking Text Visualization Big Data: Rethinking Text Visualization Dr. Anton Heijs [email protected] Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important

More information

How To Predict Web Site Visits

How To Predict Web Site Visits Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many

More information

Data Mining and Business Intelligence CIT-6-DMB. http://blackboard.lsbu.ac.uk. Faculty of Business 2011/2012. Level 6

Data Mining and Business Intelligence CIT-6-DMB. http://blackboard.lsbu.ac.uk. Faculty of Business 2011/2012. Level 6 Data Mining and Business Intelligence CIT-6-DMB http://blackboard.lsbu.ac.uk Faculty of Business 2011/2012 Level 6 Table of Contents 1. Module Details... 3 2. Short Description... 3 3. Aims of the Module...

More information

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

College of Health and Human Services. Fall 2013. Syllabus

College of Health and Human Services. Fall 2013. Syllabus College of Health and Human Services Fall 2013 Syllabus information placement Instructor description objectives HAP 780 : Data Mining in Health Care Time: Mondays, 7.20pm 10pm (except for 3 rd lecture

More information

Ensemble Data Mining Methods

Ensemble Data Mining Methods Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

More information

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery

A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error

More information

Three Perspectives of Data Mining

Three Perspectives of Data Mining Three Perspectives of Data Mining Zhi-Hua Zhou * National Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China Abstract This paper reviews three recent books on data mining

More information

King Saud University

King Saud University King Saud University College of Computer and Information Sciences Department of Computer Science CSC 493 Selected Topics in Computer Science (3-0-1) - Elective Course CECS 493 Selected Topics: DATA MINING

More information

Introduction to X-Ray Powder Diffraction Data Analysis

Introduction to X-Ray Powder Diffraction Data Analysis Introduction to X-Ray Powder Diffraction Data Analysis Center for Materials Science and Engineering at MIT http://prism.mit.edu/xray An X-ray diffraction pattern is a plot of the intensity of X-rays scattered

More information

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,

More information

Learning is a very general term denoting the way in which agents:

Learning is a very general term denoting the way in which agents: What is learning? Learning is a very general term denoting the way in which agents: Acquire and organize knowledge (by building, modifying and organizing internal representations of some external reality);

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support

DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support Rok Rupnik, Matjaž Kukar, Marko Bajec, Marjan Krisper University of Ljubljana, Faculty of Computer and Information

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Machine Learning. 01 - Introduction

Machine Learning. 01 - Introduction Machine Learning 01 - Introduction Machine learning course One lecture (Wednesday, 9:30, 346) and one exercise (Monday, 17:15, 203). Oral exam, 20 minutes, 5 credit points. Some basic mathematical knowledge

More information

Data Mining and Neural Networks in Stata

Data Mining and Neural Networks in Stata Data Mining and Neural Networks in Stata 2 nd Italian Stata Users Group Meeting Milano, 10 October 2005 Mario Lucchini e Maurizo Pisati Università di Milano-Bicocca [email protected] [email protected]

More information

Predictive Modeling and Big Data

Predictive Modeling and Big Data Predictive Modeling and Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation

More information

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS Mrs. Jyoti Nawade 1, Dr. Balaji D 2, Mr. Pravin Nawade 3 1 Lecturer, JSPM S Bhivrabai Sawant Polytechnic, Pune (India) 2 Assistant

More information

Predicting Student Performance by Using Data Mining Methods for Classification

Predicting Student Performance by Using Data Mining Methods for Classification BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

More information

Data Mining for Fun and Profit

Data Mining for Fun and Profit Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools

More information

Neural Networks in Data Mining

Neural Networks in Data Mining IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V6 PP 01-06 www.iosrjen.org Neural Networks in Data Mining Ripundeep Singh Gill, Ashima Department

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

An Introduction to Advanced Analytics and Data Mining

An Introduction to Advanced Analytics and Data Mining An Introduction to Advanced Analytics and Data Mining Dr Barry Leventhal Henry Stewart Briefing on Marketing Analytics 19 th November 2010 Agenda What are Advanced Analytics and Data Mining? The toolkit

More information