On the Role of Data Mining Techniques in Uncertainty Quantification

Size: px
Start display at page:

Download "On the Role of Data Mining Techniques in Uncertainty Quantification"

Transcription

1 On the Role of Data Mining Techniques in Uncertainty Quantification Chandrika Kamath Lawrence Livermore National Laboratory Livermore, CA, USA USA/South America Symposium on Stochastic Modeling and Uncertainty Quantification August 2011 LLNL-PRES This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 1 / 39

2 Outline 1 Understanding data mining: A sample problem 2 Overview of the data mining process 3 Data mining techniques in uncertainty quantification Preprocessing data Identifying objects Extracting features Reducing the dimension Building the models Generating the data 4 Closing thoughts Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 2 / 39

3 Understanding data mining: A sample problem Understanding scientific data mining using a simple example Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 3 / 39

4 Understanding data mining: A sample problem Classification of orbits in Poincaré plots Quasiperiodic Island chain Separatrix Stochastic Task: Given a training set of example orbits, with their associated classes, build a predictive model which, given an orbit, will assign it a class. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 4 / 39

5 Understanding data mining: A sample problem First, we need to identify a training set - this is challenging The labeling is tedious, subjective, and error-prone. There is variation within orbits of a class. The orbit may change class as more points are added. The orbits exhibit multi-scale and fractal behavior. The data are noisy. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 5 / 39

6 Understanding data mining: A sample problem Next, we need to find a suitable representation for the orbit Given: the (x, y) coordinates of the points in an orbit Want: to differentiate among the orbits based on their shape = we need to find an appropriate representation of the orbit Objects: the parts of the data which are of interest (e.g., an orbit) Features: low level measurements representing the objects (e.g., #islands) Good features are representative can be used to differentiate between classes are invariant to scale, rotation, and translation are robust (insensitive to small changes in data) Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 6 / 39

7 Understanding data mining: A sample problem We extracted several features for the orbits Convert the data to polar coordinates Consider the points in a window around each point; scale so δr = 1.0 Extract simple features: concentration of points, error in fitting a second order polynomial, quadrat distribution of points,... Extract derived features: number of times the peaks/valleys in concentration and error are aligned,... Conversion to polar coordinates Characteristics of a separatrix Plot of lserr vs. conc Appropriate features are essential to building an accurate predictive model. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 7 / 39

8 Understanding data mining: A sample problem The original raw data is now an object x feature matrix Given the orbits represented as points in d-dimensional feature space,... f 1 f 2... f d... we want to build a predictive model to separate the classes. O 1 f 11 f f 1d O 2 f 21 f f 2d O n f n1 f n2... f nd But first, it is helpful to take a closer look at the matrix Some features may be correlated Others may be irrelevant to the class label The samples required to build an accurate model grow exponentially with number of features Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 8 / 39

9 Understanding data mining: A sample problem We can use visualization tools to explore the matrix,... Parallel plot: f7, f10, and f15 are discriminatory Scatter plot of feature f10 vs f7 Simple visual tools are invaluable in identifying outliers and correlated variables evaluating the quality of features But unsuitable if the data sets are very large or high dimensional Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 9 / 39

10 Understanding data mining: A sample problem... or dimension reduction to identify key features distance t-statistic chi-squared stump filter filter filter filter f3 f8 f8 f18 f8 f3 f7 f3 f5 f5 f3 f6 f6 f6 f5 f5 f7 f7 f6 f8 f18 f16 f18 f7 f16 f4 f16 f4 Features selected as important relate to errors in fitting polynomial to points in a window (f3, f5, f6) the spread of points in a window (f7, f8) the trend in error vs. concentration (f16) gaps in the orbit larger than 10 degrees (f18, a binary variable) Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 10 / 39

11 Understanding data mining: A sample problem We can now use the data to build a predictive model Decision tree Splitting criterion: Gini: split which most reduces node impurity Information gain: split which most reduces entropy 1884 examples: 778 quasiperiodic 440 island chains 416 stochastic 250 separatrix To assign a label to an orbit, extract the features and apply the model. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 11 / 39

12 Overview of the data mining process Overview of the scientific data mining process Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 12 / 39

13 Overview of the data mining process Data mining terminology Data mining: the semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data Pattern recognition: the discovery and characterization of patterns Pattern: an ordering with underlying structure Feature: an extractable measurement or attribute Data mining borrows ideas from many fields, including machine learning, statistics, pattern recognition, signal/image/video processing, linear algebra, mathematical optimization, high-performance computing,... Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 13 / 39

14 Overview of the data mining process The steps in mining science data are driven by the raw data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 14 / 39

15 Overview of the data mining process Observations on the data mining process Data mining is an iterative and interactive process. It requires close collaboration with the domain scientists. The pre-processing step is critical; it is problem-dependent, and often the most time consuming part. Software can seldom be used as a black-box. Not all tasks are used in all problems. Some algorithms can be used in more than one task. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 15 / 39

16 Data mining techniques in uncertainty quantification Data mining techniques in uncertainty quantification Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 16 / 39

17 Data mining techniques in uncertainty quantification Preprocessing data Preprocessing data to improve their quality Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 17 / 39

18 Data mining techniques in uncertainty quantification Preprocessing data Experimental data often require cleanup before analysis Denoising: filters (mean, Gaussian), model-based methods, PDEs,... Contrast enhancement: histogram equalization, Retinex,... Original Denoised Original After Retinex Comparing simulations to experiments for Richtmyer-Meshkov instability Extracting statistics from images of materials fragmentation Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 18 / 39

19 Data mining techniques in uncertainty quantification Identifying objects Identifying objects of interest in the data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 19 / 39

20 Data mining techniques in uncertainty quantification Identifying objects We can focus on the boundaries of the objects,... Gradient-based methods: simple filters, Canny edge detector, SUSAN,... PDE-based methods: level sets Denoised experimental data Original experimental data Simulation data Comparing simulations to experiments in Richtmyer-Meshkov instability Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 20 / 39

21 Data mining techniques in uncertainty quantification Identifying objects... or, we can focus on the interior of the objects,... Region growing methods, watershed approaches,... PDE-based methods: active contours without edges Finding the bubble surface in 3-D Identifying the bubbles in 2-D Identifying bubbles and spikes in Rayleigh-Taylor instability Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 21 / 39

22 Data mining techniques in uncertainty quantification Identifying objects... or, we can use domain specific methods Fit elliptic Gaussians to the blobs Identifying bent double galaxies in the FIRST survey Find threshold using histogram Extracting statistics on material fragments Find threshold using heuristics Finding coherent structures in fusion plasma simulations Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 22 / 39

23 Data mining techniques in uncertainty quantification Extracting features Extracting representative features for the objects Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 23 / 39

24 Data mining techniques in uncertainty quantification Extracting features Features are usually problem- and data-dependent Extract angle-based features Identifying bent double galaxies in the FIRST survey Comparisons in Richtmyer-Meshkov instability Extract features representing texture, shape,... Similarity-based object retrieval for simulations Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 24 / 39

25 Data mining techniques in uncertainty quantification Reducing the dimension Reducing the dimension of the data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 25 / 39

26 Data mining techniques in uncertainty quantification Reducing the dimension There are several reasons we need to reduce the dimension The dimension of a problem = number of features describing an object We want to identify the important variables to monitor. Lower dimensional data are easier to visualize and understand. Irrelevant features may lower the accuracy of classifiers. More samples are needed to cover a high dimensional space. Nearest-neighbor may lose meaning in high dimensional spaces. The computational cost to extract, store, and use the features may be a consideration. Data structures for fast searches do not scale to high dimensions. Domain scientists can also identify features that are likely to be important. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 26 / 39

27 Data mining techniques in uncertainty quantification Reducing the dimension Dimension reduction can be done by feature selection,... Feature selection techniques are used in classification and regression. Distance filter Filter methods (distance filter, stump filter, Relief,...) evaluate how well the features discriminate among the classes Relief Wrapper methods use the model to evaluate the subset selected: Forward search Backward search Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 27 / 39

28 Data mining techniques in uncertainty quantification Reducing the dimension or by using feature transform methods Principal component analysis is a commonly used linear technique Non-linear methods include: Isomap, LLE, Laplacian eigenmaps,... Random projections (the Johnson-Lindenstrauss lemma) are an option. The connection to the original features is no longer clear. The non-linear techniques are based on nearest neighbors. Some methods are computationally expensive. The intrinsic dimensionality of the data is of concern. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 28 / 39

29 Data mining techniques in uncertainty quantification Reducing the dimension Our experiences indicate feature selection works well FirstTriples Wind stump filter distance filter chi-squared filter isomapy_ep36 isomapy_k11 lley_k41 lapy_ep36_t stump filter distance filter chi-squared filter isomapy_ep28 lley_ep28 lapy_ep22_t Percentage Error Rate Percentage error rate Number of Features Number of features Identifying bent-double galaxies in the FIRST survey Identifying weather conditions associated with ramp events in wind generation Feature selection techniques tend to be more accurate, give interpretable results, and are computationally less expensive. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 29 / 39

30 Data mining techniques in uncertainty quantification Building the models Building the models Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 30 / 39

31 Data mining techniques in uncertainty quantification Building the models We can build different kinds of models from data Scaling laws Predictive models (classification/regression) Descriptive models (clustering) Power law model for bubble counts in Rayleigh-Taylor instability Examples: decision trees, support vector machines, neural networks, locally weighted learning, naïve Bayes,..., and ensembles Examples: k-means, agglomerative and divisive methods, graph-based approaches... We have a choice of many algorithms. In our experience, the results are dependent more on the data and the pre-processing, than the algorithm. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 31 / 39

32 Data mining techniques in uncertainty quantification Building the models There are several considerations in building a model Classification: unbalanced training set incorrect labels need for interpretable models Clustering: defining a similarity metric Scaling laws: confirming that it is indeed a power law Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 32 / 39

33 Data mining techniques in uncertainty quantification Generating the data The role of analysis techniques in generating the data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 33 / 39

34 Data mining techniques in uncertainty quantification Generating the data There is also interest in the process of generating the data Motivation: low quality of data, easy availability of computational resources, new kinds of problems being addressed,... There is a large amount of unlabeled data, but labeling is expensive - which samples should we label? We want to run simulations to understand how an experiment will perform - what should be the inputs? How do we ensure there is enough variation in the ensembles? We want to understand the design space to create a new material using simulations - where should we place our sample points? Inverse problems - what input should I use to get a desired output? My simulations are expensive; is there a surrogate I can use? The intent is to close the gap between data acquisition and model building. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 34 / 39

35 Data mining techniques in uncertainty quantification Generating the data Several ideas from data mining are relevant Active learning, semi-supervised learning, relevance feedback,... Code surrogates, meta-modeling,... Dimension reduction Multi-task learning Sampling Design of (computational) experiments This is an active area of research with many ideas being explored. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 35 / 39

36 Closing thoughts Closing Thoughts Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 36 / 39

37 Closing thoughts Caveats and notes of caution Scientific data mining is a careful and considered application of techniques in close collaboration with the domain scientists. A blind use of techniques/software is not recommended. Data pre-processing is a critical, but time-consuming, part of the process. Using more than one technique to solve a problem is helpful. Analysis results must be interpreted carefully before making a decision. Many different fields contribute to data mining, each with its own perspectives and insights. The first principle is that you must not fool yourself - and you are the easiest person to fool. Richard P. Feynmann Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 37 / 39

38 Closing thoughts Challenges and current topics of research Streaming data: concept drift, real-time analysis,... Size of the data, exa-scale computing,... Complex data: multivariate, multi-sensor, multi-scale, multi-modal,... New analysis problems: application of techniques to generation of data, problems in simulation data,... Reasoning and decision making in the presence of uncertainty... Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 38 / 39

39 Acknowledgments Closing thoughts Data miners and software developers: Erick Cantú-Paz, Samson Cheung, Ya Ju Fan, Abel Gezahegne, Cyrus Harrison, Thinh Nguyen, Nu Ai Tang,... Domain scientists: Robert Becker and the FIRST astronomers, Joshua Breslau, Omar Hurricane, Jeffrey Greenough, Jeffrey Jacobs, Scott Klasky, Zhihong Lin, Paul Miller, Donald Monticello, Neil Pomphrey,... Funding sources: ASC program (DOE), ASCR Base Research Program (DOE), the LDRD program at Lawrence Livermore National Laboratory, SciDAC program (DOE), and WHTP/EERE (DOE). For more details Chandrika Kamath, Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 39 / 39

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Exploratory Data Analysis with MATLAB

Exploratory Data Analysis with MATLAB Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University [email protected] CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Data Mining mit der JMSL Numerical Library for Java Applications

Data Mining mit der JMSL Numerical Library for Java Applications Data Mining mit der JMSL Numerical Library for Java Applications Stefan Sineux 8. Java Forum Stuttgart 07.07.2005 Agenda Visual Numerics JMSL TM Numerical Library Neuronale Netze (Hintergrund) Demos Neuronale

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

ADVANCED MACHINE LEARNING. Introduction

ADVANCED MACHINE LEARNING. Introduction 1 1 Introduction Lecturer: Prof. Aude Billard ([email protected]) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures

More information

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

NONLINEAR TIME SERIES ANALYSIS

NONLINEAR TIME SERIES ANALYSIS NONLINEAR TIME SERIES ANALYSIS HOLGER KANTZ AND THOMAS SCHREIBER Max Planck Institute for the Physics of Complex Sy stems, Dresden I CAMBRIDGE UNIVERSITY PRESS Preface to the first edition pug e xi Preface

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,

More information

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19 PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

KEITH LEHNERT AND ERIC FRIEDRICH

KEITH LEHNERT AND ERIC FRIEDRICH MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm [email protected] Rome, 29

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

1. Classification problems

1. Classification problems Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification

More information

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization Ángela Blanco Universidad Pontificia de Salamanca [email protected] Spain Manuel Martín-Merino Universidad

More information

Analysis Tools and Libraries for BigData

Analysis Tools and Libraries for BigData + Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376 Course Director: Dr. Kayvan Najarian (DCM&B, [email protected]) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data 100 001 010 111 From Raw Data to 10011100 Actionable Insights with 00100111 MATLAB Analytics 01011100 11100001 1 Access and Explore Data For scientists the problem is not a lack of available but a deluge.

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

Mining. Practical. Data. Monte F. Hancock, Jr. Chief Scientist, Celestech, Inc. CRC Press. Taylor & Francis Group

Mining. Practical. Data. Monte F. Hancock, Jr. Chief Scientist, Celestech, Inc. CRC Press. Taylor & Francis Group Practical Data Mining Monte F. Hancock, Jr. Chief Scientist, Celestech, Inc. CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint of the Taylor Ei Francis Group, an Informs

More information

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML www.bsc.es A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML Josep Ll. Berral, Nicolas Poggi, David Carrera Workshop on Big Data Benchmarks Toronto, Canada 2015 1 Context ALOJA: framework

More information

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

What is Data mining?

What is Data mining? STAT : DATA MIIG Javier Cabrera Fall Business Question Answer Business Question What is Data mining? Find Data Data Processing Extract Information Data Analysis Internal Databases Data Warehouses Internet

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Data Mining Part 5. Prediction

Data Mining Part 5. Prediction Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

More information

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

MS1b Statistical Data Mining

MS1b Statistical Data Mining MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

A New Approach for Evaluation of Data Mining Techniques

A New Approach for Evaluation of Data Mining Techniques 181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

More information

Advanced In-Database Analytics

Advanced In-Database Analytics Advanced In-Database Analytics Tallinn, Sept. 25th, 2012 Mikko-Pekka Bertling, BDM Greenplum EMEA 1 That sounds complicated? 2 Who can tell me how best to solve this 3 What are the main mathematical functions??

More information

Galaxy Morphological Classification

Galaxy Morphological Classification Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,

More information

King Saud University

King Saud University King Saud University College of Computer and Information Sciences Department of Computer Science CSC 493 Selected Topics in Computer Science (3-0-1) - Elective Course CECS 493 Selected Topics: DATA MINING

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for

More information

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd [email protected]/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Algebra 1 Course Information

Algebra 1 Course Information Course Information Course Description: Students will study patterns, relations, and functions, and focus on the use of mathematical models to understand and analyze quantitative relationships. Through

More information

TDA and Machine Learning: Better Together

TDA and Machine Learning: Better Together TDA and Machine Learning: Better Together TDA AND MACHINE LEARNING: BETTER TOGETHER 2 TABLE OF CONTENTS The New Data Analytics Dilemma... 3 Introducing Topology and Topological Data Analysis... 3 The Promise

More information

Classification Techniques for Remote Sensing

Classification Techniques for Remote Sensing Classification Techniques for Remote Sensing Selim Aksoy Department of Computer Engineering Bilkent University Bilkent, 06800, Ankara [email protected] http://www.cs.bilkent.edu.tr/ saksoy/courses/cs551

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Big Data: Image & Video Analytics

Big Data: Image & Video Analytics Big Data: Image & Video Analytics How it could support Archiving & Indexing & Searching Dieter Haas, IBM Deutschland GmbH The Big Data Wave 60% of internet traffic is multimedia content (images and videos)

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

A Simple Feature Extraction Technique of a Pattern By Hopfield Network

A Simple Feature Extraction Technique of a Pattern By Hopfield Network A Simple Feature Extraction Technique of a Pattern By Hopfield Network A.Nag!, S. Biswas *, D. Sarkar *, P.P. Sarkar *, B. Gupta **! Academy of Technology, Hoogly - 722 *USIC, University of Kalyani, Kalyani

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information