On the Role of Data Mining Techniques in Uncertainty Quantification

Similar documents
The Scientific Data Mining Process

Environmental Remote Sensing GEOG 2021

Supervised Feature Selection & Unsupervised Dimensionality Reduction

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Social Media Mining. Data Mining Essentials

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Exploratory Data Analysis with MATLAB

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Azure Machine Learning, SQL Data Mining and R

Chapter ML:XI (continued)

Introduction to Pattern Recognition

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Data Mining mit der JMSL Numerical Library for Java Applications

An Overview of Knowledge Discovery Database and Data mining Techniques

Machine Learning with MATLAB David Willingham Application Engineer

Data Mining Classification: Decision Trees

ADVANCED MACHINE LEARNING. Introduction

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye MSRC

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Machine Learning using MapReduce

A Survey on Pre-processing and Post-processing Techniques in Data Mining

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Sanjeev Kumar. contribute

Data Preprocessing. Week 2

NONLINEAR TIME SERIES ANALYSIS

Principles of Data Mining by Hand&Mannila&Smyth

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Knowledge Discovery from patents using KMX Text Analytics

KEITH LEHNERT AND ERIC FRIEDRICH

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Neural Networks Lesson 5 - Cluster Analysis

Chapter 12 Discovering New Knowledge Data Mining

1. Classification problems

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Analysis Tools and Libraries for BigData

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

Distributed forests for MapReduce-based machine learning

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Supervised Learning (Big Data Analytics)

STA 4273H: Statistical Machine Learning

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

Decision Trees from large Databases: SLIQ

Clustering & Visualization

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

Chapter 6. The stacking ensemble approach

Predict Influencers in the Social Network

An Introduction to Data Mining

Mining. Practical. Data. Monte F. Hancock, Jr. Chief Scientist, Celestech, Inc. CRC Press. Taylor & Francis Group

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Statistics for BIG data

Data Mining Practical Machine Learning Tools and Techniques

Lecture 9: Introduction to Pattern Analysis

What is Data mining?

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Statistical Models in Data Mining

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Data Mining Part 5. Prediction

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Unsupervised Data Mining (Clustering)

Comparison of K-means and Backpropagation Data Mining Algorithms

Information Management course

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Data Mining. Nonlinear Classification

MS1b Statistical Data Mining

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

DATA MINING TECHNIQUES AND APPLICATIONS

A New Approach for Evaluation of Data Mining Techniques

Component Ordering in Independent Component Analysis Based on Data Power

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Advanced In-Database Analytics

Galaxy Morphological Classification

King Saud University

Analecta Vol. 8, No. 2 ISSN

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

not possible or was possible at a high cost for collecting the data.

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

Classification algorithm in Data mining: An Overview

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Algebra 1 Course Information

TDA and Machine Learning: Better Together

Classification Techniques for Remote Sensing

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Big Data: Image & Video Analytics

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

A Simple Feature Extraction Technique of a Pattern By Hopfield Network

Cluster Analysis: Advanced Concepts

Transcription:

On the Role of Data Mining Techniques in Uncertainty Quantification Chandrika Kamath Lawrence Livermore National Laboratory Livermore, CA, USA kamath2@llnl.gov USA/South America Symposium on Stochastic Modeling and Uncertainty Quantification August 2011 LLNL-PRES-491401 This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 1 / 39

Outline 1 Understanding data mining: A sample problem 2 Overview of the data mining process 3 Data mining techniques in uncertainty quantification Preprocessing data Identifying objects Extracting features Reducing the dimension Building the models Generating the data 4 Closing thoughts Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 2 / 39

Understanding data mining: A sample problem Understanding scientific data mining using a simple example Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 3 / 39

Understanding data mining: A sample problem Classification of orbits in Poincaré plots Quasiperiodic Island chain Separatrix Stochastic Task: Given a training set of example orbits, with their associated classes, build a predictive model which, given an orbit, will assign it a class. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 4 / 39

Understanding data mining: A sample problem First, we need to identify a training set - this is challenging The labeling is tedious, subjective, and error-prone. There is variation within orbits of a class. The orbit may change class as more points are added. The orbits exhibit multi-scale and fractal behavior. The data are noisy. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 5 / 39

Understanding data mining: A sample problem Next, we need to find a suitable representation for the orbit Given: the (x, y) coordinates of the points in an orbit Want: to differentiate among the orbits based on their shape = we need to find an appropriate representation of the orbit Objects: the parts of the data which are of interest (e.g., an orbit) Features: low level measurements representing the objects (e.g., #islands) Good features are representative can be used to differentiate between classes are invariant to scale, rotation, and translation are robust (insensitive to small changes in data) Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 6 / 39

Understanding data mining: A sample problem We extracted several features for the orbits Convert the data to polar coordinates Consider the points in a window around each point; scale so δr = 1.0 Extract simple features: concentration of points, error in fitting a second order polynomial, quadrat distribution of points,... Extract derived features: number of times the peaks/valleys in concentration and error are aligned,... Conversion to polar coordinates Characteristics of a separatrix Plot of lserr vs. conc Appropriate features are essential to building an accurate predictive model. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 7 / 39

Understanding data mining: A sample problem The original raw data is now an object x feature matrix Given the orbits represented as points in d-dimensional feature space,... f 1 f 2... f d... we want to build a predictive model to separate the classes. O 1 f 11 f 12... f 1d O 2 f 21 f 22... f 2d............. O n f n1 f n2... f nd But first, it is helpful to take a closer look at the matrix Some features may be correlated Others may be irrelevant to the class label The samples required to build an accurate model grow exponentially with number of features Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 8 / 39

Understanding data mining: A sample problem We can use visualization tools to explore the matrix,... Parallel plot: f7, f10, and f15 are discriminatory Scatter plot of feature f10 vs f7 Simple visual tools are invaluable in identifying outliers and correlated variables evaluating the quality of features But unsuitable if the data sets are very large or high dimensional Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 9 / 39

Understanding data mining: A sample problem... or dimension reduction to identify key features distance t-statistic chi-squared stump filter filter filter filter f3 f8 f8 f18 f8 f3 f7 f3 f5 f5 f3 f6 f6 f6 f5 f5 f7 f7 f6 f8 f18 f16 f18 f7 f16 f4 f16 f4 Features selected as important relate to errors in fitting polynomial to points in a window (f3, f5, f6) the spread of points in a window (f7, f8) the trend in error vs. concentration (f16) gaps in the orbit larger than 10 degrees (f18, a binary variable) Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 10 / 39

Understanding data mining: A sample problem We can now use the data to build a predictive model Decision tree Splitting criterion: Gini: split which most reduces node impurity Information gain: split which most reduces entropy 1884 examples: 778 quasiperiodic 440 island chains 416 stochastic 250 separatrix To assign a label to an orbit, extract the features and apply the model. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 11 / 39

Overview of the data mining process Overview of the scientific data mining process Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 12 / 39

Overview of the data mining process Data mining terminology Data mining: the semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data Pattern recognition: the discovery and characterization of patterns Pattern: an ordering with underlying structure Feature: an extractable measurement or attribute Data mining borrows ideas from many fields, including machine learning, statistics, pattern recognition, signal/image/video processing, linear algebra, mathematical optimization, high-performance computing,... Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 13 / 39

Overview of the data mining process The steps in mining science data are driven by the raw data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 14 / 39

Overview of the data mining process Observations on the data mining process Data mining is an iterative and interactive process. It requires close collaboration with the domain scientists. The pre-processing step is critical; it is problem-dependent, and often the most time consuming part. Software can seldom be used as a black-box. Not all tasks are used in all problems. Some algorithms can be used in more than one task. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 15 / 39

Data mining techniques in uncertainty quantification Data mining techniques in uncertainty quantification Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 16 / 39

Data mining techniques in uncertainty quantification Preprocessing data Preprocessing data to improve their quality Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 17 / 39

Data mining techniques in uncertainty quantification Preprocessing data Experimental data often require cleanup before analysis Denoising: filters (mean, Gaussian), model-based methods, PDEs,... Contrast enhancement: histogram equalization, Retinex,... Original Denoised Original After Retinex Comparing simulations to experiments for Richtmyer-Meshkov instability Extracting statistics from images of materials fragmentation Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 18 / 39

Data mining techniques in uncertainty quantification Identifying objects Identifying objects of interest in the data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 19 / 39

Data mining techniques in uncertainty quantification Identifying objects We can focus on the boundaries of the objects,... Gradient-based methods: simple filters, Canny edge detector, SUSAN,... PDE-based methods: level sets Denoised experimental data Original experimental data Simulation data Comparing simulations to experiments in Richtmyer-Meshkov instability Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 20 / 39

Data mining techniques in uncertainty quantification Identifying objects... or, we can focus on the interior of the objects,... Region growing methods, watershed approaches,... PDE-based methods: active contours without edges Finding the bubble surface in 3-D Identifying the bubbles in 2-D Identifying bubbles and spikes in Rayleigh-Taylor instability Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 21 / 39

Data mining techniques in uncertainty quantification Identifying objects... or, we can use domain specific methods Fit elliptic Gaussians to the blobs Identifying bent double galaxies in the FIRST survey Find threshold using histogram Extracting statistics on material fragments Find threshold using heuristics Finding coherent structures in fusion plasma simulations Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 22 / 39

Data mining techniques in uncertainty quantification Extracting features Extracting representative features for the objects Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 23 / 39

Data mining techniques in uncertainty quantification Extracting features Features are usually problem- and data-dependent Extract angle-based features Identifying bent double galaxies in the FIRST survey Comparisons in Richtmyer-Meshkov instability Extract features representing texture, shape,... Similarity-based object retrieval for simulations Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 24 / 39

Data mining techniques in uncertainty quantification Reducing the dimension Reducing the dimension of the data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 25 / 39

Data mining techniques in uncertainty quantification Reducing the dimension There are several reasons we need to reduce the dimension The dimension of a problem = number of features describing an object We want to identify the important variables to monitor. Lower dimensional data are easier to visualize and understand. Irrelevant features may lower the accuracy of classifiers. More samples are needed to cover a high dimensional space. Nearest-neighbor may lose meaning in high dimensional spaces. The computational cost to extract, store, and use the features may be a consideration. Data structures for fast searches do not scale to high dimensions. Domain scientists can also identify features that are likely to be important. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 26 / 39

Data mining techniques in uncertainty quantification Reducing the dimension Dimension reduction can be done by feature selection,... Feature selection techniques are used in classification and regression. Distance filter Filter methods (distance filter, stump filter, Relief,...) evaluate how well the features discriminate among the classes Relief Wrapper methods use the model to evaluate the subset selected: Forward search Backward search Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 27 / 39

Data mining techniques in uncertainty quantification Reducing the dimension or by using feature transform methods Principal component analysis is a commonly used linear technique Non-linear methods include: Isomap, LLE, Laplacian eigenmaps,... Random projections (the Johnson-Lindenstrauss lemma) are an option. The connection to the original features is no longer clear. The non-linear techniques are based on nearest neighbors. Some methods are computationally expensive. The intrinsic dimensionality of the data is of concern. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 28 / 39

Data mining techniques in uncertainty quantification Reducing the dimension Our experiences indicate feature selection works well FirstTriples Wind115 0.18 0.17 0.16 0.15 stump filter distance filter chi-squared filter isomapy_ep36 isomapy_k11 lley_k41 lapy_ep36_t5 0.126154 0.42 0.4 0.38 stump filter distance filter chi-squared filter isomapy_ep28 lley_ep28 lapy_ep22_t5 0.260548 Percentage Error Rate 0.14 0.13 0.12 0.11 Percentage error rate 0.36 0.34 0.32 0.3 0.1 0.28 0.09 0.26 0.08 2 4 6 8 10 12 14 16 18 20 Number of Features 0.24 5 10 15 20 Number of features Identifying bent-double galaxies in the FIRST survey Identifying weather conditions associated with ramp events in wind generation Feature selection techniques tend to be more accurate, give interpretable results, and are computationally less expensive. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 29 / 39

Data mining techniques in uncertainty quantification Building the models Building the models Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 30 / 39

Data mining techniques in uncertainty quantification Building the models We can build different kinds of models from data Scaling laws Predictive models (classification/regression) Descriptive models (clustering) Power law model for bubble counts in Rayleigh-Taylor instability Examples: decision trees, support vector machines, neural networks, locally weighted learning, naïve Bayes,..., and ensembles Examples: k-means, agglomerative and divisive methods, graph-based approaches... We have a choice of many algorithms. In our experience, the results are dependent more on the data and the pre-processing, than the algorithm. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 31 / 39

Data mining techniques in uncertainty quantification Building the models There are several considerations in building a model Classification: unbalanced training set incorrect labels need for interpretable models Clustering: defining a similarity metric Scaling laws: confirming that it is indeed a power law Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 32 / 39

Data mining techniques in uncertainty quantification Generating the data The role of analysis techniques in generating the data Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 33 / 39

Data mining techniques in uncertainty quantification Generating the data There is also interest in the process of generating the data Motivation: low quality of data, easy availability of computational resources, new kinds of problems being addressed,... There is a large amount of unlabeled data, but labeling is expensive - which samples should we label? We want to run simulations to understand how an experiment will perform - what should be the inputs? How do we ensure there is enough variation in the ensembles? We want to understand the design space to create a new material using simulations - where should we place our sample points? Inverse problems - what input should I use to get a desired output? My simulations are expensive; is there a surrogate I can use? The intent is to close the gap between data acquisition and model building. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 34 / 39

Data mining techniques in uncertainty quantification Generating the data Several ideas from data mining are relevant Active learning, semi-supervised learning, relevance feedback,... Code surrogates, meta-modeling,... Dimension reduction Multi-task learning Sampling Design of (computational) experiments This is an active area of research with many ideas being explored. Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 35 / 39

Closing thoughts Closing Thoughts Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 36 / 39

Closing thoughts Caveats and notes of caution Scientific data mining is a careful and considered application of techniques in close collaboration with the domain scientists. A blind use of techniques/software is not recommended. Data pre-processing is a critical, but time-consuming, part of the process. Using more than one technique to solve a problem is helpful. Analysis results must be interpreted carefully before making a decision. Many different fields contribute to data mining, each with its own perspectives and insights. The first principle is that you must not fool yourself - and you are the easiest person to fool. Richard P. Feynmann Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 37 / 39

Closing thoughts Challenges and current topics of research Streaming data: concept drift, real-time analysis,... Size of the data, exa-scale computing,... Complex data: multivariate, multi-sensor, multi-scale, multi-modal,... New analysis problems: application of techniques to generation of data, problems in simulation data,... Reasoning and decision making in the presence of uncertainty... Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 38 / 39

Acknowledgments Closing thoughts Data miners and software developers: Erick Cantú-Paz, Samson Cheung, Ya Ju Fan, Abel Gezahegne, Cyrus Harrison, Thinh Nguyen, Nu Ai Tang,... Domain scientists: Robert Becker and the FIRST astronomers, Joshua Breslau, Omar Hurricane, Jeffrey Greenough, Jeffrey Jacobs, Scott Klasky, Zhihong Lin, Paul Miller, Donald Monticello, Neil Pomphrey,... Funding sources: ASC program (DOE), ASCR Base Research Program (DOE), the LDRD program at Lawrence Livermore National Laboratory, SciDAC program (DOE), and WHTP/EERE (DOE). For more details https://computation.llnl.gov/casc/sapphire/ https://computation.llnl.gov/casc/starsapphire/ Chandrika Kamath, kamath2@llnl.gov Chandrika Kamath (LLNL) Data Mining & Uncertainty Quantification 39 / 39