Monday Morning Data Mining
Tim Ruhe, Statistische Methoden der Datenanalyse (Statistical Methods of Data Analysis)

Outline:
- data mining
- IceCube
- data mining in IceCube

Computer Scientists are different...

Building a model and predicting the outcome:

Can be broken down into 4 (simple) steps:
1. Find a good representation of the data
2. Find a good algorithm
3. Validate your results
4. Apply on data

IceCube in a nutshell:
- completed in December 2010
- located at the geographic South Pole
- 5160 Digital Optical Modules on 86 strings
- instrumented volume of 1 km³
- subdetectors: DeepCore and IceTop

IceCube in a nutshell:
- detection principle: Cherenkov light
- look for events of the form ν + X → e, µ, τ (+ X′)
- dominant background of atmospheric µ → use the Earth as a filter (select upgoing events only)

IceCube: scientific goals
- detection of astrophysical neutrinos
- atmospheric neutrino energy spectrum
- neutrino oscillations
- cosmic-ray (CR) anisotropy
- exotic stuff

Data Mining in IceCube:
- approx. 2600 reconstructed attributes
- data and MC do not necessarily agree
- signal/background ratio ~ 10⁻³
→ interesting for studies within the scope of machine learning

1. Finding a good representation of your data

Make sure you understand your input. Attributes can be:
- nominal: green, blue, red, yellow
- ordinal: cool, mild, hot (cool < mild < hot)
- numerical: 1, 2, 3, 4, ...

Labels can be:
- polynominal: red, green, yellow, blue
- binominal: signal, background
- numerical: 1, 2, 3, ..., 5000, ...
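
As a quick illustration (a minimal sketch in Python/pandas, not part of the original slides; the toy frame is invented), the attribute types map onto dtypes like this:

    import pandas as pd

    # Hypothetical toy frame illustrating the attribute types (not IceCube data).
    df = pd.DataFrame({
        "colour": ["green", "blue", "red", "yellow"],           # nominal
        "temperature": ["cool", "mild", "hot", "mild"],         # ordinal
        "energy": [1.0, 2.0, 3.0, 4.0],                         # numerical
        "label": ["signal", "background", "signal", "signal"],  # binominal label
    })

    df["colour"] = df["colour"].astype("category")  # unordered categories
    df["temperature"] = pd.Categorical(             # ordered: cool < mild < hot
        df["temperature"], categories=["cool", "mild", "hot"], ordered=True)

    print(df.dtypes)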

Data Preprocessing: preselection of parameters
1. Check for consistency (data vs. signal MC vs. background MC)
2. Check for missing values (NaNs, infs). How to handle the NaNs? (see next slide)
3. Eliminate the obvious (azimuth angle, timing information, ...)
4. Eliminate highly correlated and constant parameters

Data and MC preprocessing: how to handle NaNs? Several possibilities (see the sketch below):
- exclude attributes that exceed a certain number of NaNs
- replace them by:
  - minimum
  - maximum
  - average
  - median, ...
  - nothing at all
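
A minimal sketch of these options with pandas (toy values, invented column names):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                       "b": [np.inf, 2.0, 4.0, 6.0]})
    df = df.replace([np.inf, -np.inf], np.nan)  # treat infs like NaNs

    # Exclude attributes that exceed a certain NaN fraction (here 50%).
    df = df.loc[:, df.isna().mean() <= 0.5]

    # Replace the remaining NaNs by minimum, maximum, average or median:
    print(df.fillna(df.min()))
    print(df.fillna(df.max()))
    print(df.fillna(df.mean()))
    print(df.fillna(df.median()))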

Data and MC preprocessing: Feature Selection
1. Forward Selection (a sketch follows below):
- start with an empty selection
- add each unused attribute in turn and estimate the performance
- keep the attribute with the highest increase in performance
- start a new round
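
A minimal sketch of forward selection in Python (scikit-learn with a Naive Bayes learner as a stand-in; this mirrors the steps above, not the exact RapidMiner operator):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    selected, remaining, best = [], list(range(X.shape[1])), 0.0
    while remaining:
        # Estimate the performance gained by adding each unused attribute.
        scores = {f: cross_val_score(GaussianNB(), X[:, selected + [f]], y,
                                     cv=5).mean() for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:      # no increase in performance: stop
            break
        best = scores[f_best]
        selected.append(f_best)         # keep the best attribute, start new round
        remaining.remove(f_best)

    print("selected:", selected, "score:", round(best, 3))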

Data and MC preprocessing: Feature Selection
2. Backward Elimination (see the sketch after this list):
- start with the full set of attributes
- remove each of the attributes in turn and estimate the performance
- the attribute whose removal gives the least decrease in performance is removed
- start a new round
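
scikit-learn ships the same greedy search; a sketch of backward elimination with SequentialFeatureSelector (again an illustration, not the RapidMiner operator shown on the next slide):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    # Greedily drop the attribute whose removal costs the least performance,
    # until 4 attributes remain.
    sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=4,
                                    direction="backward", cv=5)
    sfs.fit(X, y)
    print("kept:", sfs.get_support(indices=True))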

Backward Elimination in RapidMiner:

Data and MC preprocessing: Feature Selection
3. Minimum Redundancy Maximum Relevance (MRMR): iteratively add the feature with the biggest relevance and the least redundancy. Quality criterion Q for a candidate feature x:

Q = R(x, y) − (1/|F_j|) · Σ_{x_j ∈ F_j} D(x, x_j)

R: relevance; D: redundancy; F_j: the already selected features
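
A minimal sketch of the MRMR ranking; choosing |correlation| for both the relevance R and the redundancy D is my assumption for illustration, the talk does not fix them:

    import numpy as np

    def mrmr(X, y, n_select):
        """Greedily maximize Q = R(x, y) - (1/|F_j|) * sum_j D(x, x_j)."""
        n = X.shape[1]
        rel = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(n)])
        selected = [int(np.argmax(rel))]           # most relevant feature first
        while len(selected) < n_select:
            best_q, best_f = -np.inf, None
            for f in range(n):
                if f in selected:
                    continue
                red = np.mean([abs(np.corrcoef(X[:, f], X[:, j])[0, 1])
                               for j in selected])
                q = rel[f] - red                   # relevance minus redundancy
                if q > best_q:
                    best_q, best_f = q, f
            selected.append(best_f)
        return selected

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)
    print(mrmr(X, y, 3))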

MRMR in RapidMiner:

Evaluating the stability of the parameter selection:
- data and MC are subject to a certain variance → this variance does influence the parameter selection!

Stability of the MRMR Selection:

Jaccard index: J(A, B) = |A ∩ B| / |A ∪ B|

Kuncheva's index (selections A, B of equal size k out of n features, r = |A ∩ B|):
I_C(A, B) = (r·n − k²) / (k·(n − k))

(Kuncheva: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.6458&rep=rep1&type=pdf)
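
Both indices in a few lines of Python (toy selections):

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def kuncheva(a, b, n):
        # Two selections of equal size k out of n features; r = |A ∩ B|.
        a, b = set(a), set(b)
        k, r = len(a), len(a & b)
        return (r * n - k ** 2) / (k * (n - k))

    print(jaccard([0, 1, 2], [1, 2, 3]))       # 0.5
    print(kuncheva([0, 1, 2], [1, 2, 3], 10))  # (2*10 - 9) / (3*7) ≈ 0.524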

2. Learning algorithms

Learners:
1. Decision Trees
2. Naive Bayes
3. k-Nearest Neighbours
4. Random Forests
5. Boosting

A bit more technically speaking:
- a set of vectors x = (x_1, x_2, ..., x_n); x_i = attribute (attributes = features, variables, parameters)
- labels y_1, y_2, ..., y_n (labels = target class)
- create a model f from your examples that predicts a y for a given x.

Constructing a simple model:

Decision Trees: Simple Classifier!

Naive Bayes:
- based on Bayes' theorem: Pr[H|E] = Pr[E|H] · Pr[H] / Pr[E]
- assumes all attributes are independent

Naive Bayes: Golf data

Naive Bayes: Play? outlook = sunny, temperature = cool, humidity = high, windy = true

Naive Bayes:
Pr[yes|E] = (2/9 · 3/9 · 3/9 · 3/9 · 9/14) / Pr[E]

Naive Bayes:
Pr[yes|E] = (2/9 · 3/9 · 3/9 · 3/9 · 9/14) / Pr[E] = 0.0053 / Pr[E]
Pr[no|E] = 0.0206 / Pr[E]
Pr[E] needs to be normalized!
Pr[yes|E] = 0.205, Pr[no|E] = 0.795
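
These numbers can be verified in a few lines; the 'no' likelihoods below (3/5 · 1/5 · 4/5 · 3/5 · 5/14) come from the same golf table:

    # Likelihoods for E = (sunny, cool, high, windy) from the golf table.
    p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
    p_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # ≈ 0.0206

    norm = p_yes + p_no                              # plays the role of Pr[E]
    print(round(p_yes / norm, 3), round(p_no / norm, 3))  # 0.205 0.795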

Naive Bayes: what if Pr[E_i|yes] = 0? Let's assume we do not have positive examples for outlook = rainy:
Pr[sunny|yes] = 4/9 → Pr[sunny|yes] = 5/12
Pr[overcast|yes] = 5/9 → Pr[overcast|yes] = 6/12
Pr[rainy|yes] = 0/9 → Pr[rainy|yes] = 1/12
Use the Laplace correction: add 1 to each count (so the denominator grows from 9 to 12)!

k-nearest Neighbours (k-nn) - memory based classifier - unsupervised - find the k neighbours closest to x and classify by majority vote - all features should be normalized

Random Forests:
- ensemble of decision trees
- developed by Leo Breiman (2001)
- no boosting between individual trees
- events are classified by the individual trees
- the average of the tree outputs is used for the final classification:

s = (1/n_trees) · Σ_i s_i
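
In scikit-learn terms (a sketch on toy data; predict_proba is exactly this average over the per-tree outputs):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # Signalness s = average of the individual tree outputs s_i ...
    s = np.mean([t.predict_proba(X)[:, 1] for t in rf.estimators_], axis=0)
    # ... which is exactly what the ensemble API returns:
    print(np.allclose(s, rf.predict_proba(X)[:, 1]))  # True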

Random Forests: output. MC is scaled to data expectations; choose the final cut on the signalness.

Random Forests in RapidMiner:

Weka Random Forest:

Boosting:
- uses an ensemble of weak classifiers (decision trees)
- weights are increased for falsely classified events
- a weighted vote is applied
- each classifier depends on the performance of the previous ones
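
A sketch with scikit-learn's AdaBoost, assuming a recent version where the weak learner is passed as estimator; decision stumps play the role of the weak classifiers:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Each round increases the weights of misclassified events; the final
    # output is a weighted vote over the weak learners (decision stumps).
    ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0)
    print(ada.fit(X, y).score(X, y))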

AdaBoost in RapidMiner:

3. Validating the results

Split Validation:

Cross Validation:

Split Validation vs. Cross Validation:

Cross validated predictions:

Cut     Nugen       Corsika    Sum
0.990   4817 ± 44   93 ± 38    4910 ± 58
0.992   4633 ± 43   80 ± 30    4713 ± 52
0.994   4414 ± 41   57 ± 30    4471 ± 51
0.996   4122 ± 32   49 ± 26    4171 ± 41
0.998   3695 ± 46   18 ± 17    3713 ± 49
1.000   2932 ± 33   4 ± 9      2936 ± 34

Cross Validation for a limited number of examples? YES! Leave One Out!
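
All three schemes in a short scikit-learn sketch (toy data, Naive Bayes as a placeholder learner):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import (LeaveOneOut, cross_val_score,
                                         train_test_split)
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=100, random_state=0)

    # Split validation: a single train/test split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print(GaussianNB().fit(X_tr, y_tr).score(X_te, y_te))

    # 10-fold cross validation.
    print(cross_val_score(GaussianNB(), X, y, cv=10).mean())

    # Leave One Out: n folds of size 1, for very limited statistics.
    print(cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut()).mean())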

4. Application on data

Change the scaling of the Corsika (background MC) such that it matches data for signalness > 0.2:

Data/MC mismatch: Underestimation of Background

Application of RF on 10% of data:

Cut     Nugen       Corsika    Sum         Data
0.990   4817 ± 44   93 ± 38    4910 ± 58   4988
0.992   4633 ± 43   80 ± 30    4713 ± 52   4757
0.994   4414 ± 41   57 ± 30    4471 ± 51   4476
0.996   4122 ± 32   49 ± 26    4171 ± 41   4134
0.998   3695 ± 46   18 ± 17    3713 ± 49   3638
1.000   2932 ± 33   4 ± 9      2936 ± 34   2833

Possible Improvements: Ensembles

Hierarchical Clustering: Agglomerative

Hierarchical Clustering: Divisive

k-means Clustering (a sketch follows below):
- pick k means at random
- calculate the distance of each example to the means
- assign each example to the closest cluster
- recalculate the mean of each cluster
- iterate until the means no longer change
Significantly faster than hierarchical clustering, but you have to know k in advance...
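
A sketch with scikit-learn (toy blobs; note the scaling step, which the next slide insists on):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Normalize first: k-means uses Euclidean distances, so an unscaled
    # feature would otherwise dominate the clustering.
    X_scaled = StandardScaler().fit_transform(X)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
    print(km.cluster_centers_)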

Careful when using clusters: Normalize!!!

Summary:
- IceCube is interesting for detailed studies in machine learning
- studies can be carried out using RapidMiner
- MRMR for feature selection
- simple learners are good for benchmarks
- Cross Validation is good for you!
- signal/background separation using data mining is possible!
