Data mining on the rocks. T. Ruhe for the IceCube collaboration, K. Morik. GREAT workshop on Astrostatistics and data mining, 2011



Outline:
- IceCube: detector and detection principle
- Signal and Background
- Feature Selection and Performance
- Feature Selection stability
- Random Forest: training and testing
- Summary and Outlook

IceCube: The detector
- completed in December 2010
- located at the geographic South Pole
- 5160 Digital Optical Modules (DOMs) on 86 strings
- instrumented volume of 1 km³
- subdetectors: DeepCore and IceTop

IceCube detection principle: Cherenkov light detected by Digital Optical Modules (DOMs)

IceCube operation:
- IceCube has taken data in various detector configurations
- this study: 59-string configuration (IC-59)
- Pacman-like geometry
- reconstruction based on timing information from the DOMs

Signal and Background:
- Signal: atmospheric ν (+ ν from astrophysical sources), ~14180 events in 33.28 days of IC-59
- Background: (misreconstructed) atmospheric muons, ~9.699 × 10^6 events in 33.28 days of IC-59
- Signal-to-background ratio: 1.46 × 10^-3
Application of precuts:
- Signal: 10419 events in 33.28 days of IC-59
- Background: 1.68 × 10^6 events in 33.28 days of IC-59
- Ratio: 6.2 × 10^-3

RapidMiner:
- formerly known as YALE (Yet Another Learning Environment)
- developed at TU Dortmund University (group of K. Morik)
- operator based
- open source, written in Java
- Weka operators are fully included
- quite intuitive to use (personal opinion)

Preselection of parameters:
1. Check for consistency (data vs. simulation): eliminate an attribute if it is missing in one of the two (reduction by ~10–20 out of ~2600)
2. Check for missing values (NaNs, infs): eliminate if the fraction of missing values exceeds 30% (reduction to 1408 attributes)
3. Eliminate the obvious (azimuth, time, galactic coordinates, ...) (reduction to 612 attributes)
4. Eliminate highly correlated (ρ = 1.0) and constant parameters
Final set of 477 parameters.
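Steps 2–4 of the preselection can be sketched in a few lines of pandas. This is an illustrative reimplementation, not the actual analysis code (which ran in RapidMiner); column names and thresholds other than the 30% missing-value cut and ρ = 1.0 are placeholders.

```python
import numpy as np
import pandas as pd

def preselect(df, missing_frac=0.30, corr_cut=1.0, drop_cols=()):
    """Sketch of the attribute preselection described above."""
    # step 3: eliminate the obvious (e.g. azimuth, time, galactic coordinates)
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # step 2: eliminate attributes with too many missing values (NaNs, infs)
    df = df.replace([np.inf, -np.inf], np.nan)
    df = df.loc[:, df.isna().mean() <= missing_frac]
    # step 4a: eliminate constant attributes
    df = df.loc[:, df.nunique(dropna=True) > 1]
    # step 4b: eliminate perfectly correlated attributes (|rho| >= corr_cut,
    # with a small tolerance against floating-point round-off)
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] >= corr_cut - 1e-12).any()]
    return df.drop(columns=to_drop)
```

For each perfectly correlated pair only the later column is dropped, so one representative of every such group survives.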

MRMR feature selection:
- Maximum-Relevance-Minimum-Redundancy
- iteratively adds to the already selected set F_j the feature x that maximizes a quality criterion Q:

  Q(x) = rel(y, x) - (1/|F_j|) Σ_{x' ∈ F_j} red(x, x')

- relevance and redundancy automatically map to linear correlation, F-test score or mutual information (depending on attribute and class type)
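A minimal greedy implementation of this criterion might look as follows. For simplicity it uses the absolute Pearson correlation for both relevance and redundancy; as noted above, the criterion can equally use F-test scores or mutual information depending on the attribute and class types.

```python
import numpy as np

def mrmr_select(X, y, n_features):
    """Greedy mRMR sketch: pick the most relevant feature first, then
    repeatedly add the feature maximizing relevance minus the mean
    redundancy with the already selected set F_j."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = X.shape[1]
    rel = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(d)])
    selected = [int(np.argmax(rel))]          # start with the most relevant
    while len(selected) < n_features:
        best, best_q = None, -np.inf
        for i in range(d):
            if i in selected:
                continue
            red = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                           for j in selected])
            q = rel[i] - red                  # Q = rel(y, x) - mean redundancy
            if q > best_q:
                best, best_q = i, q
        selected.append(best)
    return selected
```

A feature that duplicates an already selected one gets a redundancy near 1 and is therefore penalized even if it is highly relevant on its own.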

Feature Selection and Performance:
- MRMR feature selection
- Random Forest (from the Weka package), with the number of trees matched to the number of attributes: n_trees = 10 × n_Attr
- Naive Bayes
- k-NN, k = 5, weighted vote, mixed Euclidean distance
- 10-fold cross validation
- 3.4 × 10^5 simulated signal and background events
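The 10-fold cross validation used for this comparison partitions the events into ten disjoint folds and rotates which fold is held out for testing. A minimal, self-contained sketch of the fold generation (the actual analysis used RapidMiner's built-in operator):

```python
import numpy as np

def kfold_indices(n_events, k=10, seed=0):
    """Yield (train, test) index arrays for plain k-fold cross validation:
    shuffle once, split into k folds, hold each fold out in turn."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_events)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Every event appears in exactly one test fold, so each classifier is evaluated on data it never saw during training.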

Stability of the MRMR selection:

Jaccard index:
  J(A, B) = |A ∩ B| / |A ∪ B|

Kuncheva's index (with |A| = |B| = k, r = |A ∩ B|, and n features in total):
  I_C(A, B) = (r·n - k²) / (k·(n - k))
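Both stability measures compare two selected feature sets and are straightforward to compute; a short sketch:

```python
def jaccard(a, b):
    """Jaccard index J(A, B) = |A ∩ B| / |A ∪ B| of two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def kuncheva(a, b, n):
    """Kuncheva's consistency index I_C(A, B) = (r*n - k^2) / (k*(n - k))
    for equally sized sets (|A| = |B| = k, r = |A ∩ B|, n features total).
    Unlike Jaccard, it corrects for the overlap expected by chance:
    identical sets score 1, random overlap scores about 0."""
    a, b = set(a), set(b)
    k = len(a)
    assert len(b) == k and 0 < k < n
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))
```

The chance correction is why Kuncheva's index can go negative for disjoint sets, while the Jaccard index is bounded below by 0.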

Removing further correlations:
- after MRMR, some correlated attributes remain
- study the performance when these further correlations are removed

Training and Testing a Random Forest:
- 500 trees, Weka Random Forest
- 3.4 × 10^5 simulated signal events
- 3.4 × 10^5 simulated background events
- 5-fold cross validation
- 2.8 × 10^4 signal and background events for training, to avoid possible overfitting

Forest Performance and Results (background computed using a conservative estimate):

Cut      Exp. bckgr. ev.   Exp. sig. ev.   Est. purity [%]
0.990    311               5079            94.2
0.992    263               4864            94.9
0.994    215               4606            95.5
0.996    139               4271            96.8
0.998    118               3804            97.0
1.000     77               3017            97.5
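The estimated purity in the table is consistent with sig / (sig + bkg) for the events passing each confidence cut (e.g. 5079 / (5079 + 311) ≈ 94.2%). A minimal sketch of applying such a cut to forest confidence scores (the score lists here are hypothetical placeholders):

```python
def purity_at_cut(signal_scores, background_scores, cut):
    """Count signal and background events whose forest confidence passes
    the cut, and return (n_sig, n_bkg, estimated purity in percent)."""
    sig = sum(s >= cut for s in signal_scores)
    bkg = sum(s >= cut for s in background_scores)
    return sig, bkg, 100.0 * sig / (sig + bkg)
```

Tightening the cut trades expected signal events for higher purity, which is exactly the trend the table shows.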

Summary and Outlook:
- the influence of a detailed feature selection on the training of multivariate classifiers has been studied
- MRMR is stable if n_Attr. 20
- most stable forest performance when attributes with ρ > 0.9 are removed prior to MRMR
- a Random Forest was trained and tested
- more than 3000 events are found, with a purity routinely > 95%
- the analysis is not fully optimized; even better results are expected from a fully optimized classifier

Backup Slides

The Random Forest:
- developed by Leo Breiman (2001)
- utilizes an ensemble of decision trees (a "forest")
- no boosting between the individual trees
- the average over the trees is used for the final classification
- no overtraining (see Machine Learning, 45, 5-32 (2001))
- here: RapidMiner toolkit (developed in Dortmund)
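The bagging idea behind the forest can be illustrated with a toy example: each "tree" below is just a one-dimensional threshold learner trained on a bootstrap sample, and the final confidence is the average of the individual votes. This is a deliberately simplified sketch of the ensemble principle, not of Breiman's full algorithm (which also draws random feature subsets at every split).

```python
import random

def bagged_predict(X, y, x_new, n_trees=25, seed=1):
    """Toy bagging ensemble: train n_trees threshold 'stumps' on
    bootstrap samples of (X, y) and average their 0/1 votes for x_new."""
    rng = random.Random(seed)
    n = len(X)
    votes = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        xs = [X[i] for i in sample]
        ys = [y[i] for i in sample]
        # pick the sampled value that works best as a decision threshold
        best_t, best_acc = xs[0], 0.0
        for t in xs:
            acc = sum((x > t) == bool(c) for x, c in zip(xs, ys)) / n
            if acc > best_acc:
                best_t, best_acc = t, acc
        votes.append(1 if x_new > best_t else 0)
    return sum(votes) / len(votes)   # forest confidence in [0, 1]
```

Averaging many weakly correlated votes is what makes the ensemble robust: individual trees may err on a given event, but the mean confidence is much more stable than any single tree.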

Atmospheric neutrinos: A detailed understanding of the atmospheric neutrino spectrum is crucial for the detection of an astrophysical flux.