Searching for Gravitational Waves from the Coalescence of High Mass Black Hole Binaries


Searching for Gravitational Waves from the Coalescence of High Mass Black Hole Binaries

2015 SURE Presentation, September 22nd, 2015
Lau Ka Tung, Department of Physics, The Chinese University of Hong Kong
Mentors: Surabhi Sachdev, Tjonnie Li, Kent Blackburn, Alan Weinstein
LIGO Laboratory, California Institute of Technology
LIGO Scientific Collaboration

Objectives

Improve signal-to-noise discrimination using machine learning.

Machine Learning

The computer is presented with example inputs and their desired outputs, given by a teacher, and the goal is to learn a general rule that maps inputs to outputs.

The essence of machine learning:
- A pattern exists.
- We cannot pin it down mathematically.
- We have data on it.

Example of machine learning

Question: 1 or 5?
Features: intensity, symmetry

Training:

  Intensity  Symmetry  1 or 5
  3.3        0.5       5
  0.8        4.5       1
  5.6        0.3       5

Classifier -> Evaluation (classifier output for these rows: 1, 5):

  Intensity  Symmetry  1 or 5
  0.6        3.8       ?
  2.7        1.2       ?
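As a concrete sketch of the slide's example, the training rows can be fed to an off-the-shelf classifier. The choice of scikit-learn's RandomForestClassifier here is purely illustrative (the analysis in this talk uses an RFBDT implementation, not scikit-learn):

```python
# Illustrative sketch of the 1-vs-5 example; classifier choice is an assumption.
from sklearn.ensemble import RandomForestClassifier

# Training rows from the slide: (intensity, symmetry) -> digit label
X_train = [[3.3, 0.5], [0.8, 4.5], [5.6, 0.3]]
y_train = [5, 1, 5]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

# Evaluation rows from the slide, labels unknown ("?")
X_eval = [[0.6, 3.8], [2.7, 1.2]]
print(clf.predict(X_eval))
```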

Use machine learning for signal-to-noise discrimination

Question: signal (1) or noise (0)?
Features:

Training:

  mass1  SNR   ...  0/1
  1.5    18.5  0.9  1
  15.7   35.6  3.5  1
  7.2    5.4   4.2  0

Classifier -> Evaluation:

  mass1  SNR   ...  0/1
  27.8   37.3  3.8  ?
  3.4    6.5   6.7  ?

[Plot: background and signal events]

gstlal pipeline

Workflow of machine learning in ranking

Training data

Signals: simulated signal injections are used to train as signals.
Background: coincident triggers constructed from single-detector triggers are used to train as background.

H1:

  t    m1    m2    s1    s2
  1.3  4.2   3.6   -0.2  -0.3
  3.7  5.3   1.9   -0.1   0.1
  4.2  10.3  6.4    0.8   0.0
  4.9  27.3  17.5   0.1   0.4
  5.8  3.2   3.1   -0.4  -0.9
  7.4  8.9   6.4    0.2   0.3

L1:

  t    m1    m2    s1    s2
  0.3  8.7   3.4   -0.4  -0.5
  1.6  3.2   3.1   -0.4  -0.9
  2.9  6.9   3.2    0.2  -0.6
  3.2  15.7  12.4   0.8   0.9
  4.5  5.3   1.9   -0.1   0.1
  6.4  35.4  18.5  -0.3  -0.2

Learning algorithm (Classifier)

- Artificial Neural Network
- Support Vector Machine
- Random Forest

Decision Tree

- Start from the training set; find a feature and threshold that optimize some criterion (e.g. event > 4).
- Split the set on that threshold and repeat on each subset (e.g. > 8).
- A node becomes a leaf when splitting no longer optimizes the criterion, or when the number of events at the node falls below a minimum.
- A leaf contains both signals and background.

Random Forest of Bootstrap-aggregated Decision Trees (RFBDT) algorithm

- A single decision tree is a weak classifier; many trees (a forest) make a good classifier.
- Bootstrap AGGregatING (bagging): each tree trains on a bootstrap subset of the whole data set.
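A minimal illustration of bagging, using one-feature threshold stumps as the weak classifiers. Both the stump and the SNR-like numbers are invented for illustration; the real pipeline uses a full RFBDT implementation:

```python
# Bagging sketch: many weak classifiers trained on bootstrap samples,
# combined by majority vote. Illustrative only; not the talk's RFBDT code.
import random

def train_stump(data):
    """Pick the threshold on feature 0 that best separates the sample."""
    best = None
    for x, _ in data:
        thr = x[0]
        acc = sum((xi[0] > thr) == yi for xi, yi in data) / len(data)
        if best is None or acc > best[1]:
            best = (thr, acc)
    return best[0]

def bagged_predict(data, x, n_trees=25, seed=0):
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_trees):
        # Bootstrap: resample the training set with replacement.
        boot = [rng.choice(data) for _ in data]
        thr = train_stump(boot)
        votes += x[0] > thr
    # Aggregate: majority vote over all trees.
    return int(votes > n_trees / 2)

# Toy events: ([SNR], label) with signal = 1, background = 0.
events = [([18.5], 1), ([35.6], 1), ([5.4], 0), ([6.5], 0)]
print(bagged_predict(events, [30.0]))  # loud event -> signal
print(bagged_predict(events, [4.0]))   # quiet event -> background
```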

Ranking statistic from RFBDT

Probability of an event given signal: the fraction of training events in the leaf that are signals. For a leaf with 8 signals and 3 background events, p = 8/11 ≈ 0.73.
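In code, the leaf probability on the slide is just the signal fraction of the training events that land in the leaf:

```python
# Ranking statistic sketch: fraction of leaf events that are signal.
def leaf_signal_probability(n_signal, n_background):
    return n_signal / (n_signal + n_background)

# The slide's leaf: 8 signals, 3 background events.
p = leaf_signal_probability(8, 3)
print(round(p, 2))  # 0.73
```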

Tuning of RFBDT

  Parameter                   Description
  Features                    Characteristics of an event.
  Number of decision trees    The number of decision trees in a forest.
  Number of sampled features  The number of parameters chosen randomly to form a subset of the original feature vector.
  Minimal entries per leaf    When the number of events in a node reaches the minimum leaf size, the data stop splitting into two nodes and the node becomes a leaf.

Binary Classification

A threshold (e.g. 0.4) on the classifier output assigns each event a class:

                           True class: Background (0)   True class: Signal (1)
  Classified signal (1)    False Positive (FP)          True Positive (TP)
  Classified background (0) True Negative (TN)          False Negative (FN)

False alarm probability: FP / (FP + TN)
True positive probability (efficiency): TP / (TP + FN)
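A small sketch of the two rates named on this slide, computed from confusion-matrix counts (the example counts are invented, not from the slides):

```python
# False alarm probability and efficiency from confusion-matrix counts.
def false_alarm_probability(fp, tn):
    """Fraction of background events classified as signal."""
    return fp / (fp + tn)

def efficiency(tp, fn):
    """True positive probability: fraction of signals recovered."""
    return tp / (tp + fn)

# Illustrative counts (assumed, not from the slides):
print(false_alarm_probability(fp=5, tn=95))  # 0.05
print(efficiency(tp=80, fn=20))              # 0.8
```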

Expand the feature space by transforming to features with physical meaning

Comparison using ROC curve

[ROC curves comparing classifiers trained with 9, 12, 14, and 16 features]

Additional tuning

  Number of trees             100
  Number of sampled features  4
  Minimal entries per leaf    5
  Optimization criterion      Gini index

[Plots: tuning of the number of trees, minimal entries per leaf, number of sampled features, and optimization criterion]
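For reference, the tuned values above map naturally onto scikit-learn's analogous RandomForestClassifier parameters. This is an illustrative translation, since the actual analysis used an RFBDT implementation rather than scikit-learn:

```python
# Illustrative scikit-learn analogue of the tuned RFBDT settings.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    max_features=4,       # number of features sampled at each split
    min_samples_leaf=5,   # minimal entries per leaf
    criterion="gini",     # optimization criterion: Gini index
)
```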

Pipeline for calculating likelihood from RFBDT

Future work

- Compare the performance of the RFBDT ranking with the likelihood-ratio ranking currently used in gstlal.
- Find a systematic way to select features, e.g. ReliefF.
- Principal component analysis (PCA): extract linear transformations of the original features.
- Choose RFBDT options automatically by validation in the pipeline.
- Include data-quality information from other channels.

Acknowledgements

I would like to thank my mentors, Tjonnie, Alan, Surabhi, and Kent, as well as Prof. Chu, the LIGO Scientific Collaboration, Caltech SURF, NSF, and the Department of Physics, CUHK.

Backup Slides

Small in-sample error, huge out-of-sample error

Data: a 2nd-order polynomial plus noise, fit with a 10th-order polynomial. This is overtraining/overfitting: the fitted curve passes through all points, so the in-sample error is 0, but the out-of-sample error is huge. Poor generalization.
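The backup slide's demonstration can be reproduced in a few lines, assuming NumPy; the noise level and sample sizes here are illustrative choices:

```python
# Overfitting sketch: 2nd-order data + noise, fit with a 10th-order polynomial.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 11)
y_train = x_train**2 + 0.1 * rng.standard_normal(x_train.size)

# 11 points, 11 coefficients: the fit interpolates every training point.
coeffs = np.polyfit(x_train, y_train, deg=10)

# In-sample error is essentially zero.
in_sample = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# Out-of-sample error on fresh draws from the same process is much larger.
x_test = np.linspace(-0.95, 0.95, 200)
y_test = x_test**2 + 0.1 * rng.standard_normal(x_test.size)
out_sample = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(in_sample < out_sample)  # poor generalization
```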

Optimization Criterion: Gini index

The Gini index is large when a node contains roughly equal amounts of signals and background. For classification we want only one class (signal or background) at a node, so we choose splits that minimize the Gini index.
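For a two-class node the Gini index is 1 - p^2 - (1-p)^2 = 2p(1-p), where p is the signal fraction at the node; a quick sketch:

```python
# Binary-class Gini impurity: maximal for a 50/50 node, zero for a pure node.
def gini(n_signal, n_background):
    p = n_signal / (n_signal + n_background)
    return 2 * p * (1 - p)  # equals 1 - p**2 - (1 - p)**2

print(gini(5, 5))   # 0.5 -- maximally mixed node
print(gini(10, 0))  # 0.0 -- pure node
```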

Tunable options in RFBDT
