MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015

INTRO NOTES 4 days, two lectures and two practice seminars every day; this is an introductory track to machine learning; Kaggle competition!

WHAT IS ML ABOUT? Inference of statistical dependencies which give us the ability to predict. Data is cheap, knowledge is precious.

WHERE IS ML CURRENTLY USED? Search engines, spam detection; security: virus detection, DDoS defense; computer vision and speech recognition; market basket analysis, customer relationship management (CRM); credit scoring, fraud detection; health monitoring; churn prediction... and hundreds more.

ML IN HIGH ENERGY PHYSICS High-level triggers (LHCb trigger system: from 40 MHz down to 5 kHz); particle identification; tagging; stripping line; analysis. Different data is used at different stages.

GENERAL NOTION In supervised learning the training data is represented as a set of pairs $(x_i, y_i)$, where $i$ is the index of an event, $x_i$ is the vector of features available for the event, and $y_i$ is the target, the value we need to predict.

CLASSIFICATION EXAMPLE Classification: $y_i \in Y$, where $Y$ is a finite set. On the plot: $x_i \in \mathbb{R}^2$, $Y = \{0, 1, 2\}$. Examples: defining the type of particle (or decay channel). $Y = \{0, 1\}$ is binary classification; 1 is signal, 0 is background.

REGRESSION $y \in \mathbb{R}$. Examples: predicting the price of a house from its position; predicting the number of customers / money income; reconstructing the real momentum of a particle. Why do we need automatic classification/regression? In applications there are up to thousands of features; higher quality; much faster adaptation to new problems.

CLASSIFICATION BASED ON NEAREST NEIGHBOURS Given a training set of objects and their labels $\{x_i, y_i\}$, we predict the label for a new observation $x$: $y = y_j$, where $j = \arg\min_i \rho(x, x_i)$.

VISUALIZATION OF DECISION RULE

k NEAREST NEIGHBOURS A better way is to use $k$ neighbours: $p_i(x) = \frac{\#\text{ of kNN events in class } i}{k}$
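
A minimal NumPy sketch of this rule (the helper `knn_proba` and the toy data are illustrative, not from the lecture); for k = 1 it reduces to the single-nearest-neighbour rule from the previous slide:

```python
import numpy as np

def knn_proba(x, X_train, y_train, k=5):
    """Estimate p_i(x) as the fraction of the k nearest training events in class i."""
    # Euclidean distances from x to every training event
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k closest events
    nearest = np.argsort(dists)[:k]
    classes = np.unique(y_train)
    # fraction of neighbours belonging to each class
    return {c: np.mean(y_train[nearest] == c) for c in classes}

# toy usage: two 2D classes
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_proba(np.array([2.0, 2.0]), X, y, k=5))
```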

k = 1, 2, 5, 30

OVERFITTING What is the quality of classification on the training dataset when k = 1? Answer: it is ideal (the closest neighbour is the event itself). Quality is lower when k > 1. This doesn't mean k = 1 is the best; it means we cannot use training events to estimate quality. When a classifier's decision rule is too complex and captures details from the training data that are not relevant to the distribution, we call this overfitting (more details tomorrow).

KNN REGRESSOR Regression with nearest neighbours is done by averaging the outputs: $y = \frac{1}{k} \sum_{j \in \text{knn}(x)} y_j$
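
A matching sketch of the averaging rule, under the same illustrative conventions as the classifier above:

```python
import numpy as np

def knn_regress(x, X_train, y_train, k=5):
    """Predict y(x) as the mean target of the k nearest training events."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```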

KNN WITH WEIGHTS

COMPUTATIONAL COMPLEXITY Given that the dimensionality of the space is $d$ and there are $n$ training samples: training time ~ O(save a link to data); prediction time ~ $O(n \cdot d)$ for each sample.

SPATIAL INDEX: BALL TREE

BALL TREE training time ~ $O(d \cdot n \log(n))$; prediction time ~ $O(d \cdot \log(n))$ for each sample. Other options exist: KD-tree.
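
In practice one would rarely code the index by hand; a sketch using scikit-learn's KNeighborsClassifier, which can build a ball tree internally (toy data, illustrative parameters):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# toy dataset just for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'ball_tree' asks scikit-learn to build the spatial index described above
clf = KNeighborsClassifier(n_neighbors=30, algorithm='ball_tree')
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))          # accuracy on held-out events
print(clf.predict_proba(X_test[:5]))      # class fractions among the 30 neighbours
```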

OVERVIEW OF KNN 1. Awesomely simple classifier and regressor 2. Has too-optimistic quality on training data 3. Quite slow, though optimizations exist 4. Struggles with high-dimensional data 5. Too sensitive to the scale of features

SENSITIVITY TO SCALE OF FEATURES Euclidean distance: $\rho(x, y)^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_d - y_d)^2$. Change the scale of the first feature by a factor of 10: $\rho(x, y)^2 = (10x_1 - 10y_1)^2 + (x_2 - y_2)^2 + \dots + (x_d - y_d)^2 \approx 100(x_1 - y_1)^2$. Scaling of features frequently increases quality.
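
A small illustration of both points: how rescaling one feature inflates the Euclidean distance, and how standardizing features (here with scikit-learn's StandardScaler) removes that arbitrariness. All numbers are made up for the demonstration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 1.0, 4.0])

# squared distance before and after blowing up the first feature by a factor of 10
d_before = np.sum((x - y) ** 2)
x_s, y_s = x.copy(), y.copy()
x_s[0] *= 10
y_s[0] *= 10
d_after = np.sum((x_s - y_s) ** 2)
print(d_before, d_after)     # the first feature now dominates the distance

# standardizing features (zero mean, unit variance) removes this arbitrariness
X = np.random.RandomState(0).normal(size=(100, 3)) * [10.0, 1.0, 0.1]
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.std(axis=0))  # ~1 for every feature
```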

DISTANCE FUNCTION MATTERS Minkowski distance: $\rho_p(x, y) = \sum_i |x_i - y_i|^p$; Canberra: $\rho(x, y) = \sum_i \frac{|x_i - y_i|}{|x_i| + |y_i|}$; Cosine metric: $\rho(x, y) = \frac{\langle x, y \rangle}{\|x\|\,\|y\|}$
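
These distances are available in scipy.spatial.distance; a quick check on made-up vectors:

```python
import numpy as np
from scipy.spatial.distance import minkowski, canberra, cosine

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 3.0])

print(minkowski(x, y, p=3))  # Minkowski distance with p = 3
print(canberra(x, y))        # sum of |x_i - y_i| / (|x_i| + |y_i|)
print(cosine(x, y))          # 1 - <x, y> / (|x| |y|), i.e. cosine *distance*
```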

x MINUTES BREAK

RECAPITULATION 1. Statistical ML: problems 2. ML in HEP 3. k nearest neighbours classifier and regressor.

MEASURING QUALITY OF BINARY CLASSIFICATION The classifier's output in binary classification is a real variable. Which classifier is better? All of them are identical.

ROC CURVE These distributions have the same ROC curve (the ROC curve is the dependency of passed signal vs passed background).

ROC CURVE DEMONSTRATION

ROC CURVE Contains important information: all possible combinations of signal and background efficiencies you may achieve by setting a threshold. Particular values of thresholds (and initial pdfs) don't matter; the ROC curve doesn't contain this information. ROC curve = information about the order of events: s s b s b... b b s b b. Comparison of algorithms should be based on information from the ROC curve.
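
A short sketch of building a ROC curve from classifier scores with scikit-learn's roc_curve; the scores below are toy Gaussian samples, purely for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.RandomState(0)
# toy scores: background around 0, signal around 1
y_true = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])

# fpr = background efficiency, tpr = signal efficiency, one point per threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(fpr[:5], tpr[:5])
```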

TERMINOLOGY AND CONVENTIONS fpr = background efficiency (fraction of background passing the threshold); tpr = signal efficiency (fraction of signal passing the threshold).

ROC AUC (AREA UNDER THE ROC CURVE) ROC AUC = $P(x < y)$, where $x$, $y$ are predictions of random background and signal events.
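
This probabilistic interpretation is easy to verify numerically; a sketch comparing scikit-learn's roc_auc_score with the fraction of (background, signal) pairs ordered correctly, on toy scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
bck = rng.normal(0.0, 1.0, 500)   # predictions for background events
sig = rng.normal(1.0, 1.0, 500)   # predictions for signal events

y_true = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([bck, sig])

# area under the ROC curve ...
auc = roc_auc_score(y_true, scores)
# ... equals the probability that a random background score is below a random signal score
p_less = np.mean(bck[:, None] < sig[None, :])
print(auc, p_less)   # the two numbers agree (up to ties)
```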

Which classifier is better for triggers? (they have the same ROC AUC)

STATISTICAL MACHINE LEARNING The machine learning we use in practice is based on statistics. 1. Main assumption: the data is generated from a probabilistic distribution $p(x, y)$ 2. Does the distribution of people / pages really exist? 3. In HEP these distributions do exist

OPTIMAL CLASSIFICATION. OPTIMAL BAYESIAN CLASSIFIER Assuming that we know the real distribution $p(x, y)$, we reconstruct $p(y \mid x)$ using Bayes' rule: $p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\,p(x \mid y)}{p(x)}$, so $\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}$. LEMMA (NEYMAN-PEARSON): the best classification quality is provided by $\frac{p(y=1 \mid x)}{p(y=0 \mid x)}$ (optimal Bayesian classifier).

OPTIMAL BINARY CLASSIFICATION The optimal Bayesian classifier has the highest possible ROC curve. Since the classification quality depends only on the order of events, $p(y=1 \mid x)$ gives optimal classification quality too! $\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = \frac{p(y=1)\,p(x \mid y=1)}{p(y=0)\,p(x \mid y=0)}$
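
A toy sketch of the optimal Bayesian classifier when the class-conditional densities and priors are known 1D Gaussians (all values assumed for illustration):

```python
import numpy as np
from scipy.stats import norm

# assume we *know* the true class-conditional densities and priors
p_sig, p_bck = 0.5, 0.5
sig_pdf = norm(loc=1.0, scale=1.0).pdf    # p(x | y = 1)
bck_pdf = norm(loc=0.0, scale=1.0).pdf    # p(x | y = 0)

def likelihood_ratio(x):
    """p(y=1|x) / p(y=0|x) via Bayes' rule; thresholding it is the optimal classifier."""
    return (p_sig * sig_pdf(x)) / (p_bck * bck_pdf(x))

x = np.linspace(-3, 4, 8)
print(likelihood_ratio(x))   # monotonically increasing in x for these two Gaussians
```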

FISHER'S QDA (QUADRATIC DISCRIMINANT ANALYSIS) Reconstructing the probabilities $p(x \mid y=1)$, $p(x \mid y=0)$ from data, assuming those are multidimensional normal distributions: $p(x \mid y=0) \sim \mathcal{N}(\mu_0, \Sigma_0)$, $p(x \mid y=1) \sim \mathcal{N}(\mu_1, \Sigma_1)$

QDA COMPLEXITY $n$ samples, $d$ dimensions. Training takes $O(n d^2 + d^3)$: $O(n d^2)$ for computing the covariance matrix and $O(d^3)$ for inverting it. Prediction takes $O(d^2)$ for each sample: $f(x) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
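
scikit-learn ships a QDA implementation; a minimal usage sketch on toy data (dataset and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fits a Gaussian (mu_i, Sigma_i) per class and classifies via the likelihood ratio
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
print(qda.score(X_test, y_test))       # accuracy
print(qda.predict_proba(X_test[:3]))   # p(y | x) from Bayes' rule
```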

QDA Pros: simple decision rule; fast prediction. Cons: many parameters to reconstruct in high dimensions; data almost never has a Gaussian distribution.

WHAT ARE THE PROBLEMS WITH THE GENERATIVE APPROACH? Generative approach: trying to reconstruct $p(x, y)$, then use it to predict. Real-life distributions can hardly be reconstructed, especially in high-dimensional spaces. So, we switch to the discriminative approach: guessing $p(y \mid x)$.

LINEAR DECISION RULE The decision function is linear: $d(x) = \langle w, x \rangle + w_0$; if $d(x) > 0$, class $+1$; if $d(x) < 0$, class $-1$. This is a parametric model (finding parameters $w$, $w_0$).

FINDING OPTIMAL PARAMETERS A good initial guess: find $w$, $w_0$ such that the classification error is minimal ($[\text{true}] = 1$, $[\text{false}] = 0$): the number of errors $= \sum_{i \in \text{events}} [y_i \neq \mathrm{sgn}(d(x_i))]$. Discontinuous optimization (arrrrgh!). Let's make the decision rule smooth: $p_{+1}(x) = f(d(x))$, $p_{-1}(x) = 1 - p_{+1}(x)$, with $f(0) = 0.5$, $f(x) > 0.5$ if $x > 0$, $f(x) < 0.5$ if $x < 0$.

LOGISTIC FUNCTION $\sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$: a smooth step rule. PROPERTIES 1. monotonic, $\sigma(x) \in (0, 1)$ 2. $\sigma(x) + \sigma(-x) = 1$ 3. $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ 4. $2\sigma(x) = 1 + \tanh(x/2)$
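
A quick numerical check of these properties (sketch; the helper `sigma` is just the formula above):

```python
import numpy as np

def sigma(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(np.allclose(sigma(x) + sigma(-x), 1.0))              # property 2
# property 3: derivative via finite differences vs sigma * (1 - sigma)
eps = 1e-6
num_deriv = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)
print(np.allclose(num_deriv, sigma(x) * (1 - sigma(x))))
print(np.allclose(2 * sigma(x), 1 + np.tanh(x / 2)))       # property 4
```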

LOGISTIC FUNCTION

LOGISTIC REGRESSION Optimizing the log-likelihood (with probabilities obtained with the logistic function): $d(x) = \langle w, x \rangle + w_0$, $p_{+1}(x) = \sigma(d(x))$, $p_{-1}(x) = \sigma(-d(x))$, $\mathcal{L} = -\frac{1}{N} \sum_{i \in \text{events}} \ln\left(p_{y_i}(x_i)\right) = \frac{1}{N} \sum_i L(x_i, y_i) \to \min$. Exercise: find the expression and build a plot for $L(x_i, y_i)$.
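
For practice, scikit-learn's LogisticRegression minimizes this kind of negative log-likelihood (with regularization added by default); a minimal usage sketch on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# linear decision function d(x) = <w, x> + w0, probabilities via the logistic function
clf = LogisticRegression()
clf.fit(X_train, y_train)
proba_signal = clf.predict_proba(X_test)[:, 1]   # p(y = 1 | x)
print(roc_auc_score(y_test, proba_signal))
print(clf.coef_, clf.intercept_)                 # w and w0
```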

DATA SCIENTIST PIPELINE 1. Experiments in an appropriate high-level language or environment 2. After experiments are over, implement the final algorithm in a low-level language (C++, CUDA, FPGA). The second point is not always needed.

SCIENTIFIC PYTHON NumPy: vectorized computations in Python; Matplotlib: for drawing; Pandas: for data manipulation and analysis (based on NumPy).

SCIENTIFIC PYTHON Scikit-learn: the most popular library for machine learning; SciPy: libraries for science and engineering; Root_numpy: a convenient way to work with ROOT files.

THE END