Maximum Likelihood vs. Least Squares


                          Maximum Likelihood               Least Squares
Precondition              pdf exactly known                Mean and variance known
Basis                     Height of pdf                    Deviation from mean
Efficiency                Maximal                          Maximal among linear estimators
Complexity                Complicated, mostly non-linear   For linear models exactly solvable
Robustness                No (tail modelling)              No (tails)
Correlated measurements   Difficult                        Easy, with covariance matrix
Special case              Identical for Gaussian errors
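
The two columns can be written out concretely. As a minimal sketch (standard textbook definitions, not from the slide itself):

\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{n} f(x_i;\theta)
\qquad \text{(maximise the height of the exactly known pdf)}

\hat{\theta}_{\mathrm{LS}} = \arg\min_{\theta} \chi^2(\theta), \qquad
\chi^2(\theta) = \sum_{i=1}^{n} \frac{\bigl(y_i-\mu_i(\theta)\bigr)^2}{\sigma_i^2}
\qquad \text{(minimise the deviation from the mean)}

For Gaussian errors, -2\ln\mathcal{L}(\theta) = \chi^2(\theta) + \mathrm{const}, so the two estimators coincide; this is the special case in the last row.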

Monte Carlo

[Figure: comparison of point sets — quasirandom numbers, pseudorandom numbers, regular grid]

NeuroBayes in and outside the Ivory Tower of High Energy Physics
Particle Physics Seminar, University of Bonn, January 13, 2011
Michael Feindt, IEKP / KCETA, Karlsruhe Institute of Technology (KIT); Scientific Advisor, Phi-T GmbH
KIT: University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association, www.kit.edu

Agenda of this talk:
> What is NeuroBayes? Where does it come from?
> How to use NeuroBayes classification
> Robustness, speed, generalisation ability, ease of use
> NeuroBayes output is a Bayesian posterior probability
> How to find data-Monte Carlo disagreements with NeuroBayes
> How to train NeuroBayes from data when no good MC model is available
> Examples of successful NeuroBayes applications in physics
> Full B reconstruction at B factories
> Examples from industry applications

History of NeuroBayes
> 1993-2003: M.F. & co-workers gain experience with neural networks in DELPHI; development of many packages: ELEPHANT, MAMMOTH, BSAURUS etc.
> 1999: Invention of the NeuroBayes algorithm
> 1997-now: extensive use of NeuroBayes in CDF II
> 2000-2002: NeuroBayes specialisation for economic applications at the University of Karlsruhe, supported by the BMBF
> 2002: Phi-T GmbH founded; industrial projects, further developments
> 2008: Foundation of sub-company Phi-T products & services; second office in Hamburg
> 2008-now: extensive use of NeuroBayes in Belle
> 2010: LHCb decides to use NeuroBayes massively to optimise reconstruction code
> 2011: Relaunch as Blue Yonder
Phi-T owns the exclusive rights to NeuroBayes. Staff (currently 50): almost all physicists (mainly from HEP). Continuous further development of NeuroBayes.

Successful in competition with other data-mining methods. World's largest student competition, the Data-Mining-Cup:
> 2005: fraud detection in internet trading
> 2006: price prediction in eBay auctions
> 2007: coupon redemption prediction
> 2008: lottery customer behaviour prediction

Since 2009: new rules, only up to 2 teams per university.
> 2009 task: prognosis of the turnover of 8 books in 2500 book stores. Winner: Uni Karlsruhe II, with the help of NeuroBayes.
> 2010 task: optimisation of individual customer care measures in an online shop. Winner: Uni Karlsruhe II, with the help of NeuroBayes.

NeuroBayes task 1: Classification
Classification: binary targets, each single outcome will be yes or no. The NeuroBayes output is the Bayesian posterior probability that the answer is yes (given that the inclusive rates are the same in training and test sample; otherwise a simple transformation is necessary). Examples:
> This elementary particle is a K meson.
> This jet is a b-jet.
> This three-particle combination is a D+.
> This event is real data and not Monte Carlo.
> This neutral B meson was a particle and not an antiparticle at production time.
> Customer Meier will cancel his contract next year.

NeuroBayes task 2: Conditional probability densities
Probability density f(t|x) for real-valued targets: for each possible (real) value t a probability (density) is given. From it all statistical quantities like mean value, median, mode, standard deviation etc. can be deduced: the expectation value, the standard deviation (volatility), the mode, and deviations from the normal distribution, e.g. a crash probability. Examples:
> Energy of an elementary particle (e.g. a semileptonically decaying B meson with a missing neutrino)
> Q value (invariant mass) of a decay
> Lifetime of a decay
> Phi direction of an inclusively reconstructed B meson in a jet
> Turnover of an article next year (very important in industrial applications)
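
As a brief sketch of how the listed quantities follow from the conditional density (standard definitions, added for clarity):

E[t\,|\,x] = \int t\, f(t\,|\,x)\, dt \qquad \text{(expectation value)}

\sigma^2(t\,|\,x) = \int \bigl(t - E[t\,|\,x]\bigr)^2 f(t\,|\,x)\, dt \qquad \text{(variance; its square root is the volatility)}

t_{\mathrm{mode}} = \arg\max_t f(t\,|\,x) \qquad \text{(mode)}

\int_{-\infty}^{t_{\mathrm{med}}} f(t\,|\,x)\, dt = \tfrac{1}{2} \qquad \text{(median)}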

One way to construct a one-dimensional test statistic from multidimensional input (an MVA method): neural networks, self-learning procedures copied from nature. [Figure: human brain with regions labelled frontal lobe, motor cortex, parietal cortex, temporal lobe, occipital lobe, cerebellum, brain stem]

Neural networks
The NeuroBayes classification core is based on a simple feed-forward neural network. The information (the knowledge, the expertise) is coded in the connections between the neurons. Each neuron performs a fuzzy decision. A neural network can learn from examples.
Human brain: about 100 billion (10^11) neurons and about 100 trillion (10^14) connections. NeuroBayes: 10 to a few 100 neurons.

Neural Network basic functions

Neural network transfer functions
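
The slide itself only shows plots. As an illustrative sketch of such a transfer function and of one neuron's fuzzy decision, assuming a symmetric sigmoid with output range (-1, 1) to match the network output convention used later in this talk (the actual NeuroBayes internals are not shown here):

#include <cmath>
#include <cstddef>
#include <vector>

// Symmetric sigmoid transfer function mapping any activation to (-1, 1).
// The exact internal choice of NeuroBayes is an assumption here.
double transfer(double x) { return 2.0 / (1.0 + std::exp(-x)) - 1.0; }

// One neuron: weighted sum of inputs plus bias, passed through the transfer.
double neuron(const std::vector<double>& in,
              const std::vector<double>& w, double bias)
{
    double act = bias;
    for (std::size_t i = 0; i < in.size(); ++i) act += w[i] * in[i];
    return transfer(act);   // fuzzy decision in (-1, 1)
}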

NeuroBayes classification chain: input -> preprocessing -> significance control -> postprocessing -> output.
NeuroBayes Teacher: learning of complex relationships from existing data bases (e.g. Monte Carlo).
NeuroBayes Expert: prognosis for unknown data.

How it works: training and application
Training: historic or simulated data (data set: a = ..., b = ..., c = ..., ...; t known) go into the NeuroBayes Teacher, which produces an expert system, the expertise.
Application: actual (new real) data (data set: a = ..., b = ..., c = ..., ...; t = ?) go into the NeuroBayes Expert, which returns the probability that a hypothesis is correct (classification) or the probability density f(t|x) for the variable t.

Neural network training
Backpropagation (Rumelhart et al. 1986): calculate the gradient backwards by applying the chain rule, then optimise using the gradient descent method. Step size??
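
As a minimal sketch of one such update (generic gradient descent, not the NeuroBayes implementation; the step size eta is precisely the open question above):

#include <cstddef>
#include <vector>

// One gradient-descent step: w <- w - eta * dE/dw.
// The gradient dE/dw is obtained by backpropagation, i.e. by applying the
// chain rule backwards through the network (Rumelhart et al. 1986).
void gradientStep(std::vector<double>& weights,
                  const std::vector<double>& gradient, // dE/dw_i from backprop
                  double eta)                          // step size: the hard part
{
    for (std::size_t i = 0; i < weights.size(); ++i)
        weights[i] -= eta * gradient[i];
}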

Neural network training
Difficulty: find the global minimum of a highly non-linear function in a high-dimensional (~ >100 dimensions) space. Imagine the task of finding the deepest valley in the Alps (just 2 dimensions): it is easy to find the next local minimum, but globally... impossible! => needs good preconditioning.

NeuroBayes strengths:
NeuroBayes is a very powerful algorithm:
> excellent generalisability (does not overtrain)
> robust: always finds a good solution, even with erratic input data
> fast
> automatically selects significant variables
> output interpretable as Bayesian a posteriori probability
> can train with weights and background subtraction
NeuroBayes is easy to use:
> examples and documentation available
> good default values for all options: fast start!
> direct interface to TMVA available
> inclusion into ROOT planned

<phi-t> NeuroBayes
> is based on 2nd-generation neural network algorithms, Bayesian regularisation, optimised preprocessing with non-linear transformations and decorrelation of the input variables, and linear correlation to the output.
> learns extremely fast due to 2nd-order BFGS methods, and even faster with the 0-iteration mode.
> produces small expertise files.
> is extremely robust against outliers in the input data.
> is immune against learning statistical noise by heart.
> tells you if there is nothing relevant to be learned.
> delivers sensible prognoses already with small statistics.
> can handle weighted events, even with negative weights.
> has advanced boost and cross-validation features.
> is steadily being developed further professionally.

Bayes' theorem: P(T|D) \neq P(D|T), but

P(T\,|\,D) = \frac{P(D\,|\,T)\, P(T)}{P(D)} \qquad \text{(posterior = likelihood } \times \text{ prior / evidence)}

NeuroBayes internally uses Bayesian arguments for regularisation. NeuroBayes automatically makes Bayesian posterior statements.

Teacher code fragment (1)

#include "NeuroBayesTeacher.hh"

// create NeuroBayes instance
NeuroBayesTeacher* nb = NeuroBayesTeacher::Instance();

const int nvar = 14;        // number of input variables
nb->NB_DEF_NODE1(nvar+1);   // nodes in input layer
nb->NB_DEF_NODE2(nvar);     // nodes in hidden layer
nb->NB_DEF_NODE3(1);        // nodes in output layer
nb->NB_DEF_TASK("CLA");     // binomial classification
nb->NB_DEF_ITER(10);        // number of training iterations

nb->SetOutputFile("BsDsPiKSK_expert.nb");   // expertise file name
nb->SetRootFile("BsDsPiKSK_expert.root");   // histogram file name

Teacher code fragment (2)

// in training event loop
nb->SetWeight(1.0);   // set weight of event
nb->SetTarget(0.0);   // set target: this event is BACKGROUND, else set to 1.

InputArray[0] = GetValue(back,"BsPi.Pt");             // define input variables
InputArray[1] = TMath::Abs(GetValue(back,"Bs.D0"));
...
nb->SetNextInput(nvar,InputArray);

// end of event loop
nb->TrainNet();       // perform training

Many options exist, but this simple code usually already gives very good results.

Expert code fragment

#include "Expert.hh"
...
Expert* nb = new Expert("../train/BsDsPiKSK_expert.nb",-2);
...
InputArray[0] = GetValue(signal,"BsPi.Pt");
InputArray[1] = TMath::Abs(GetValue(signal,"Bs.D0"));
...
Netout = nb->nb_expert(InputArray);

[Figure: input variables ordered by relevance (standard deviations of additional information)]

NeuroBayes training output (analysis file)
NeuroBayes output distribution: red = signal, black = background. Signal purity S/(S+B) in bins of the NeuroBayes output: if it lies on the diagonal, then P = (NBout+1)/2 is the probability that the event actually is signal. This shows that NeuroBayes is always well calibrated in the training.

NeuroBayes training output (analysis file)
Purity vs. signal efficiency plot for different NeuroBayes output cuts: the curve should reach as far into the upper right corner as possible. The lower curve comes from cutting the wrong way round.
Signal efficiency vs. total efficiency when cutting at different NeuroBayes outputs (lift chart): the area between the blue curve and the diagonal should be large. The physical region is white. The right diagonal corresponds to randomly sorted events (no individualisation); the left diagonal border to completely correct sorting, first all signal events, then all background.
Gini index: a classification quality measure; the larger, the better.
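
A definition consistent with this description (an assumption; the slide does not spell it out): the ratio of the area between the lift curve and the diagonal to the same area for perfect sorting,

\mathrm{Gini} = \frac{A_{\mathrm{curve}}}{A_{\mathrm{perfect}}} \in [0,1],

which for a binary classifier is equivalent to 2\,\mathrm{AUC} - 1.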

NeuroBayes training output (analysis file)
Correlation matrix of the input variables; the first row/column is the training target.

NeuroBayes training output (analysis file)
Most important input variable, significance: 78 standard deviations. Accepted for the training.
> Probability-integral-transformed input variable distribution: signal, background (this is a binary variable!)
> Signal purity as a function of the input variable (in this case: unordered classes)
> Mean-0, width-1 transformation of the signal purity of the transformed input variable
> Purity-efficiency plot of this variable alone, compared to that of the complete NeuroBayes

NeuroBayes training output (analysis file)
Second most important input variable: alone 67 standard deviations, but only 11 sigma added after the most important variable is taken into account.
> Probability-integral-transformed input variable distribution: signal, background (this is a largely continuous variable!)
> Signal purity as a function of the input variable (in this case: spline fit)
> Mean-0, width-1 transformation of the (fitted) signal purity of the input variable
> Purity-efficiency plot of this variable alone, compared to that of the complete NeuroBayes

NeuroBayes training output (analysis file)
39th most important input variable: alone 17 standard deviations, but only 0.6 sigma added after the more significant variables. Ignored for the training.
> Probability-integral-transformed input variable distribution: signal, background. For 3339 events this input was not available (delta function).
> Signal purity as a function of the input variable (in this case: spline fit + delta)
> Mean-0, width-1 transformation of the (fitted) signal purity of the input variable. Due to the preprocessing, the delta is mapped to 0, not to its purity.

The NeuroBayes output is a linear measure of the Bayesian posterior signal probability:

P_T(S) = \frac{NB + 1}{2}

Signal-to-background ratio in the training set: r_T = S_T / B_T; in the expert set: r_E = S_E / B_E. If the training was performed with a different S/B than actually present in the expert dataset, one can transform the signal probability:

P_E(S) = \left[\, 1 + \left( \frac{1}{P_T(S)} - 1 \right) \frac{r_T}{r_E} \,\right]^{-1}
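
A direct transcription of these formulas as a small helper (a sketch; the names are hypothetical):

#include <cmath>

// Map the NeuroBayes output in (-1, 1) to the posterior signal probability
// for the training S/B, then rescale to the S/B of the expert dataset.
double expertProbability(double nbOut,
                         double rTrain,   // r_T = S_T / B_T in the training set
                         double rExpert)  // r_E = S_E / B_E in the expert set
{
    double pTrain = 0.5 * (nbOut + 1.0);  // P_T(S) = (NB + 1) / 2
    return 1.0 / (1.0 + (1.0 / pTrain - 1.0) * rTrain / rExpert);  // P_E(S)
}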

Hunting data-MC disagreements with NeuroBayes
1. Use data as signal, Monte Carlo as background.
2. Train a NeuroBayes classification network. If the MC model describes the data well, nothing should be learned!
3. Look at the most significant variables of this training. These give a hint where the MC is not good; it could e.g. be the pt spectrum or an invariant mass spectrum (width).
4. Decide whether the effects are due to physics modelling or to detector resolution/efficiency.
5. Reweight the MC by w = (1 + NBout)/(1 - NBout), or produce a more realistic MC, and go to 1.
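
The weight in step 5 is just the posterior odds ratio: since the network output gives P(\mathrm{data}\,|\,x) = (1 + \mathrm{NB})/2 when data are trained as signal against MC,

w(x) = \frac{P(\mathrm{data}\,|\,x)}{P(\mathrm{MC}\,|\,x)} = \frac{(1+\mathrm{NB})/2}{(1-\mathrm{NB})/2} = \frac{1+\mathrm{NB}}{1-\mathrm{NB}}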

Scenario: MC available for the signal, but not for the backgrounds
Idea: take the background from sidebands in data. Check that the network cannot learn the mass (by training left sideband vs. right sideband: remove input variables until this net cannot learn anything any more). Works well if the data-MC agreement is quite good.

Scenario: neither reliable signal nor background Monte Carlo available
Idea: training with background subtraction. Signal: peak region with weight 1, sideband region with weight -1 (statistical subtraction). Background: sideband region with weight 1. Works very well! Although trained just on the Y(1S), it also works for the Y(2S) and Y(3S)!

Example of a data-only training (on the first resonance)

NeuroBayes B_s -> J/ψ Φ selection without MC (2-stage background-subtraction training process):
> all data, soft preselection: input to the first NeuroBayes training
> soft cut on net 1: input to the second NeuroBayes training
> cut on net 2

Exploiting S/B information more efficiently: the sPlot method
Fit signal and background in one distribution of the data (e.g. mass). Compute sPlot weights w_S for signal (may be <0 or >1) as a function of the mass from the fit. Train the NeuroBayes network with each event treated both as signal with weight w_S and as background with weight 1 - w_S. A soft cut on the output enriches S/B considerably. Make sure the network cannot learn the mass! (Paper in preparation.)
[Figure: CDF Run II preliminary mass distribution, candidates per 0.5 MeV/c^2 vs. mass in GeV/c^2 (range 2.26-2.31), S = 110253, B = 1338453]
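
In terms of the Teacher interface from the code fragments above, the double filling could look like this (a sketch; sWeight stands in for the weight function obtained from the mass fit):

// inside the training event loop: each candidate is filled twice
double wS = sWeight(mass);           // hypothetical helper: sPlot signal weight w_S(m) from the fit

nb->SetWeight(wS);                   // signal hypothesis, weight w_S
nb->SetTarget(1.0);
nb->SetNextInput(nvar, InputArray);

nb->SetWeight(1.0 - wS);             // background hypothesis, weight 1 - w_S
nb->SetTarget(0.0);
nb->SetNextInput(nvar, InputArray);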

More than 60 diploma and Ph.D. theses and many publications from the experiments DELPHI, CDF II, AMS, CMS and Belle used NeuroBayes or its predecessors very successfully. ATLAS and LHCb applications are also starting. Many of these can be found at www.neurobayes.de. Talks about NeuroBayes and applications: www-ekp.physik.uni-karlsruhe.de/~feindt, under "Forschung".

Some NeuroBayes highlights:
> B_s oscillations
> Discovery of excited B_s states
> X(3872) properties
> Single top quark production discovery
> High mass Higgs exclusion

Just a few examples: NeuroBayes soft electron identification for CDF II
Thesis U. Kerzel, on the basis of the Soft Electron Collection: much more efficient than cut selection or JetNet with the same inputs. Only after clever preprocessing by hand and a careful choice of learning parameters could these alternatives be made as good as NeuroBayes.

Just a few examples: NeuroBayes selection

Just a few examples: first observation of the B_s1 and the most precise measurement of the B_s2*, selection using NeuroBayes

The Belle B factory ran very successfully 2000-2010. KIT joined the Belle Collaboration in 2008 and introduced NeuroBayes:
> continuum subtraction
> flavour tagging
> particle ID
> S/B selection optimisation
> full B reconstruction
NeuroBayes enhances the efficiency of the flavour tagging calibration reaction B -> D* l ν by 71% at the same purity.

Physics at a B factory (asymmetric e+e- collider at the Y(4S))
The Y(4S) decays into 2 B mesons, almost at rest in the CMS. The decay products of the 2 Bs are not easily distinguishable. There are many thousands of exclusive decay chains per B. Reconstruct as many Bs as possible exclusively (tag side); then all other reconstructed particles belong to the other B (signal side), and the kinematics of the signal side are uniquely determined, which allows missing mass reconstruction.

Hierarchical Full Reconstruction

Example: D0 signals of very different purity with/without the NB cut

Optimal combination of decay channels of very different purity using the NeuroBayes outputs: precuts are chosen such that the number of additional background events per additional signal event is constant.

Full reconstruction of B mesons in 1042 decay chains: a hierarchical probabilistic reconstruction system with 71 NeuroBayes networks, fully automatic (NeuroBayes Factory). The B+ efficiency is increased by ~104% at the same (total) purity compared to the classical algorithm; this corresponds to many years of additional data taking.

Alternatively one can make the sample cleaner, e.g. go to the same background level: the B+ efficiency is increased by +88% at the same background level. (Real data plots, about 8% of the full data set; curves: signal and background using NeuroBayes, signal with the classical algorithm.)

Alternatively one can make the sample much cleaner, e.g. keep the same signal efficiency as the classical algorithm: the B+ background is suppressed by a factor of 17! (Curves: background with the classical algorithm, background and signal using NeuroBayes.)

First application test on real data: select B0 -> D*+ l ν on the signal side, with a fully reconstructed B on the tag side. Calculate the missing mass squared on the signal side; a peak at 0 from the missing neutrino is expected and seen. The efficiency is more than doubled with the new algorithm!

Flexibility: working with NeuroBayes allows a continuous choice of the working point in the purity-efficiency plane. NIM paper in preparation.

Customers & projects
Very successful projects for (among others):
> BGV and VKB (car insurances)
> Lupus Alpha (asset management)
> Otto Versand (mail order business)
> Thyssen Krupp (steel industry)
> AXA and Central (health insurances)
> dm drogerie markt (drugstore chain)
> Libri (book wholesale)
... expanding

Individual risk prognoses for car insurances:
> accident probability
> cost probability distribution
> large damage prognosis
> contract cancellation probability
Very successful in practice.

Correlation among the input variables, with the target colour-coded (Ramler II plot)

Contract cancellations in a large financial institute: the real cancellation rate as a function of the cancellation rate predicted by NeuroBayes shows very good performance within statistical errors.

Near-future turnover predictions for chain stores:
1. Time series modelling
2. Correction and error estimate using NeuroBayes

Turnover prognosis for mail order business

Typical test results (always very successful for NeuroBayes). Colour codes: better, same, or worse than classical methods. [Figure: matrix of training seasons vs. test seasons]

Prognosis of individual health costs
Pilot project for a large private health insurance: prognosis of the costs in the following year for each insured person, with confidence intervals; 4 years of training, test on the following year. Result: a probability density for each customer/tariff combination (e.g. customer no. 00000, male, 44, tariff XYZ123, insured for about 17 years). Very good test results! This has potential for a real and objective cost reduction in health management.

Prognosis of financial markets (VDI-Nachrichten, 9.3.2007)
A NeuroBayes-based, risk-averse, market-neutral fund for institutional investors: the Lupus Alpha NeuroBayes Short Term Trading Fund, with fully automatic trading (2007-2009: 20 million, since 2010: 130 million). (Börsenzeitung, 6.2.2008)

Licenses
NeuroBayes is commercial software; all rights belong to Phi-T GmbH. It is not open source. CERN, Fermilab and KEK have licenses for use in high energy physics research. The Expert runs without a license (it can run on the grid!); a license is only needed for training networks. For purchasing additional Teacher licenses (for computers outside CERN), please contact Phi-T. Bindings to many programming languages exist, and a code generator for easy usage exists.

Prognosis of sports events from historical data (NeuroNetzer). Results: probabilities for home / tie / guest.

Documentation
Basics:
> M. Feindt, A Neural Bayesian Estimator for Conditional Probability Densities, e-print archive physics/0402093
> M. Feindt, U. Kerzel, The NeuroBayes Neural Network Package, NIM A 559 (2006) 190
Web sites:
> www.phi-t.de and www.blue-yonder.com (research and commercial company websites, German and English)
> www-ekp.physik.uni-karlsruhe.de/~feindt (some NeuroBayes talks can be found here under "Forschung")
> www.neurobayes.de (English site on physics results with NeuroBayes, all diploma and PhD theses using NeuroBayes, and a discussion forum and FAQ for usage in physics; please use this and also post your results here!)

The <phi-t> mouse game, or: even your "free will" is predictable. www.phi-t.de/mousegame