Neural networks in data analysis
Michal Kreps, KIT
ISAPP Summer Institute 2009
Outline
1. What are neural networks
2. Basic principles of NN
3. What they are useful for
4. Training of NN
5. Available packages
6. Comparison to other methods
7. Some examples of NN usage
[Figure: schematic network with input layer, hidden layer and output layer]
I am not an expert on neural networks, only an advanced user who understands the principles.
Goal: help you understand the principles so that:
- neural networks do not look like a black box to you
- you have a starting point to use neural networks yourself
Tasks to solve
Different tasks in nature: some are easy to solve algorithmically, some are difficult or impossible.
Example: reading hand-written text
- a lot of personal specifics
- recognition often has to work with only partial information
- practically impossible to do in an algorithmic way
But our brains can solve such tasks extremely fast: use the brain as inspiration.
Tasks to solve
Why would an (astro-)particle physicist care about recognition of hand-written text? Because it is a classification task on not very regular data: do we see the letter A or H?
Classification tasks appear very often in our analyses: is this event signal or background?
Neural networks have some advantages for this task:
- Can learn from known examples (data or simulation)
- Can evolve into complex non-linear models
- Can easily deal with complicated correlations among the inputs
- Adding more inputs does not require additional examples
Down to principles
A detailed look shows a huge number of neurons; neurons are interconnected, each having many inputs from other neurons.
[Figure: zoom into brain tissue]
To build an artificial neural network:
- Build a model of a neuron
- Connect many neurons together
Model of neuron
- Get inputs $x_i$
- Combine the inputs into a single quantity: $S = \sum_i w_i x_i$
- Pass S through an activation function; most common: $A(x) = \frac{2}{1+e^{-x}} - 1$
- Give the output $A(S)$
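To make the neuron model concrete, here is a minimal sketch in Python (my own illustration, not code from the slides; the symmetric sigmoid A(x) = 2/(1+e^{-x}) - 1 is assumed, which maps the weighted sum to the interval (-1, 1)):

```python
import math

def activation(x):
    # Symmetric sigmoid: maps any real number to (-1, 1)
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def neuron_output(weights, inputs):
    # Combine the inputs into a single quantity S = sum_i w_i * x_i ...
    s = sum(w * x for w, x in zip(weights, inputs))
    # ... and pass it through the activation function
    return activation(s)

# Example: three inputs with hand-picked weights
print(neuron_output([0.5, -1.2, 0.3], [1.0, 0.2, -0.7]))
```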
Building neural network
Connecting neurons can give a complicated structure (as in the brain), or an even more complicated one.
Or it can give simple, regular structures:
[Figure: feed-forward network with input layer, hidden layer and output layer]
Output of a feed-forward NN
Output of node k in the second (hidden) layer: $o_k = A\left(\sum_i w_{ik}^{1\to 2}\, x_i\right)$
Output of node j in the output layer: $o_j = A\left(\sum_k w_{kj}^{2\to 3}\, o_k\right) = A\left[\sum_k w_{kj}^{2\to 3}\, A\left(\sum_i w_{ik}^{1\to 2}\, x_i\right)\right]$
- More layers are usually not needed
- If more layers are involved, the output follows the same logic
- In the following, we stick to three layers as here
- Knowledge is stored in the weights
[Figure: feed-forward network with input layer, hidden layer and output layer]
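As an illustration of the formulas above, a minimal forward pass for a three-layer network might look like the following sketch (not taken from any package; the weight-matrix names w12 and w23 are made up for the example):

```python
import math

def activation(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def forward(x, w12, w23):
    """Output of a 3-layer feed-forward network.
    x   : list of input values x_i
    w12 : w12[i][k] = weight from input node i to hidden node k
    w23 : w23[k][j] = weight from hidden node k to output node j
    """
    n_hidden = len(w12[0])
    n_output = len(w23[0])
    # hidden layer: o_k = A(sum_i w12[i][k] * x_i)
    hidden = [activation(sum(w12[i][k] * x[i] for i in range(len(x))))
              for k in range(n_hidden)]
    # output layer: o_j = A(sum_k w23[k][j] * o_k)
    return [activation(sum(w23[k][j] * hidden[k] for k in range(n_hidden)))
            for j in range(n_output)]

# Toy example: 3 inputs, 2 hidden nodes, 1 output node
w12 = [[0.1, -0.4], [0.7, 0.2], [-0.3, 0.5]]
w23 = [[0.8], [-0.6]]
print(forward([1.0, 0.5, -1.0], w12, w23))
```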
Tasks we can solve
Classification: binary decision whether an event belongs to one of two classes
- Is this particle a kaon or not?
- Is this event a candidate for the Higgs?
- Will Germany become soccer world champion in 2010?
- The output layer contains a single node
Conditional probability density: get a probability density for each event
- What is the decay time of a decay?
- What is the energy of a particle given the observed shower?
- What is the probability that a customer will cause a car accident?
- The output layer contains many nodes
NN work flow
[Figure: typical neural network work flow]
Neural network training
After assembling the neural network, the weights $w_{ik}^{1\to 2}$ and $w_{kj}^{2\to 3}$ are unknown; they are initialised to default or random values.
Before using the neural network for the planned task, we need to find the best set of weights.
We need some measure which tells us which set of weights performs best: the loss function.
Training is the process in which, based on known examples of events, we search for the set of weights which minimises the loss function.
- The most important part of using neural networks
- Bad training can result in bad or even wrong solutions
- Many traps in the process, so one needs to be careful
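To make the idea of training concrete, here is a minimal sketch, not taken from the slides and not how real packages do it: a single neuron (the simplest possible network), a squared-error loss, and a crude finite-difference gradient descent (real packages use back-propagation and more refined minimisers):

```python
import math, random

def activation(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def output(weights, x):
    # the simplest possible "network": a single neuron
    return activation(sum(w * xi for w, xi in zip(weights, x)))

def loss(weights, examples):
    # examples: list of (inputs, target) with target = +1 or -1
    return 0.5 * sum((t - output(weights, x)) ** 2 for x, t in examples)

def train(examples, n_inputs, steps=500, lr=0.1, eps=1e-4):
    weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]  # random start
    for _ in range(steps):
        base = loss(weights, examples)
        # finite-difference gradient: how does the loss change with each weight?
        grad = []
        for i in range(n_inputs):
            shifted = list(weights)
            shifted[i] += eps
            grad.append((loss(shifted, examples) - base) / eps)
        # step downhill in weight space
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

# Toy problem: "signal" events tend to have x[0] > x[1]
examples = [([random.gauss(1, 1), random.gauss(0, 1)], +1) for _ in range(200)]
examples += [([random.gauss(0, 1), random.gauss(1, 1)], -1) for _ in range(200)]
w = train(examples, n_inputs=2)
print("trained weights:", w, "final loss:", loss(w, examples))
```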
Loss function
Practically any function which allows you to decide which neural network is better; in particle physics it can be the one which results in the best measurement.
In practical applications, two functions are commonly used ($T_{ji}$ is the true value, +1 for one class and -1 for the other; $o_{ji}$ is the actual neural network output; $w_j$ is an optional event weight):
- Quadratic loss function: $\chi^2 = \sum_j w_j \chi_j^2 = \tfrac{1}{2}\sum_j w_j \sum_i (T_{ji} - o_{ji})^2$, with $\chi^2 = 0$ for correct decisions and $\chi^2 = 2$ for wrong ones
- Entropy loss function: $E_D = \sum_j w_j E_j^D = -\sum_j w_j \sum_i \log\left(\tfrac{1}{2}(1 + T_{ji}\, o_{ji})\right)$, with $E_D = 0$ for correct decisions and $E_D = \infty$ for wrong ones
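Both loss functions are easy to write down directly; a small sketch, assuming targets T = +/-1, network outputs o in (-1, 1) and optional per-event weights w as on the slide:

```python
import math

def quadratic_loss(targets, outputs, weights=None):
    # chi2 = 1/2 * sum_j w_j * (T_j - o_j)^2 ; 0 for a perfect decision, 2 for a fully wrong one
    weights = weights or [1.0] * len(targets)
    return 0.5 * sum(w * (t - o) ** 2 for w, t, o in zip(weights, targets, outputs))

def entropy_loss(targets, outputs, weights=None):
    # E_D = -sum_j w_j * log(0.5*(1 + T_j*o_j)) ; 0 for correct, diverges for fully wrong
    weights = weights or [1.0] * len(targets)
    return -sum(w * math.log(0.5 * (1.0 + t * o))
                for w, t, o in zip(weights, targets, outputs))

# One well-classified and one badly classified event
print(quadratic_loss([+1, -1], [0.9, 0.8]))
print(entropy_loss([+1, -1], [0.9, 0.8]))
```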
Conditional probability density
Classification is the easy case: we need a single output node, and if the output is above a chosen value the event is class 1 (signal), otherwise class 2 (background).
Now the question is how to obtain a probability density.
Use many nodes in the output layer, each answering the question whether the true value is larger than some threshold.
As an example, let us have n = 10 nodes and a true value of 0.63; node j answers the question whether the value is larger than j/10.
Vector of true values for the output nodes: (1, 1, 1, 1, 1, 1, -1, -1, -1, -1)
The output vector then represents the integrated probability density; after differentiation we obtain the desired probability density.
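The construction of the target vector can be sketched as follows (a hypothetical helper for the 10-node example above):

```python
def target_vector(true_value, n_nodes=10):
    # node j (j = 1..n) is trained to answer: is the true value larger than j/n ?
    # target +1 if yes, -1 if no
    return [+1 if true_value > j / n_nodes else -1 for j in range(1, n_nodes + 1)]

print(target_vector(0.63))   # -> [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
```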
Interpretation of NN output
In classification tasks, a higher NN output in general means a more signal-like event.
The NN output rescaled to the interval [0, 1] can be interpreted as a probability, provided that:
- a quadratic/entropy loss function is used
- the NN is well trained (loss function minimised)
Then the purity $P(o) = N_{\mathrm{selected\ signal}}(o)/N_{\mathrm{selected}}(o)$ should be a linear function of the NN output.
Looking at the expression for the output, $o_j = A\left[\sum_k w_{kj}^{2\to 3}\, A\left(\sum_i w_{ik}^{1\to 2}\, x_i\right)\right]$, the NN is a function which maps the n-dimensional input space to a 1-dimensional space.
[Figure: purity vs. network output, showing the expected linear dependence]
How to check quality of training
Define quantities:
- Signal efficiency: $\epsilon_s = N(\mathrm{selected\ signal}) / N(\mathrm{all\ signal})$
- Efficiency: $\epsilon = N(\mathrm{selected}) / N(\mathrm{all})$
- Purity: $P = N(\mathrm{selected\ signal}) / N(\mathrm{selected})$
Useful plots, each point corresponding to a requirement on the NN output (out > cut); a code sketch for these quantities follows below:
- Distribution of the NN output for the two classes: how good is the separation?
- Purity as a function of the NN output: are we at the minimum of the loss function?
- Signal purity vs. signal efficiency, for the training sample and for a simple S/B curve: the best point is (1,1)
- Signal efficiency vs. efficiency, compared to the best achievable curve and to a random decision
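These quantities are straightforward to compute from the NN output on a labelled test sample; a sketch, assuming outputs in [-1, 1] and boolean truth labels:

```python
def efficiency_purity_scan(outputs, is_signal, n_cuts=20):
    """For a scan of cuts on the NN output, compute signal efficiency,
    overall efficiency and purity of the selected sample."""
    n_all = len(outputs)
    n_all_signal = sum(is_signal)
    points = []
    for i in range(n_cuts + 1):
        cut = -1.0 + 2.0 * i / n_cuts            # scan the cut over [-1, 1]
        selected = [s for o, s in zip(outputs, is_signal) if o > cut]
        n_sel = len(selected)
        n_sel_sig = sum(selected)
        if n_sel == 0:
            continue
        points.append({
            "cut": cut,
            "signal_efficiency": n_sel_sig / n_all_signal,   # eps_s
            "efficiency": n_sel / n_all,                      # eps
            "purity": n_sel_sig / n_sel,                      # P
        })
    return points

# Tiny example with fake outputs and truth labels
outs = [0.9, 0.7, 0.2, -0.1, -0.5, -0.8]
truth = [True, True, True, False, False, False]
for p in efficiency_purity_scan(outs, truth, n_cuts=4):
    print(p)
```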
Issues in training
Main issue: finding the minimum in a high-dimensional space.
Already 5 input nodes with 5 nodes in the hidden layer and one output node gives 5x5 + 5 = 30 weights.
Imagine the task of finding the deepest valley in the Alps (only 2 dimensions): being put at some random place, it is easy to find some minimum. But what is in the next valley?
Issues in training
In training, we start with:
- a large number of weights which are far from their optimal values
- input values in unknown ranges
This gives a large chance of numerical issues in the calculation of the loss function.
Statistical issues come from low statistics in the training sample:
- in many places in physics this is not that much of an issue, as the training statistics can be large
- in industry it can be quite a big issue to obtain a large enough training dataset
- but also in physics it can sometimes be very costly to obtain a large enough training sample
- learning by heart (overtraining), when the NN has enough weights to do that
How to treat issues
All the issues mentioned can in principle be treated by the neural network package:
- Regularisation: for numerical issues at the beginning of training
- Weight decay (forgetting): helps against overtraining on statistical fluctuations; the idea is that effects which are not significant are not shown often enough to keep the weights that remember them (see the sketch below)
- Pruning of connections: remove connections which do not carry significant information
- Preprocessing: avoids numerical issues and helps the search for a good minimum through a good starting point and decorrelation
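Purely as an illustration of the weight-decay idea (not the actual scheme of NeuroBayes or any other package): at each training update every weight is additionally pulled towards zero, so connections that are not regularly reinforced by the data slowly "forget":

```python
def update_weights(weights, gradients, learning_rate=0.1, decay=0.01):
    # ordinary gradient step plus a weight-decay term that shrinks every weight a bit;
    # weights that are not pushed back by the gradient decay away over time
    return [(1.0 - decay) * w - learning_rate * g
            for w, g in zip(weights, gradients)]

w = [2.0, -1.5, 0.05]
g = [-0.3, 0.2, 0.0]          # the last weight gets no gradient push at all
for step in range(3):
    w = update_weights(w, g)
    print(step, [round(x, 4) for x in w])
```

The last weight, which receives no gradient push, decays steadily towards zero, while the others are maintained by the data.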
NeuroBayes preprocessing
1. Bin the input variable into bins of equal statistics (flattening)
2. Calculate the purity for each bin and do a spline fit
3. Use the purity instead of the original value, plus transform to mean 0 and RMS = 1
[Figures: input distribution binned to equal statistics; purity per bin with spline fit; final transformed distribution]
A rough code sketch follows below.
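The following sketch of this kind of preprocessing for one input variable follows the description on the slide rather than the actual NeuroBayes code; a simple per-bin purity lookup stands in for the spline fit:

```python
import random, statistics

def preprocess(values, is_signal, n_bins=10):
    """Map one input variable to its bin purity, then standardise to mean 0, RMS 1."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) // n_bins
    # 1) bins of (roughly) equal statistics, 2) purity in each bin
    purities, edges = [], []
    for b in range(n_bins):
        idx = order[b * per_bin:(b + 1) * per_bin] if b < n_bins - 1 else order[b * per_bin:]
        purities.append(sum(is_signal[i] for i in idx) / len(idx))
        edges.append(values[idx[-1]])                    # upper edge of the bin
    # 3) replace each value by the purity of its bin (placeholder for the spline fit)
    def to_purity(v):
        for edge, p in zip(edges, purities):
            if v <= edge:
                return p
        return purities[-1]
    transformed = [to_purity(v) for v in values]
    # 4) shift and scale to mean 0 and RMS 1
    mu, sigma = statistics.mean(transformed), statistics.pstdev(transformed)
    return [(t - mu) / sigma if sigma > 0 else 0.0 for t in transformed]

# Toy example: signal tends to larger values of the variable
vals = [random.gauss(1 if i % 2 else 0, 1) for i in range(1000)]
sig  = [bool(i % 2) for i in range(1000)]
print(preprocess(vals, sig)[:5])
```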
Electron ID at CDF
Comparing two different NN packages with exactly the same inputs: preprocessing is important and can improve the result significantly.
Available packages (not an exhaustive list)
- NeuroBayes (http://www.phi-t.de): the best I know; commercial, but the fee is small for scientific use; developed originally in a high energy physics experiment
- ROOT/TMVA: part of the ROOT package; can be used for playing, but I would advise using something else for serious applications
- SNNS: Stuttgart Neural Network Simulator (http://www.ra.cs.uni-tuebingen.de/snns/): written in C, but a ROOT interface exists; the best open-source package I know about
NN vs. other multivariate techniques
Several times I have seen statements of the type "my favourite method is better than other multivariate techniques".
- Usually a lot of time is spent to understand and properly train the favourite method
- Other methods are tried just for a quick comparison, without understanding or tuning
- Often you will not even learn any details of how the other methods were set up
With TMVA it is now easy to compare different methods; often I saw that the Fisher discriminant won in those comparisons.
From some tests we found that the NN implemented in TMVA is relatively bad, so it is no wonder that it is usually worse than other methods.
It is usually useful to understand the method you are using; none of them is a black box.
Practical tips
- No need for several hidden layers; one should be sufficient for you
- The number of nodes in the hidden layer should not be very large: risk of learning by heart
- The number of nodes in the hidden layer should not be very small: not enough capacity to learn your problem
- Usually O(number of inputs) nodes in the hidden layer
- Understand your problem: the NN gets numbers and can dig information out of them, but you know the meaning of those numbers
Practical tips
- Don't throw away the knowledge you acquired about the problem before starting to use a NN
- A NN can do a lot, but you don't need it to discover what you already know
- If you already know that a given event is background, throw it away
- If you know about good inputs, use them rather than the raw data from which they are calculated
- Use the neural network to learn what you don't know yet
Physics examples
Only mentioning work done by the Karlsruhe HEP group:
- "Stable" particle identification at Delphi, CDF and now Belle
- Properties (E, φ, θ, Q-value) of inclusively reconstructed B mesons
- Decay time in semileptonic B_s decays
- Heavy quark resonance studies: X(3872), B**, B_s**, Λ_c, Σ_c
- B flavour tagging and B_s mixing (Delphi, CDF, Belle)
- CPV in B_s → J/ψ φ decays (CDF)
- b tagging for high-p_T physics (CDF, CMS)
- Single top and Higgs searches (CDF)
- B hadron full reconstruction (Belle)
- Optimisation of the KEKB accelerator
B_s** at CDF
[Figures, CDF Run 2, 1.0 fb^{-1}: NN output for B_s** → B^+[J/ψ K^+] K^- candidates, signal vs. background; mass difference M(B^+ K^-) - M(B^+) - M(K^-) showing the B_s1 and B_s2* signals]
- Background taken from data, signal from MC
- First observation of the B_s1
- Most precise masses
X(3872) mass at CDF
[Figures, CDF II preliminary, 2.4 fb^{-1}: NN output for signal and background; J/ψππ mass spectrum with total, rejected and selected candidates]
- Powerful selection at CDF of the largest sample, leading to the best mass measurement
- X(3872) is the first of the charmonium-like states; a lot of effort to understand its nature
Single top search
[Figures, CDF II preliminary, 3.2 fb^{-1}: KIT flavour-separator NN output for single top, tt, Wbb+Wcc, Wc, Wqq, diboson, Z+jets and QCD, normalised to unit area; final NN output with data compared to MC normalised to the SM prediction]
- Part of a big program leading to the discovery of single top production
- Complicated analysis with many different multivariate techniques
- Tagging whether a jet is a b-quark jet or not is done using a NN
B flavour tagging at CDF
- Flavour tagging at hadron colliders is a difficult task
- A lot of potential for improvements
- In the plot, concentrate on the continuous lines
- An example of not-that-good separation
Training with weights
[Figures, CDF Run II preliminary, 2.4 fb^{-1}: Mass(Λ_c π) - Mass(Λ_c) spectrum with about 23300 signal and 472200 background events; fitted Σ_c(2455)^{++} and Σ_c(2520)^{++} → Λ_c^+ π^+ signals]
- One knows statistically whether an event is signal or background
- Or one has a sample which is a mixture of signal and background, plus a pure sample for one class
- Can use data only (this plot is an example of such an application)
- Use only data in the selection
- Can compare data to simulation and reweight the simulation to match the data
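One way to realise training with weights, sketched under my own assumptions rather than taken from the analysis shown: each candidate is assigned a per-event signal probability from a fit to the mass spectrum and enters the (entropy) loss twice, once as signal with weight p and once as background with weight 1 - p:

```python
import math

def weighted_entropy_loss(examples):
    # examples: list of (weight, target, output); entropy loss as before,
    # but each entry carries its own weight w_j
    return -sum(w * math.log(0.5 * (1.0 + t * o)) for w, t, o in examples)

def make_weighted_examples(masses, outputs, signal_fraction):
    """One possible scheme: each candidate enters twice, once as signal with
    weight p_sig(mass) and once as background with weight 1 - p_sig(mass)."""
    examples = []
    for m, o in zip(masses, outputs):
        p = signal_fraction(m)
        examples.append((p, +1, o))
        examples.append((1.0 - p, -1, o))
    return examples

# crude stand-in for a fitted signal fraction vs. mass difference (hypothetical numbers)
frac = lambda m: 0.8 if 0.165 < m < 0.170 else 0.05
masses  = [0.167, 0.158, 0.182]
outputs = [0.9, 0.1, -0.3]
print(weighted_entropy_loss(make_weighted_examples(masses, outputs, frac)))
```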
Examples outside physics
- Medicine and pharma research: e.g. effects and undesirable effects of drugs, early tumour recognition
- Banks: e.g. credit scoring (Basel II), financial time series prediction, valuation of derivatives, risk-minimised trading strategies, client valuation
- Insurances: e.g. risk and cost prediction for individual clients, probability of contract cancellation, fraud recognition, fairness in tariffs
- Trading chain stores: turnover prognosis for individual articles/stores
World's largest students' competition: Data-Mining-Cup
Very successful in competition with other data-mining methods:
- 2005: fraud detection in internet trading
- 2006: price prediction in eBay auctions
- 2007: coupon redemption prediction
- 2008: lottery customer behaviour prediction
NeuroBayes turnover prognosis
Individual health costs
- Pilot project for a large private health insurance
- Prognosis of the costs in the following year for each insured person, with confidence intervals
- 4 years of training, test on the following year
Results:
- Probability density for each customer/tariff combination
- Very good test results!
- Potential for an objective cost reduction in health management
Prognosis of sport events
Results: probabilities for home win, tie and away win
Conclusions
- Neural networks are a powerful tool for your analysis
- They are not a genius, but can do marvellous things for a genius
- A lecture of this kind cannot really make you an expert
I hope that I was successful in:
- giving you enough insight so that a neural network is not a black box to you anymore
- teaching you enough so that you are not afraid to try
Feel free to talk to me; I'm at Campus Süd, building 30.23, room 9-11.