Monday Morning Data Mining Tim Ruhe, Statistical Methods of Data Analysis (Statistische Methoden der Datenanalyse)
Outline: - Data mining - IceCube - Data mining in IceCube
Computer Scientists are different... Fakultät Physik
Building a model and predicting the outcome:
Can be broken down into 4 (simple) steps: 1. Find a representation of your data 2. Find a good algorithm 3. Validate your results 4. Apply to data
IceCube in a nutshell: - completed in December 2010 - located at the geographic South Pole - 5160 Digital Optical Modules on 86 strings - instrumented volume of 1 km³ - subdetectors DeepCore and IceTop
IceCube in a nutshell: - Detection principle: Cherenkov light - Look for events of the form: ν + X → ℓ + X′ (ℓ = e, µ, τ) - Dominant background of atmospheric µ → use the Earth as a filter (select upgoing events only)
IceCube: Scientific goals - detection of astrophysical neutrinos - atmospheric neutrino energy spectrum - neutrino oscillations - CR-anisotropy - exotic stuff
Data Mining in IceCube: - approx. 2600 reconstructed attributes - data and MC do not necessarily agree - signal/background ratio ~ 10⁻³ → interesting for studies within the scope of machine learning
1. Finding a good representation of your data
Make sure you understand your input:
Attributes can be:
- nominal: green, blue, red, yellow
- ordinal: cool, mild, hot (cool < mild < hot)
- numerical: 1, 2, 3, 4, ...
Labels can be:
- polynominal: red, green, yellow, blue
- binominal: signal, background
- numerical: 1, 2, 3, ..., 5000, ...
Data Preprocessing: Preselection of parameters 1. Check for consistency (data vs. signal MC vs. background MC) 2. Check for missing values (NaNs, Infs) How to handle the NaNs? (see next slide) 3. Eliminate the obvious (azimuth angle, timing information, ...) 4. Eliminate highly correlated and constant parameters
Data and MC preprocessing: How to handle NaNs? Several possibilities: - Exclude attributes that exceed a certain number of NaNs - Replace by: - minimum - maximum - average - median - nothing at all
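The replacement strategies above can be sketched in a few lines of numpy (function names and interface are my own, not from the slides):

```python
import numpy as np

def nan_fraction(column):
    """Fraction of entries that are NaN or Inf; useful for deciding
    whether to exclude an attribute altogether."""
    col = np.asarray(column, dtype=float)
    return float(np.mean(~np.isfinite(col)))

def impute(column, strategy="median"):
    """Replace NaNs/Infs in a numeric column by a summary of the finite values."""
    col = np.asarray(column, dtype=float)
    finite = col[np.isfinite(col)]
    fill = {"min": finite.min(), "max": finite.max(),
            "mean": finite.mean(), "median": np.median(finite)}[strategy]
    out = col.copy()
    out[~np.isfinite(out)] = fill
    return out
```

For example, `impute([1.0, 2.0, float("nan"), 4.0], "median")` fills the gap with 2.0.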
Data and MC preprocessing: Feature Selection 1. Forward Selection: - start with an empty selection - add each unused attribute in turn and estimate the performance - keep the attribute with the highest increase in performance - start a new round
Data and MC preprocessing: Feature Selection 2. Backward Elimination: - start with the full set of attributes - remove each attribute in turn and estimate the performance for each removed attribute - remove the attribute giving the least decrease in performance - start a new round
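Both greedy wrappers fit in a few lines; here is a minimal sketch of forward selection (the `score` function, which evaluates the performance of a feature subset, is assumed to be supplied by the user):

```python
def forward_selection(features, score):
    """Greedy forward selection: repeatedly add the unused attribute with
    the highest increase in performance; stop when nothing improves."""
    selected = []
    remaining = list(features)
    best = score(selected)
    while remaining:
        # try each unused attribute and estimate the performance
        top_score, top_feature = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best:
            break  # no attribute increases the performance any more
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected
```

Backward elimination is the mirror image: start from the full `list(features)` and repeatedly drop the attribute whose removal costs the least performance.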
Backward Elimination in RapidMiner:
Data and MC preprocessing: Feature Selection 3. Minimum Redundancy Maximum Relevance (MRMR): iteratively add the feature with the highest relevance and the least redundancy. Quality criterion Q: Q = R(x_j, y) − (1/|F|) · Σ_{x_i ∈ F} D(x_j, x_i), with R: relevance, D: redundancy, F: set of already selected features
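The greedy MRMR loop can be sketched as follows; the relevance R and redundancy D are assumed here to be precomputed lookup tables (in practice they would be, e.g., mutual-information estimates):

```python
def mrmr_select(candidates, relevance, redundancy, n_features):
    """Greedy MRMR: relevance[x] = R(x, y); redundancy[frozenset((u, v))] = D(u, v).
    Iteratively add the feature with the best relevance/redundancy trade-off."""
    selected = [max(candidates, key=lambda x: relevance[x])]  # most relevant first
    pool = [x for x in candidates if x != selected[0]]
    while pool and len(selected) < n_features:
        def q(x):  # Q = R(x, y) - mean redundancy with the already selected set F
            mean_d = sum(redundancy[frozenset((x, s))] for s in selected) / len(selected)
            return relevance[x] - mean_d
        best = max(pool, key=q)
        selected.append(best)
        pool.remove(best)
    return selected
```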
MRMR in RapidMiner:
Evaluating the Stability of the Parameter Selection: - Data and MC are subject to a certain variance → this variance influences the parameter selection!
Stability of the MRMR Selection:
Jaccard index: J(A, B) = |A ∩ B| / |A ∪ B|
Kuncheva's index: I_C(A, B) = (r·n − k²) / (k·(n − k)), with r = |A ∩ B|, k = |A| = |B|, n = total number of features
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.6458&rep=rep1&type=pdf
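Both stability indices are straightforward to compute for two feature selections A and B (a minimal sketch):

```python
def jaccard(A, B):
    """Jaccard index: J(A, B) = |A intersect B| / |A union B|."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

def kuncheva(A, B, n):
    """Kuncheva's index for two selections of equal size k out of n features:
    I_C(A, B) = (r*n - k^2) / (k*(n - k)), with r = |A intersect B|.
    Unlike Jaccard, it corrects for the overlap expected by chance."""
    A, B = set(A), set(B)
    k = len(A)
    r = len(A & B)
    return (r * n - k * k) / (k * (n - k))
```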
2. Learning algorithms
Learners: 1. Decision Trees 2. Naive Bayes 3. k - Nearest Neighbours 4. Random Forests 5. Boosting
A bit more technically speaking: set of vectors x = (x_1, x_2, ..., x_n), where each x_i is an attribute (attributes = features, variables, parameters); labels y_1, y_2, ..., y_n (label = target class). Create a model f from your examples that predicts a y for a given x.
Constructing a simple model:
Decision Trees: Simple Classifier!
Naive Bayes: - based on Bayes' theorem: Pr[H | E] = Pr[E | H] · Pr[H] / Pr[E] - assumes all attributes are (conditionally) independent given the class
Naive Bayes: Golf data
Naive Bayes: Play? outlook = sunny, temperature = cool, humidity = high, windy = true
Naive Bayes: Pr[yes | E] = (2/9 · 3/9 · 3/9 · 3/9 · 9/14) / Pr[E]
Naive Bayes: Pr[yes | E] = 0.0053 / Pr[E], Pr[no | E] = 0.0206 / Pr[E] → the posteriors need to be normalized! Pr[yes | E] = 0.205, Pr[no | E] = 0.795
Naive Bayes: What if Pr[E_i | yes] = 0? Let's assume we do not have positive examples for outlook = rainy: Pr[sunny | yes] = 4/9 → 5/12, Pr[overcast | yes] = 5/9 → 6/12, Pr[rainy | yes] = 0/9 → 1/12. Use the Laplace correction!
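The whole worked example can be reproduced in a few lines from Weka's classic golf table (a sketch; the attribute order is outlook, temperature, humidity, windy):

```python
# Weka's "golf" data: (outlook, temperature, humidity, windy, play)
GOLF = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def posteriors(instance):
    """Naive Bayes: Pr[c | E] is proportional to Pr[c] * prod_i Pr[E_i | c];
    normalize at the end."""
    scores = {}
    for c in ("yes", "no"):
        rows = [r for r in GOLF if r[-1] == c]
        p = len(rows) / len(GOLF)                # class prior
        for i, value in enumerate(instance):     # conditional independence
            p *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = p
    norm = sum(scores.values())
    return {c: p / norm for c, p in scores.items()}
```

`posteriors(("sunny", "cool", "high", "true"))` yields approximately {"yes": 0.205, "no": 0.795}, matching the slide.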
k-Nearest Neighbours (k-NN): - memory-based (lazy) classifier: supervised, but without an explicit training phase - find the k neighbours closest to x and classify by majority vote - all features should be normalized
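A majority-vote k-NN fits in a few lines (a minimal numpy sketch using Euclidean distance; as noted above, it assumes the features were normalized beforehand):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote among its k nearest training examples."""
    X_train = np.asarray(X_train, dtype=float)
    d = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                       # indices of the k closest points
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```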
Random Forests: - ensemble of decision trees - developed by Leo Breiman (2001) - no boosting between individual trees - events are classified by the individual trees - the average over the tree outputs gives the final classification: s = (1/n_trees) · Σ_{i=1}^{n_trees} s_i
Random Forests: Output - MC scaled to data expectations - choose the final cut on signalness
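The slides use RapidMiner and Weka; purely as an illustration, assuming scikit-learn is available, a cut on the averaged tree vote ("signalness") could be sketched like this (the toy Gaussian samples stand in for signal and background MC):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_sig = rng.normal(2.0, 1.0, size=(500, 5))   # toy "signal" sample
X_bkg = rng.normal(0.0, 1.0, size=(500, 5))   # toy "background" sample
X = np.vstack([X_sig, X_bkg])
y = np.array([1] * 500 + [0] * 500)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# signalness = average vote of the individual trees = predicted signal probability
signalness = forest.predict_proba(X)[:, 1]
selected = X[signalness > 0.99]  # apply a hard final cut on signalness
```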
Random Forests in RapidMiner:
Weka Random Forest:
Boosting: - uses an ensemble of weak classifiers (decision trees) - weights are increased for misclassified events - a weighted vote is applied - each classifier depends on the performance of the previous ones
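The weight update is the heart of AdaBoost; here is a minimal numpy sketch using 1-D threshold stumps as the weak classifiers (my own toy implementation, not the analysis code):

```python
import numpy as np

def adaboost_train(x, y, n_rounds=10):
    """Minimal AdaBoost with 1-D threshold stumps; labels y in {-1, +1}."""
    w = np.full(len(x), 1.0 / len(x))   # start with uniform event weights
    ensemble = []                        # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        best = None
        for thr in np.unique(x):         # exhaustive search for the best stump
            for sign in (+1, -1):
                pred = np.where(x < thr, -sign, sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)   # weight of this weak classifier
        pred = np.where(x < thr, -sign, sign)
        w *= np.exp(-alpha * y * pred)            # increase weights of misclassified events
        w /= w.sum()
        ensemble.append((alpha, thr, sign))
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of the weak classifiers."""
    score = sum(a * np.where(x < thr, -s, s) for a, thr, s in ensemble)
    return np.where(score >= 0, 1, -1)
```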
AdaBoost in RapidMiner:
3. Validating the results
Split Validation:
Cross Validation:
Split Validation vs. Cross Validation:
Cross validated predictions:
Cut     Nugen       Corsika    Sum
0.990   4817 ± 44   93 ± 38    4910 ± 58
0.992   4633 ± 43   80 ± 30    4633 ± 52
0.994   4414 ± 41   57 ± 30    4414 ± 51
0.996   4122 ± 32   49 ± 26    4122 ± 41
0.998   3695 ± 46   18 ± 17    3695 ± 49
1.000   2932 ± 33   4 ± 9      2932 ± 34
Cross Validation for a limited number of examples? YES! Leave One Out!
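Purely as an illustration, assuming scikit-learn is available (the slides use RapidMiner), k-fold and leave-one-out cross validation on a toy sample:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

scores = cross_val_score(GaussianNB(), X, y, cv=10)          # 10-fold cross validation
loo = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())  # one fold per example
```

Leave-one-out trains on all but one example and tests on the one left out, so it wastes no training data, which is exactly what you want for a limited number of examples.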
4. Application on data
Change the scaling of the Corsika MC such that it matches data for signalness > 0.2
Data/MC mismatch: Underestimation of Background
Application of RF on 10% of data:
Cut     Nugen       Corsika    Sum         Data
0.990   4817 ± 44   93 ± 38    4910 ± 58   4988
0.992   4633 ± 43   80 ± 30    4633 ± 52   4757
0.994   4414 ± 41   57 ± 30    4414 ± 51   4476
0.996   4122 ± 32   49 ± 26    4122 ± 41   4134
0.998   3695 ± 46   18 ± 17    3695 ± 49   3638
1.000   2932 ± 33   4 ± 9      2932 ± 34   2833
Possible Improvements: Ensembles
Hierarchical Clustering: Agglomerative
Hierarchical Clustering: Divisive
k-means Clustering: - pick k means at random - calculate the distance of each example to the means - assign each example to the closest cluster - recalculate the mean of each cluster - reiterate until the means do not change any longer. Significantly faster than hierarchical clustering, but you have to know k in advance...
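The loop above in plain numpy (a minimal sketch; a real analysis would use a library implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: pick k means at random, then alternate between assigning
    examples to the closest mean and recalculating the cluster means."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign every example to its closest mean
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # recalculate the mean of each cluster (keep the old mean if a cluster empties)
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break  # means do not change any longer
        means = new_means
    return labels, means
```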
Careful when using clusters: Normalize!!!
Summary: - IceCube is interesting for detailed studies in machine learning - studies can be carried out using RapidMiner - MRMR for Feature Selection - Simple learners are good for benchmarks - Cross Validation is good for you! - Signal/Background separation using data mining is possible!