Monday Morning Data Mining

1 Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse

2 Outline: - data mining - IceCube - Data mining in IceCube

3 Computer Scientists are different... Fakultät Physik

6 Building a model and predicting the outcome:

7 Can be broken down to 4 (simple) steps: 1. Find representation of data 2. Find a good algorithm 3. Validate your results 4. Apply on data

8 IceCube in a nutshell: - completed in December - located at the geographic South Pole - Digital Optical Modules on 86 strings - instrumented volume of 1 km³ - subdetectors DeepCore and IceTop

9 IceCube in a nutshell: - Detection principle: Cherenkov light - Look for events of the form: ν + X → e, µ, τ - Dominant background of atmospheric µ → use the Earth as a filter (select upgoing events only)

10 IceCube: Scientific goals - detection of astrophysical neutrinos - atmospheric neutrino energy spectrum - neutrino oscillations - CR-anisotropy - exotic stuff

18 Data Mining in IceCube: - app reconstructed attributes - Data and MC do not necessarily agree - signal/background ratio ~ 10^-3 → interesting for studies within the scope of machine learning

19 1. Finding a good representation of your data

20 Make sure you understand your input: Attributes can be: - nominal (green, blue, red, yellow) - ordinal (cool, mild, hot, with cool < mild < hot) - numerical (1, 2, 3, 4, ...) Labels can be: - polynominal (red, green, yellow, blue) - binominal (signal, background) - numerical (1, 2, 3, ..., 5000, ...)

21 Data Preprocessing: Preselection of parameters 1. Check for consistency (data vs. signal MC vs. background MC) 2. Check for missing values (NaNs, infs). How to handle the NaNs? (see next slide) 3. Eliminate the obvious (azimuth angle, timing information, ...) 4. Eliminate highly correlated and constant parameters

22 Data and MC preprocessing: How to handle NaNs? Several possibilities: - Exclude attributes that exceed a certain number of NaNs - Replace NaNs by: minimum, maximum, average, nothing at all (median, ...)
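The options on this slide could be sketched as follows. This is a minimal pure-Python illustration, not the preprocessing actually used in the analysis; the function names, the 30% threshold, and the attribute names `n_hits` and `bad_fit` are assumptions made up for the example.

```python
import math
from statistics import mean, median


def is_bad(value):
    """True for NaN or +/-inf, the 'missing values' of the slide."""
    return math.isnan(value) or math.isinf(value)


def nan_fraction(column):
    """Fraction of entries in a column that are NaN or inf."""
    return sum(is_bad(v) for v in column) / len(column)


def impute(column, strategy="median"):
    """Replace bad entries by the minimum, maximum, average or median
    of the good entries of the same column."""
    good = [v for v in column if not is_bad(v)]
    fill = {"minimum": min, "maximum": max,
            "average": mean, "median": median}[strategy](good)
    return [fill if is_bad(v) else v for v in column]


def preprocess(table, max_nan_fraction=0.3, strategy="median"):
    """Drop attributes with too many bad entries, impute the rest.
    `table` maps attribute name -> list of floats (hypothetical layout)."""
    return {name: impute(column, strategy)
            for name, column in table.items()
            if nan_fraction(column) <= max_nan_fraction}


table = {
    "n_hits": [10.0, float("nan"), 14.0, 12.0],                 # 25% bad: kept
    "bad_fit": [float("nan"), float("nan"), float("inf"), 1.0],  # 75% bad: dropped
}
cleaned = preprocess(table)
print(cleaned)
```

Whatever replacement is chosen, it presumably has to be applied identically to data, signal MC and background MC, otherwise the consistency check from the previous slide is undermined.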

23 Data and MC preprocessing: Feature Selection 1. Forward Selection: - start with an empty selection - add each unused attribute in turn - estimate the performance - keep the attribute with the highest increase in performance - start a new round
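A minimal sketch of the greedy loop described on this slide. The performance estimate is passed in as a function; `toy_score`, the attribute names and the stopping rule (stop when no attribute improves the score) are assumptions for illustration, since the slide does not say when the rounds end.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy forward selection.

    `score(subset)` is any performance estimate, e.g. a cross-validated
    accuracy. Stops once no unused attribute improves it by `min_gain`."""
    selected = []
    best = score(selected)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        # Try each unused attribute on top of the current selection.
        trial = {a: score(selected + [a]) for a in candidates}
        winner = max(trial, key=trial.get)
        if trial[winner] - best < min_gain:
            break
        selected.append(winner)
        best = trial[winner]
    return selected, best


# Hypothetical usefulness of each attribute; every attribute costs 0.01.
USEFULNESS = {"zenith": 0.30, "n_hits": 0.20, "azimuth": 0.0, "track_length": 0.15}


def toy_score(subset):
    return 0.5 + sum(USEFULNESS[a] for a in subset) - 0.01 * len(subset)


selected, best = forward_selection(list(USEFULNESS), toy_score)
print(selected, round(best, 2))
```

In this toy setup the uninformative `azimuth` is never added, because adding it only costs performance.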

24 Data and MC preprocessing: Feature Selection 2. Backward Elimination: - start with the full set of attributes - remove each of the attributes in turn - estimate the performance for each removed attribute - the attribute giving the least decrease in performance is removed - start a new round
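The mirror image of forward selection can be sketched in the same way. Again, `toy_score`, the attribute names and the stopping rule (stop as soon as every removal would cost more than `max_loss`) are assumptions, not something stated on the slide.

```python
def backward_elimination(attributes, score, max_loss=0.0):
    """Greedy backward elimination.

    Repeatedly removes the attribute whose removal hurts `score` the
    least, as long as the loss does not exceed `max_loss`."""
    selected = list(attributes)
    best = score(selected)
    while len(selected) > 1:
        # Performance of the selection with each single attribute left out.
        trial = {a: score([b for b in selected if b != a]) for a in selected}
        victim = max(trial, key=trial.get)
        if best - trial[victim] > max_loss:
            break
        selected.remove(victim)
        best = trial[victim]
    return selected, best


# Hypothetical usefulness of each attribute; every attribute costs 0.01.
USEFULNESS = {"zenith": 0.30, "n_hits": 0.20, "azimuth": 0.0, "track_length": 0.15}


def toy_score(subset):
    return 0.5 + sum(USEFULNESS[a] for a in subset) - 0.01 * len(subset)


kept, best = backward_elimination(list(USEFULNESS), toy_score)
print(kept, round(best, 2))
```

Each round needs one performance estimate per remaining attribute, so with many attributes backward elimination starts out far more expensive than forward selection.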

25 Backward Elimination in RapidMiner:

26 Data and MC preprocessing: Feature Selection 3. Minimum Redundancy Maximum Relevance (MRMR): iteratively add the features with the biggest relevance and the least redundancy. Quality criterion Q: $Q = R(x, y) - \frac{1}{j} \sum_{x' \in F_j} D(x, x')$ with R: relevance; D: redundancy; F_j: set of the j already selected features
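A small sketch of the iterative MRMR selection. The slide does not say which measures are used for R and D; here the absolute Pearson correlation is assumed for both, and the attribute names and values are invented. Constant attributes would cause a division by zero and are assumed to have been removed beforehand.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def mrmr(features, label, n_select):
    """Select `n_select` feature names by Q = relevance - mean redundancy."""
    selected = []
    while len(selected) < n_select:
        def quality(name):
            relevance = abs(pearson(features[name], label))
            if not selected:  # first round: relevance only
                return relevance
            redundancy = sum(abs(pearson(features[name], features[s]))
                             for s in selected) / len(selected)
            return relevance - redundancy

        remaining = [name for name in features if name not in selected]
        selected.append(max(remaining, key=quality))
    return selected


label = [0, 0, 1, 1]
features = {
    "n_hits": [1, 2, 3, 4],         # most relevant
    "n_hits_alt": [1, 2, 3, 5],     # relevant, but nearly a copy of n_hits
    "track_quality": [1, 0, 1, 2],  # less relevant, but less redundant
}
print(mrmr(features, label, 2))
```

Ranking by relevance alone would pick the two nearly identical attributes; the redundancy term makes the second pick `track_quality` instead.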

27 MRMR in RapidMiner:

28 Evaluating the Stability of the Parameter Selection: - Data and MC are subject to a certain variance → this variance influences the parameter selection!

29 Stability of the MRMR Selection: Jaccard Index: $J = \frac{|A \cap B|}{|A \cup B|}$ Kuncheva's Index: $I_C(A, B) = \frac{rn - k^2}{k(n - k)}$ with $k = |A| = |B|$, $r = |A \cap B|$ and n the total number of attributes
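Both indices compare two selected attribute sets A and B, for example the selections obtained on two different subsamples. A short sketch, with the sets and n = 10 chosen arbitrarily for illustration:

```python
def jaccard(a, b):
    """Jaccard index |A n B| / |A u B| of two feature selections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def kuncheva(a, b, n):
    """Kuncheva's index for two selections of equal size k out of n
    attributes; corrects the overlap r for chance agreement."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("both selections must have the same size k")
    if not 0 < k < n:
        raise ValueError("need 0 < k < n")
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))


A = {"zenith", "n_hits", "track_length"}
B = {"n_hits", "track_length", "energy"}
print(jaccard(A, B))         # 2 shared out of 4 distinct attributes
print(kuncheva(A, B, n=10))  # (2*10 - 9) / (3*7)
```

Identical selections give 1 for both indices; for Kuncheva's index, an overlap at the level expected by chance gives a value around 0.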

31 2. Learning algorithms

32 Learners: 1. Decision Trees 2. Naive Bayes 3. k - Nearest Neighbours 4. Random Forests 5. Boosting

33 A bit more technically speaking: - set of vectors x = (x_1, x_2, ..., x_n); x_i = attribute (attributes = features, variables, parameters) - labels y_1, y_2, ..., y_n (labels = target class) - create a model f from your examples that predicts a y for a given x.

34 Constructing a simple model:

35 Decision Trees: Simple Classifier!

36 Naive Bayes: - based on Bayes' theorem: $\Pr[H \mid E] = \frac{\Pr[E \mid H] \, \Pr[H]}{\Pr[E]}$ - assumes all attributes are independent (given the class)

37 Naive Bayes: Golf data

38 Naive Bayes: Play? outlook = sunny, temperature = cool, humidity = high, windy = true

39 Naive Bayes: $\Pr[\text{yes} \mid E] = \frac{2/9 \cdot 3/9 \cdot 3/9 \cdot 3/9 \cdot 9/14}{\Pr[E]}$

40 Naive Bayes: $\Pr[\text{yes} \mid E] = \frac{2/9 \cdot 3/9 \cdot 3/9 \cdot 3/9 \cdot 9/14}{\Pr[E]}$; Pr[no | E] follows analogously. Pr[E] is fixed by normalization, Pr[yes | E] + Pr[no | E] = 1: Pr[yes | E] = 0.205, Pr[no | E] = 0.795
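The numbers on this slide can be reproduced with a few lines. The "yes" factors are taken from the slide; the "no" factors (3/5, 1/5, 4/5, 3/5 and the prior 5/14) are not legible in this transcript and are assumed here from the standard 14-day golf/weather data set, which reproduces the quoted 0.795.

```python
from fractions import Fraction as F

# Priors and per-attribute likelihoods for the evidence
# outlook = sunny, temperature = cool, humidity = high, windy = true.
prior = {"yes": F(9, 14), "no": F(5, 14)}
likelihood = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],  # from the slide
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],   # assumed, standard golf data
}


def posterior(prior, likelihood):
    """Naive Bayes posterior: multiply prior and likelihoods per class,
    then normalize so that the class probabilities sum to 1."""
    unnormalized = {}
    for cls in prior:
        p = prior[cls]
        for factor in likelihood[cls]:
            p *= factor
        unnormalized[cls] = p
    total = sum(unnormalized.values())  # this plays the role of Pr[E]
    return {cls: p / total for cls, p in unnormalized.items()}


post = posterior(prior, likelihood)
print({cls: round(float(p), 3) for cls, p in post.items()})
```

Pr[E] never has to be computed explicitly; it drops out in the normalization step.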

41 Naive Bayes: What if Pr[E_i | yes] = 0? Let's assume we do not have positive examples for outlook = rainy. Use Laplace correction! - Pr[sunny | yes] = 4/9 → 5/12 - Pr[overcast | yes] = 5/9 → 6/12 - Pr[rainy | yes] = 0/9 → 1/12
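The correction on this slide adds one to every count and, correspondingly, the number of possible values to the denominator. A minimal sketch using the counts from the slide (the function name and the `alpha` parameter are illustrative):

```python
from fractions import Fraction


def laplace(counts, alpha=1):
    """Laplace-corrected probabilities: (count + alpha) / (total + alpha * K),
    where K is the number of distinct attribute values."""
    total = sum(counts.values())
    n_values = len(counts)
    return {value: Fraction(count + alpha, total + alpha * n_values)
            for value, count in counts.items()}


# Counts of outlook values among the 9 "yes" examples, as on the slide.
outlook_given_yes = {"sunny": 4, "overcast": 5, "rainy": 0}
corrected = laplace(outlook_given_yes)
print(corrected)
```

No probability is exactly zero after the correction, so a single unseen attribute value can no longer force the whole product to zero.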

42 k-nearest Neighbours (k-NN): - memory-based classifier - lazy learner: no explicit model is trained (it is still supervised, since labeled examples are needed) - find the k neighbours closest to x and classify by majority vote - all features should be normalized
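A minimal k-NN sketch. It assumes the features are already normalized to comparable ranges and uses the Euclidean distance, which the slide does not specify; the training examples are invented.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    examples. `train` is a list of (feature_tuple, label) pairs."""
    neighbours = sorted(train, key=lambda example: dist(example[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [
    ((0.1, 0.2), "background"),
    ((0.2, 0.1), "background"),
    ((0.9, 0.8), "signal"),
    ((0.8, 0.9), "signal"),
    ((0.7, 0.7), "signal"),
]
print(knn_predict(train, (0.75, 0.80)))
print(knn_predict(train, (0.00, 0.00)))
```

All the work happens at prediction time: every query is compared against the full set of stored examples, which is why the method is called memory based.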

43 Random Forests: - ensemble of decision trees - developed by Leo Breiman (2001) - no boosting between individual trees - events are classified by individual trees - uses the average for the final classification: $s = \frac{1}{n_{\text{trees}}} \sum_{i=1}^{n_{\text{trees}}} s_i$

44 Random Forests: Output - MC scaled to data expectations - choose final cut on signalness
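The averaging of the individual tree decisions into a signalness, followed by a cut, can be sketched as below. The four "trees" are hand-written stand-ins rather than trained decision trees, and the attribute names, thresholds and the cut value of 0.5 are purely illustrative assumptions.

```python
def signalness(trees, event):
    """Average of the individual tree decisions s_i (1 = signal, 0 = background)."""
    votes = [tree(event) for tree in trees]
    return sum(votes) / len(votes)


# Hypothetical stand-ins for trained trees: each maps an event to 0 or 1.
trees = [
    lambda e: 1 if e["zenith"] > 90 else 0,
    lambda e: 1 if e["n_hits"] > 20 else 0,
    lambda e: 1 if e["track_length"] > 200 else 0,
    lambda e: 1 if e["zenith"] > 100 else 0,
]

event = {"zenith": 120, "n_hits": 15, "track_length": 350}
s = signalness(trees, event)
CUT = 0.5  # hypothetical final cut on signalness
is_signal = s >= CUT
print(s, is_signal)
```

Tightening the cut trades signal efficiency for purity, which is presumably why it is chosen only after comparing the scaled MC output to data.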

45 Random Forests in RapidMiner

46 Weka Random Forest:

47 Boosting: - uses an ensemble of weak classifiers (decision trees) - weights are increased for misclassified events - a weighted vote is applied - each classifier depends on the performance of the previous ones
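The reweighting idea can be illustrated with discrete AdaBoost on one-dimensional toy data. This is a sketch under assumptions: the weak classifiers are threshold stumps (one-level decision trees), the labels are +1 (signal) and -1 (background), and the data points are invented; it is not the RapidMiner implementation.

```python
import math


def train_stump(xs, ys, weights):
    """Find the threshold stump with the lowest weighted error.
    A stump predicts `sign` for x >= threshold and `-sign` otherwise."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if (sign if x >= threshold else -sign) != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best


def adaboost(xs, ys, rounds=3):
    """Return a list of (alpha, threshold, sign) weak classifiers."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        error, threshold, sign = train_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this stump
        ensemble.append((alpha, threshold, sign))
        # Misclassified events get larger weights, correct ones smaller.
        weights = [w * math.exp(-alpha * y * (sign if x >= threshold else -sign))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * (sign if x >= threshold else -sign)
                for alpha, threshold, sign in ensemble)
    return 1 if score >= 0 else -1


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump can separate this data set (the best one misclassifies three of the ten events), but the weighted vote of three stumps classifies all training events correctly.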

49 AdaBoost in RapidMiner

51 3. Validating the results

52 Split Validation:

53 Cross Validation:

54 Split Validation vs. Cross Validation:

55 Cross validated predictions: table with columns Cut, Nugen, Corsika, Sum

56 Cross Validation for a limited number of examples? YES! Leave One Out!
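A sketch of k-fold cross validation, with leave-one-out as the special case k = number of examples. The "learner" is a trivial majority-class predictor and the six labels are invented; both are stand-ins so that the splitting logic stays visible.

```python
from collections import Counter


def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds.
    Every example is used for testing exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(labels, k):
    """Accuracy per fold of a majority-class predictor (toy learner)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        hits = sum(labels[i] == majority for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


labels = ["background"] * 5 + ["signal"]
three_fold = cross_validate(labels, k=3)
leave_one_out = cross_validate(labels, k=len(labels))
print(three_fold)
print(leave_one_out)
```

The spread of the per-fold scores gives an estimate of the uncertainty of the performance, which a single split validation cannot provide.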

57 4. Application on data

58 Change the scaling of the Corsika such that it matches data for signalness > 0.2

59 Data/MC mismatch: Underestimation of Background

60 Application of RF on 10% of data: table with columns Cut, Nugen, Corsika, Sum, Data

61 Possible Improvements: Ensembles

62 Hierarchical Clustering: Agglomerative

63 Hierarchical Clustering: Divisive

64 k-means Clustering: - pick k means at random - calculate the distance of each example to the means - assign each example to the cluster of the closest mean - recalculate the mean of each cluster - reiterate until the means do not change any longer. Significantly faster than hierarchical clustering, but k has to be known in advance...
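The loop on this slide can be written down almost literally. In this sketch the Euclidean distance is assumed, the points are invented, and the optional `initial_means` argument is an addition that makes the example reproducible; by default the means are picked at random, as on the slide.

```python
import random
from math import dist


def k_means(points, k, initial_means=None, seed=0, max_iter=100):
    """Plain k-means on a list of equally long tuples.
    Returns (means, clusters); clusters[i] belongs to means[i]."""
    rng = random.Random(seed)
    means = list(initial_means) if initial_means else rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assign every example to the cluster of its closest mean.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(point, means[i]))
            clusters[nearest].append(point)
        # Recalculate the means; an empty cluster keeps its old mean.
        new_means = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else means[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_means == means:  # converged: the means no longer change
            break
        means = new_means
    return means, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
means, clusters = k_means(points, k=2, initial_means=[(0.0, 0.0), (0.0, 1.0)])
print(means)
print(clusters)
```

Even though both starting means lie in the same group here, the two groups are recovered after a few iterations. In general the result depends on the starting means, so several random restarts are a common precaution.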

65 Careful when using clusters: Normalize!!!
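One common way to normalize is min-max scaling of every attribute to the range [0, 1]. The slide does not say which normalization is meant, so this choice and the example values are assumptions.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` (a list of equally long tuples) to [0, 1].
    Constant columns are mapped to 0.0."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple((value - low) / (high - low) if high > low else 0.0
              for value, low, high in zip(row, lows, highs))
        for row in rows
    ]


# Second attribute is 1000 times larger and would dominate any distance.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
normalized = min_max_normalize(rows)
print(normalized)
```

Without this step, distance-based methods such as k-means or k-NN are dominated by whichever attribute happens to have the largest numerical range.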

66 Summary: - IceCube is interesting for detailed studies in machine learning - studies can be carried out using RapidMiner - MRMR for Feature Selection - Simple learners are good for benchmarks - Cross Validation is good for you! - Signal/Background separation using data mining is possible!
