Day 2: Machine Learning & Data Mining Basics. Beibei Li Carnegie Mellon University


1 Day 2: Machine Learning & Data Mining Basics. Beibei Li, Carnegie Mellon University

2 Getting a job? "It's not what you know, but who you know." What exactly matters: the size of your network, or its quality (the strength of each tie)? How strong or weak is a social network contact? Tie strength is judged by frequency of contact, amount of time spent, emotional intensity, and intimacy (mutual confiding). Who is more helpful: a close friend or an acquaintance?

3 Strength of Weak Ties

4 Why? The stronger the tie between two individuals, the larger the proportion of people to whom they are both tied, so a strong tie is less likely to expand your view with new information. Weak ties are usually bridges: lines in a network that provide the only path between two points. Weak ties fill the structural holes in the network! Examples: job hunting, innovation (new product adoption), rumor spreading, ...

5 Not only what you know, but also who you know

6 What BI is and why it is important. BI cases. BI and tools.

7 A Hilton manager looks at all the bookings this week. Unbalanced demand: weekends (beach) vs. weekdays (highway). A better pricing strategy is needed! Traveler demographics: business vs. family/romance. Will a similar pattern appear next week? What are the expected profits for next week/month? Optimal pricing maximizes expected profits (personalization: traveler, date, room type, ...).

8 A List of Leading BI Vendors

9 Typical Architecture of a Current BI Tool: SQL Server 2008. Layers: Integrate (data warehousing / ETL / EAI), Report (OLAP / BI), Analyze (data mining), Decision (BPM / action). Abbreviations: BPM = business performance management; OLAP = online analytical processing; ETL = extract, transform, load; EAI = enterprise application integration.

10 SAS's BI Architecture

11 The BI market is trending toward analytics

12 SAS Enterprise Miner. Why? Easy handling of vast amounts of data; intuitive interface; no need for SAS programming.

13 SAS EM Interface

14 SAS Business Analytics Suites: SAS Enterprise Miner documents; SAS Enterprise Miner link; SAS social media analytics (with a demo); SAS BI dashboard demo; SAS analytics with a demo; and more.

15 SAS EM Analytic Strengths: Unsupervised Learning, Supervised Learning

16 Day 1: BI & DA Overview, Business Cases - Individual Assignment
Day 2: Machine Learning & Data Mining Basics - Group Assignment
Day 3: Predictive Modeling vs. Causal Inferences
- How to Interpret Regression Results
- Causal Identification Strategies
- Economic Value of Online Word-of-Mouth
- Social Network Influence
- Multichannel Advertising Attribution
- Randomized Field Experiment of Mobile Recommendation
Day 4: Bridging Machine Learning with Social Science
- Case 1: Interplay Between Social Media & Search Engine
- Case 2: Understand and Predict Consumer Search and Purchase Behavior
- Case 3: Text Mining & Sponsored Search Advertising

17 (Model-Free) Data Exploration & Visualization. Unsupervised Learning (Pattern Discovery): Market Basket Analysis, Association Rules, Clustering. Supervised Learning (Predictive Modeling): Decision Tree, Linear Regression, Logistic Regression.

18 Why visualize? What about using summary statistics tables instead? Different data sets can share the same average of X, the same variance of X, the same average of Y, the same variance of Y, the same correlation between X and Y, and the same fitted linear regression, y = 3 + x/2.
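The statistics listed on this slide match the classic Anscombe's quartet, so the sketch below assumes that is the data set being shown (the slide itself does not name it). It computes the summary statistics for the four published series using only the standard library; the helper name `summarize` is made up for this illustration.

```python
from statistics import mean, variance

# Anscombe's quartet (assumed to be the data behind the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics named on the slide for one data set."""
    mx, my = mean(x), mean(y)
    # Sample covariance between x and y.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = sxy / variance(x)
    return {
        "mean_x": mx,
        "var_x": variance(x),
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (variance(x) * variance(y)) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

All four rows print nearly identical numbers, which is the point of the slide: the table cannot distinguish the data sets, while a plot can.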

19 Your brain can efficiently process properly visualized data.

20 An approach to analyzing data sets that summarizes their main characteristics in an easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis. Sometimes we call it "model-free evidence". It helps to formulate hypotheses that can be tested on new data sets. What if we still used only summary statistics tables?

21 SEMMA Variable Plot


24 A histogram shows the entire distribution of one particular variable. Each column's height is the count of items that fall into its bin. Bin width is a parameter you can play with: wider bins give a smoother plot, while narrower bins can yield erratic plots.
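A minimal sketch of the binning idea, assuming fixed-width bins that start at multiples of the bin width. The `ages` values and the function name `histogram` are hypothetical and only serve the illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Bins are keyed by their left edge, e.g. with bin_width=5 the value 22
    falls into the bin starting at 20.
    """
    counts = Counter()
    for v in values:
        bin_start = (v // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical customer ages, for illustration only.
ages = [21, 22, 25, 31, 33, 34, 35, 41, 47, 52]

narrow = histogram(ages, 5)   # many small bins: more detail, more erratic
wide = histogram(ages, 20)    # few large bins: smoother, less detail

print(narrow)
print(wide)
```

The same ten values produce seven small columns with narrow bins but only two columns with wide bins, which is the trade-off the slide describes.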


26 A box plot displays differences between subpopulations in your data. The furthest lines are the min and max. The box spans the 25th to 75th percentiles. The thick line marks the 50th percentile (the median).

27 A scatter plot suggests correlation between two variables. Correlations may be positive (rising), negative (falling), or null (uncorrelated). A line of best fit (also called a trendline) can be drawn. It can also show nonlinear relationships between variables.

28 Shows individual components as well as the cumulative total.

29 Shows a variable over time. Allows comparison between different variables. Can show trends or time relationships between variables.

30 (Model-Free) Data Exploration & Visualization. Unsupervised Learning (Pattern Discovery): Market Basket Analysis, Association Rules, Clustering. Supervised Learning (Predictive Modeling): Decision Tree, Linear Regression, Logistic Regression.

31 Market Basket Analysis: the most important part of a business is knowing what merchandise customers are buying, and when. Association Rules: building association rules; how good are association rules? Clustering: grouping similar items; consumer segmentation.


33 Product Bundling. Why does Expedia bundle flights, car rentals, and hotels? Why did satellite TV providers start offering Internet connections? (Xfinity, DSL, cable TV, phone service, wireless.) Why does MS Office contain Word, Excel, PowerPoint, and Access? Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]. Customers who purchase maintenance agreements are very likely to purchase large appliances (Berry and Linoff experience). Hospital visitors ask for more shopping outlets (HBR Review, May 2009, p.21).

34 Customers tend to buy things together. What can we learn from the basket?


36
Transaction No.  Item 1     Item 2     Item 3     ...  Item N
100              Beer       Diaper     Chocolate
101              Milk       Chocolate  Shampoo
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper
104              Ice Cream  Diaper     Beer

37
Trans No.  Item 1     Item 2     Item 3     Day  Time     Customer Info.
100        Beer       Diaper     Chocolate  Fri  6:15pm   Male, 30, ...
101        Milk       Chocolate  Shampoo    Sun  10:10am  Female, 25, ...
102        Beer       Wine       Vodka      Sat  5:30pm   Male, 24, ...
103        Beer       Cheese     Diaper     Fri  6:30pm   Male, 32, ...
104        Ice Cream  Diaper     Beer       Fri  7:00pm   Male, 28, ...

38 Purchases were mainly made by men on Friday evenings, 6pm to 7pm. Action: put the premium beer display next to the diapers (up-sell/cross-sell). Beer sales skyrocketed!

39 Basket data: a collection of transactions, each consisting of the set of items bought in that transaction. Association rules with basket data: learn which items are frequently bought together. Purpose? Product assortment for supermarkets; bundling; customer segmentation based on buying behavior; cross-selling/recommendation; catalog design; web site design; etc.

40 Learning Association Rules from Data. A descriptive approach for discovering interesting associations between items in a data set. Examples: If {buy diapers} Then {buy beer}. If {made an order last year & age > 25} Then {apply for VIP card}.

41 Rule format: If {set of conditions} Then {set of results}. The condition set is the body (LHS); the result set is the head (RHS). Example: If {Diapers} Then {Beer}. The body (condition) implies the head (result), where body and head are conjunctions of items. The direction of the rule matters!

42 What rules should be considered valid? Given the five transactions of slide 36: If {Diapers} Then {Beer}? If {Diapers} Then {Ice Cream}? Two basic evaluation measures: the support and the confidence of the rule.

43 Support. Support measures the relevance of a rule: the frequency of transactions in which body and head co-occur.
Support = (no. of transactions containing the items in both body and head) / (total no. of transactions in the data set)
Using the five transactions of slide 36 and the rule If {Diapers} Then {Beer} (body = Diapers, head = Beer): support is 3/5, i.e., 60% of the transactions include both items. Diapers => Ice Cream? 1/5. Beer => Diapers? 3/5.

44 Confidence. Confidence measures the strength of a rule: the proportion of transactions containing the head, conditional on containing the body.
Confidence = (no. of transactions containing both body and head) / (no. of transactions containing the body)
Using the five transactions of slide 36 and the rule If {Diapers} Then {Beer}: confidence is 3/3, i.e., beer is also bought in 100% of the transactions in which diapers are bought. Diapers => Ice Cream? 1/3. Beer => Diapers? 3/4.
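The two definitions can be checked directly against the five-transaction table from the slides. This is a small sketch, not a library API: the function names `support` and `confidence` and the set-based representation are choices made for this illustration.

```python
# The five transactions from the slides, keyed by transaction number.
transactions = {
    100: {"Beer", "Diaper", "Chocolate"},
    101: {"Milk", "Chocolate", "Shampoo"},
    102: {"Beer", "Wine", "Vodka"},
    103: {"Beer", "Cheese", "Diaper"},
    104: {"Ice Cream", "Diaper", "Beer"},
}


def count_containing(items):
    """Number of transactions that contain every item in `items`."""
    return sum(items <= basket for basket in transactions.values())


def support(body, head):
    """Share of all transactions in which body and head co-occur."""
    return count_containing(body | head) / len(transactions)


def confidence(body, head):
    """Share of the transactions containing the body that also contain the head."""
    return count_containing(body | head) / count_containing(body)


for body, head in [({"Diaper"}, {"Beer"}),
                   ({"Diaper"}, {"Ice Cream"}),
                   ({"Beer"}, {"Diaper"})]:
    print(body, "=>", head,
          "support:", support(body, head),
          "confidence:", round(confidence(body, head), 2))
```

Support is symmetric in body and head, while confidence is not, which is why the direction of a rule matters.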

45
Transaction No.  Item 1     Item 2     Item 3
100              Beer       Diaper     Chocolate
101              Beer       Chocolate  Shampoo
102              Beer       Wine       Vodka
103              Beer       Cheese     Diaper
104              Ice Cream  Diaper     Beer
Diapers => Beer: support = 3/5, confidence = 3/3 = 100%!
Beer => Diapers: support = 3/5, confidence = 3/5 = 60%.
Shampoo => Chocolate: support = 1/5, confidence = 1/1 = 100%!
A rule has to satisfy a minimum support and a minimum confidence. Both lower-bound parameters are determined by the decision maker.

46 Traditional methods. Traditional methods such as database queries support hypothesis verification about a relationship, such as the co-occurrence of diapers & beer in the transactions of slide 36. They are confirmatory.

47 Data mining: explore the data for patterns. Data mining methods automatically discover significant/interesting association rules from data. They find whatever patterns exist in the database, without the user having to specify in advance what to look for, and therefore allow finding unexpected correlations. They are exploratory.

48 The standard method was developed by Agrawal et al. (1994). The association rules problem was defined as: generate all association rules that have support >= minsup (minimum support) and confidence >= minconf (minimum confidence). The algorithm performs an efficient search over the data to find all such rules. I.e., define N = number of items in a rule.
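To make the problem statement concrete, here is a brute-force sketch that enumerates every candidate rule on the five-transaction table and keeps those meeting both thresholds. It is deliberately not the efficient search the slide refers to (it does no pruning), and `find_rules`, `max_items`, and the threshold values are hypothetical choices for this example.

```python
from itertools import combinations

transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]


def support(itemset):
    """Share of transactions containing every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def find_rules(minsup, minconf, max_items=2):
    """Return (body, head, support, confidence) for all rules passing both bounds.

    Exhaustive enumeration: fine for a toy table, far too slow for real data.
    """
    items = sorted(set().union(*transactions))
    rules = []
    for n in range(2, max_items + 1):              # N = number of items in a rule
        for itemset in combinations(items, n):
            s = support(set(itemset))
            if s < minsup:
                continue
            for k in range(1, n):                  # every way to split into body/head
                for body in combinations(itemset, k):
                    head = tuple(i for i in itemset if i not in body)
                    conf = s / support(set(body))
                    if conf >= minconf:
                        rules.append((body, head, s, conf))
    return rules


rules = find_rules(minsup=0.4, minconf=0.8)
for body, head, s, conf in rules:
    print(body, "=>", head, "support", s, "confidence", conf)
```

With these thresholds only Diaper => Beer survives: the reverse rule has the same support but a confidence of 3/4, below the 0.8 bound.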

49 Example: Association Rules. Client: one of the largest retail corporations in the world. Business goal: better product recommendations. What products are more likely to be purchased together? Are there association patterns in the sales? What is the right data set for your analysis?

50 Example: Association Rules. Individual transaction-level data set TRANSACTIONS (60,000 rows, 4 columns): Transaction ID, Store ID, Product, Quantity.

51 Raw Transaction Data

52 Variable Plot - Product

53 Variable Plot - Store ID

54 Variable Plot - Quantity

55 Q: Which product has the largest quantity within a single transaction? Two-Way Plot

57 Example: Association Rules. Individual transaction-level data set (60,000 rows, 4 columns): Transaction ID, Store ID, Product, Quantity. Which variable(s) do we need? (Do we need them all?) Which variable is the target (outcome) variable? Which variable(s) should define the grain of the analysis?

58 Run Association Rule

59 Results Window

60 Output Report

61 Rule Descriptions: Lift

62 Recap: Association Rule Evaluation Criteria. Consider the rule A => B.
o Support ("co-occurrence"): $P(A,B)$
o Confidence ("conditional occurrence"): $P(B \mid A)$
o Expected confidence: $P(B)$
o Lift: $\frac{P(B \mid A)}{P(B)} = \frac{P(A,B)}{P(A)\,P(B)}$
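As a worked instance of the recap, the sketch below computes lift for the Diaper => Beer rule on the five-transaction table from the earlier slides. The helper names `prob` and `lift` are illustrative only.

```python
transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]


def prob(items):
    """Empirical probability that a transaction contains all of `items`."""
    return sum(items <= basket for basket in transactions) / len(transactions)


def lift(body, head):
    """Confidence P(head | body) divided by expected confidence P(head)."""
    confidence = prob(body | head) / prob(body)
    expected_confidence = prob(head)
    return confidence / expected_confidence


diaper_beer = lift({"Diaper"}, {"Beer"})
print("lift(Diaper => Beer) =", round(diaper_beer, 2))
```

A lift above 1 means the head is bought more often with the body than it is overall; a lift of exactly 1 would mean the two are independent.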

63 Most methods for extracting association rules find too many trivial rules; most are either obvious or uninteresting. Example: If {Maternity Ward} Then {patient is a woman} (confidence 100%, support 100%). We need to screen for rules that are of particular interest and significance, using domain-specific conditions to filter out rules. Example: "interestingness", i.e., various measures of how surprising or unexpected a rule is. A rule is interesting if it contradicts what is currently known (e.g., it contradicts a rule that was previously discovered).

64 Store planning: placing associated items together (milk & bread)? This may reduce total basket value (customers buy less unplanned merchandise). Fraud detection: finding in insurance data that a certain doctor always works with a certain lawyer may indicate potentially fraudulent activity. Is it useful for web site design? Is dissociation important? Rule form: If {A and NOT B} Then {C}, e.g., If {Database and NOT Systems Analysis} Then {Business Intelligence}.

65 Instead of finding associations between items in a single transaction, find associations between items across related transactions over time.
Customer ID  Transaction Date  Item 1                 Item 2
AA           2/2/2001          Laptop                 Case
AA           1/13/2002         Wireless network card  Router
BB           4/5/2002          Laptop                 iPAQ
BB           8/10/2002         Wireless network card  Router
Sequence: {Laptop}, {Wireless Card, Router}. A sequence has to satisfy some predetermined minimum support.

66 A restaurant menu with 100 items has 161,700 combinations of three items. A supermarket with more than 10,000 items in stock has about 50 million combinations of two items and more than 100 billion combinations of three items. Moreover, there are hundreds of millions of transactions.
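The counts on this slide are binomial coefficients, which can be reproduced with the standard library. The sketch assumes exactly 10,000 supermarket items, the lower bound given on the slide.

```python
from math import comb

# Number of unordered item combinations, "n choose k".
menu_triples = comb(100, 3)            # restaurant menu, three items at a time
supermarket_pairs = comb(10_000, 2)    # supermarket, two items at a time
supermarket_triples = comb(10_000, 3)  # supermarket, three items at a time

print(f"{menu_triples:,}")
print(f"{supermarket_pairs:,}")
print(f"{supermarket_triples:,}")
```

The candidate space grows so quickly with the number of items per rule that exhaustive enumeration becomes impractical, which is why efficient search matters.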

67 Are large fries and small fries the same product? Is the brand of ice cream more relevant than its flavor? What about the size, style, pattern, or designer of clothing? The level (definition) of the item hierarchy matters!

68 Market Basket Analysis: the most important part of a business is knowing what merchandise customers are buying, and when. Association Rules: building association rules; how good are association rules? Clustering: grouping similar items; consumer segmentation.

69 Marketing: customer segmentation (discovery of distinct groups of customers) for target marketing. Car insurance: identify customer groups with a high average claim cost. Property management: identify houses in the same city with similar characteristics. (Medical) image analysis: discovering regions of interest (ROI) in brain imaging to detect Alzheimer's disease (AD). Creating document collections, or grouping web pages.

70 Cluster: a group of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. Cluster analysis: arrange objects into useful groups. Objects in each group share properties in common and have different properties from objects in other groups.

71 E.g., (28, $45K), (29, $100K), (55, $45K), (56, $120K), ... Cluster by age? By income? By both? What if we have even more features?

72 Good clusters are compact (instances within a cluster are very close or similar to each other) and separate (instances in different clusters are very different from each other): high within-class similarity, low between-class similarity. But what is similarity?

73 Key: What is similarity? "The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)

74 Feature columns: case, gender, glasses, moustache, smile, hat. Each user is represented as a feature vector.

75 We need a distance measure between different cases (vectors), each described by the features gender, glasses, moustache, smile, and hat. Example of a distance measure: the Euclidean distance. For $X = [x_1, x_2, x_3, x_4, x_5]$ and $Y = [y_1, y_2, y_3, y_4, y_5]$: $D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

76 John: age = 35, income = 95K, no. of credit cards = 3. Rachel: age = 41, income = 215K, no. of credit cards = 2. With $D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$: Distance(John, Rachel) = sqrt[(35-41)^2 + (95K-215K)^2 + (3-2)^2]. The income term dominates the sum, so the features need normalization!
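The sketch below computes the John/Rachel distance from the slide, first on the raw values and then after min-max scaling. The slide does not say which normalization to use; min-max scaling and the feature ranges in `ranges` are assumptions made purely for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# (age, income in dollars, number of credit cards), from the slide.
john = (35, 95_000, 3)
rachel = (41, 215_000, 2)

raw = euclidean(john, rachel)

# Hypothetical (min, max) range per feature; real ranges come from the data.
ranges = [(18, 80), (0, 300_000), (0, 10)]


def min_max(point):
    """Rescale every feature to the interval [0, 1] using the assumed ranges."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))


scaled = euclidean(min_max(john), min_max(rachel))

print("raw distance:   ", round(raw, 2))
print("scaled distance:", round(scaled, 4))
```

On the raw values the distance is essentially the income difference alone; after rescaling, age and credit cards contribute on a comparable footing.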


78 Each cluster is represented by its cluster center, the mean: the "average object" in the cluster. E.g., when clustering customers by (1) age and (2) income, the cluster center is a virtual average customer with the average age and the average income of the customers in the cluster. (Figure axes: Age, Income; the cluster center is marked.)

79 Example: Jane (age = 43, income = 59K), Mark (41, 51K), Rachel (45, 55K). Cluster center (mean): age = (43 + 41 + 45)/3 = 43, income = (51K + 59K + 55K)/3 = 55K, i.e., (43, 55K). Distance is measured from each customer to the center.

80 1. Arbitrarily select K objects from the data (e.g., K customers) to be the initial cluster centers. 2. For each of the remaining objects: assign the object to the cluster whose center it is closest to.

81 Then repeat the following steps until the clusters converge (no change in clusters): 1. Compute the new center of each current cluster.

82 2. Assign each object to the cluster whose center it is closest to. Go back to Step 1, or stop if the centers do not change.
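The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy `points` are all assumptions, and the data is taken to be already normalized.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (tuples of numbers) into k groups; return (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # arbitrarily pick K objects as centers
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster whose center is closest.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:     # no change in clusters: converged
            break
        assignment = new_assignment
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


# Hypothetical, already-normalized customers forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(centers)
print(assignment)
```

Because the starting centers are chosen at random, different seeds can lead to different final clusters on less clear-cut data; running the procedure several times and keeping the best result is a common precaution.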


93 Will k-means always converge to a solution? Yes! Will k-means always find the optimal solution? No! Caution: k-means may converge to a locally optimal solution.

94 Strengths: relatively efficient; simple to implement. Weaknesses: the number of clusters, k, must be specified in advance; it can run into local optima; it does not handle noisy data and outliers well.

95 Conventional data mining considers only hard factual demographic and behavioral data. Attitudinal data are used, at best, to provide a snapshot of the situation at a given point in time, not as a decision tool. We need to combine hard factual data with soft attitudinal data for better decisions.

96 Customers are multi-faceted. Demographics: who are they? Behavior: what, when, and where do they buy? Attitudes: why do they buy, and what do they feel? (Prestige, status, styling, image, ...; satisfaction, loyalty, perceived value, perceived service, perceived quality, ...; motivations, drivers, ...; mood, state of mind, happiness, depression, ...) Only the combination of all three dimensions provides a full picture of the customer. Incorporating attitudinal data gives data mining a more humane focus rather than just an action focus: building customer relationships based also on what customers think and feel, rather than just on price-product relationships.


98 (Model-Free) Data Exploration & Visualization. Unsupervised Learning (Pattern Discovery): Market Basket Analysis, Association Rules, Clustering. Supervised Learning (Predictive Modeling): Decision Tree, Linear Regression, Logistic Regression.

99 Predictive Modeling: Training vs. Validation Data. Why do we need model validation?

100 Why Model Validation? Too few parameters vs. too many parameters.

101 Overfitting Problem

102 Overfitting Problem. Model training should stop here! (marked point on the figure)

103 Prediction Overview. Classification (vs. Regression): Decision Trees; Regression (Linear Regression, Logistic Regression); Naive Bayes; SVM; K-Nearest Neighbor (KNN).


105 An open competition to develop an algorithm that predicts Netflix users' tastes in movies and improves on Netflix's current algorithm by 10%.

106 Anonymized data about 480k Netflix users: 100M ratings for 18k movies, in the form {userid, movie, date, rating}. Participants had to train their models on this data (the training set) and then apply them to some additional data (the test set) to predict users' ratings: {userid, movie, date, ?}. Predictions were submitted to Netflix, which scored performance by root mean squared error (RMSE).
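For reference, RMSE is the square root of the average squared difference between predicted and actual ratings. A minimal sketch follows; the sample ratings are invented and are not competition data.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical star ratings for four (user, movie) pairs.
actual = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]

print(round(rmse(predicted, actual), 3))
```

Because the errors are squared before averaging, a few large misses hurt the score more than many small ones.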

107 Technique for recommendation systems (collaborative filtering): make a recommendation for a given user by using the preferences of other users (hence "collaborative"). The basic intuition: if users A and B share items 3, 4, 5 in common, then (prediction) maybe user A will also like items 6, 7 (or user B will also like items 1, 2). Problem 1: popular items may not be as informative as less popular items. E.g., if everyone likes Lord of the Rings, then knowing that two people like it may not indicate that they share similar preferences. One solution: weight the items users like by the inverse of their popularity. Problem 2: cold start. What about users who haven't rated many items or do not share anything in common with other users? E.g., on Netflix, you have to rate many movies before it will give you recommendations.
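A toy sketch of the intuition on this slide, including the inverse-popularity weighting suggested for Problem 1. Users A and B follow the slide's example (A likes items 1-5, B likes items 3-7); user C, the scoring scheme, and the names `similarity` and `recommend` are hypothetical additions for illustration.

```python
from collections import Counter

likes = {
    "A": {1, 2, 3, 4, 5},
    "B": {3, 4, 5, 6, 7},
    "C": {3, 8},            # hypothetical third user who also likes popular item 3
}

# How many users like each item.
popularity = Counter(item for items in likes.values() for item in items)


def similarity(u, v):
    """Sum of 1/popularity over shared items, so rare shared items count more."""
    return sum(1 / popularity[i] for i in likes[u] & likes[v])


def recommend(user):
    """Rank items the user has not seen by the similarity of the users who like them."""
    scores = Counter()
    for other in likes:
        if other == user:
            continue
        weight = similarity(user, other)
        for item in likes[other] - likes[user]:
            scores[item] += weight
    # Highest score first; ties broken by item id for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))


print(recommend("A"))
print(recommend("B"))
```

User C shares only the popular item 3 with A, so C's item 8 ranks below B's items 6 and 7. A brand-new user with no likes would get an empty list, which is the cold-start problem in miniature.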

108 Other Data Mining Challenges:

109 What is the heart attack risk of a person? Goal: guess the missing values, given the known values.

110 Classification: predicting a discrete value (e.g., is heart attack risk high or low?); the goal is to maximize the proportion of correct predictions. Regression: predicting a real value (e.g., blood pressure, or probabilities); the goal is to minimize mean squared error.

111 Predictive task -> mapping instances onto a predefined set of classes. How does this differ from clustering?

112 Supervised learning (e.g., classification). Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation. New data is classified based on the training set. Unsupervised learning (e.g., clustering). The class labels of the training data are unknown. Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

113 Inputs = predictors = independent variables = attributes. Outputs = responses = dependent variables = class labels = target variables. Models = classifiers. With classification, we want to use a model to predict what output will be obtained from given inputs. How does a classifier work?

114 Input -> Classifier -> Output (class label): class 1, class 2, ..., class k. E.g., are emails spam or not spam? E.g., what genre of movie is this?

115 Step 1: Building the model. Training data (known attribute values, known class labels) -> model training. Step 2: Validating the model. Validation data (known attribute values, known class labels) -> model validation. Step 3: Applying/using the model. New data (known attribute values, unknown class labels) -> output: predicted class label.
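The three steps can be sketched with a hold-out split and a deliberately trivial model. Everything here is an illustrative assumption: the synthetic `rows`, the 30% validation share, and the one-threshold "model" stand in for whatever data and classifier are actually used.

```python
import random


def train_validation_split(rows, validation_fraction=0.3, seed=0):
    """Shuffle the labeled rows and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def accuracy(model, rows):
    """Proportion of rows whose predicted label equals the known label."""
    return sum(model(x) == label for x, label in rows) / len(rows)


def make_model(threshold):
    """A one-parameter classifier: predict default when balance is below threshold."""
    return lambda balance: balance < threshold


# Synthetic labeled data: (balance in $K, defaulted?). Not real credit data.
rows = [(b, b < 50) for b in range(0, 100, 5)]
train, validation = train_validation_split(rows)

# Step 1: build the model on the training data only.
best_threshold = max(range(0, 101, 5), key=lambda t: accuracy(make_model(t), train))
model = make_model(best_threshold)

# Step 2: validate on data the model has not seen.
validation_accuracy = accuracy(model, validation)

# Step 3: apply the model to a new case whose label is unknown.
prediction = model(60)

print(best_threshold, validation_accuracy, prediction)
```

Keeping the validation rows out of Step 1 is what makes the validation accuracy an honest estimate of performance on new data.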

116 The essence of classification is to find rules that separate objects. Rules are like criteria for separation. Sometimes rules are not so obvious. Sometimes there can be more than one rule. How do we find rules?

117 An upside-down tree

118 A series of nested tests. Each node represents a test on one attribute. Tests on a nominal attribute: the number of splits (branches) is the number of possible values. Numeric attributes are discretized. Leaves: a class label assignment (e.g., Default / No default).
Example tree (class = default?):
Employed? (root)
  Yes: Class=No (leaf)
  No: Balance? (node)
    <50K: Class=Yes (leaf)
    >=50K: Age? (node)
      <45: Class=No (leaf)
      >=45: Class=Yes (leaf)

119 Q: Mark has no job, a balance of 60K, and is 40 years old. Default or not?
Employed?
  Yes: No Default
  No: Balance<50K?
    Yes: Default
    No: Age<45?
      Yes: No Default
      No: Default
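Reading the flattened tree on this slide as Employed? -> Balance<50K? -> Age<45? (an interpretation of the slide layout, not something the transcript states outright), the routing of Mark through the tree can be written as nested tests. The function name `predict_default` is made up for this sketch.

```python
def predict_default(employed: bool, balance: float, age: int) -> str:
    """Route one applicant down the example tree and return the leaf label."""
    if employed:                 # root test: Employed?
        return "No Default"
    if balance < 50_000:         # second test: Balance < 50K?
        return "Default"
    if age < 45:                 # third test: Age < 45?
        return "No Default"
    return "Default"


# Mark: no job, balance 60K, 40 years old.
mark = predict_default(employed=False, balance=60_000, age=40)
print(mark)
```

Under this reading Mark fails the first test, passes the balance test on the high side, and lands in the Age<45 leaf.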

120 The example is routed down the tree (the same tree as on slide 119) according to the values of the attributes tested successively. At each node a test is applied to one or more attributes. When a leaf is reached, the example is assigned to a class.

121 Most popular methods: ID3, C4.5, C5.0; CART; CHAID. They share the same structure and differ by: how the tree is grown (splitting criteria), how the tree is pruned, and termination rules.

122 Example tree on the credit data (bad credit = Default, good credit = Not default). Entire sample: 16 bad + 14 good. Balance>=50K: 4 bad + 13 good, split further into Age>=45 (Default: 0, Not default: 10) and Age<45 (Default: 4, Not default: 3). Balance<50K: 12 bad + 1 good, split further into Age>=45 (Default: 4, Not default: 1) and Age<45 (Default: 8, Not default: ).

123 Entire sample: 16 bad, 14 good. Split on Balance: Balance>=50K has 4 bad, 13 good; Balance<50K has 12 bad, 1 good. Split on Age: Age>=45 has 3 bad, 11 good; Age<45 has 13 bad, 3 good. Which attribute to choose?

124 How do you make decisions? Applying for school? Applying for a job? Looking for a girlfriend/boyfriend?

125 Decision Tree Construction. A tree is constructed by recursively partitioning the training examples into purer subgroups. "Pure" means homogeneous: (most) examples at a leaf belong to the same class, which gives higher confidence for prediction! At each decision node, choose the most informative attribute, the one that best partitions the population into increasingly purer subgroups.


127 Purity measures: Gini (population diversity), entropy (information gain), chi-square test.

128 The Gini measure of a node is the sum of the squares of the proportions of the classes. Root node: $0.5^2 + 0.5^2 = 0.5$ (even balance). Each of the two child nodes: $0.1^2 + 0.9^2 = 0.82$ (close to pure). $\text{Gini}(\text{Split}) = \text{WeightedAvg}\;\text{Gini}(\text{all child sets})$. Gini: the higher the better!

129 Entire sample: (14/30)^2 + (16/30)^2 = 0.50.
Split on Balance: Balance>=50K: (4/17)^2 + (13/17)^2 = 0.64; Balance<50K: (12/13)^2 + (1/13)^2 = 0.86; weighted average: (17/30)*0.64 + (13/30)*0.86 = 0.74.
Split on Age: Age>=45: (3/14)^2 + (11/14)^2 = 0.66; Age<45: (13/16)^2 + (3/16)^2 = 0.70; weighted average: (14/30)*0.66 + (16/30)*0.70 ≈ 0.68.
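The Gini arithmetic on this slide can be reproduced from the class counts. The sketch follows the lecture's convention (sum of squared class proportions, higher means purer); the function names and the `(bad, good)` tuple layout are choices made for this example. It works with unrounded values, so results can differ in the second decimal from hand calculations that round each node first.

```python
def gini(counts):
    """Sum of squared class proportions for one node (1.0 means perfectly pure)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def split_gini(children):
    """Weighted average of the children's Gini, weighted by child size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)


# (bad, good) counts from the slides.
root = (16, 14)
balance_split = [(4, 13), (12, 1)]   # Balance>=50K, Balance<50K
age_split = [(3, 11), (13, 3)]       # Age>=45, Age<45

gini_balance = split_gini(balance_split)
gini_age = split_gini(age_split)

print("root:   ", round(gini(root), 3))
print("balance:", round(gini_balance, 3))
print("age:    ", round(gini_age, 3))
```

Splitting on Balance yields the higher weighted Gini, so under this criterion Balance is the more informative first split.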

130 Purity Measures: Information Gain. The most common measure, from information theory, developed by Claude Shannon (1948). Information theory, quantum theory, and relativity theory were the three most influential breakthroughs of the last century.

131 Entropy and Information Gain. Entropy: how mixed/noisy is a set ("uncertainty")? It was originally defined to account for the flow of energy through a thermodynamic process. Assume there are two classes, Pink and Green. When the set of examples S contains p elements of class Pink and g elements of class Green: $E(S) = -\frac{p}{p+g}\log_2\frac{p}{p+g} - \frac{g}{p+g}\log_2\frac{g}{p+g}$. Information gain: the expected reduction in entropy, i.e., how much closer to purity?

132 Information Gain as a Tree Splitting Criterion. Assume that using attribute A, the current set will be partitioned into some number of child sets (e.g., A>1? vs. A<=1?). The information that would be gained by splitting on A: $\text{InformationGain}(\text{Split}) = E(\text{Parent set}) - \text{WeightedAvg}\;E(\text{all child sets})$. This is the variable worth: the higher the better!

133 Entire sample: -(14/30)*log2(14/30) - (16/30)*log2(16/30) = 0.99.
Split on Balance: Balance>=50K: -(4/17)*log2(4/17) - (13/17)*log2(13/17) = 0.79; Balance<50K: -(12/13)*log2(12/13) - (1/13)*log2(1/13) = 0.39; information gain: 0.99 - [(17/30)*0.79 + (13/30)*0.39] ≈ 0.37.
Split on Age: Age>=45: -(3/14)*log2(3/14) - (11/14)*log2(11/14) = 0.75; Age<45: -(13/16)*log2(13/16) - (3/16)*log2(3/16) = 0.70; information gain: 0.99 - [(14/30)*0.75 + (16/30)*0.70] ≈ 0.27.
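The entropy and information-gain arithmetic can be checked from the same class counts. The function names and the `(bad, good)` tuple layout are illustrative; the sketch uses unrounded entropies, so its gains can differ by about 0.01 from hand calculations that round each entropy to two decimals first.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a node given its class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# (bad, good) counts from the slides.
root = (16, 14)
balance_split = [(4, 13), (12, 1)]   # Balance>=50K, Balance<50K
age_split = [(3, 11), (13, 3)]       # Age>=45, Age<45

gain_balance = information_gain(root, balance_split)
gain_age = information_gain(root, age_split)

print("entropy(root):", round(entropy(root), 3))
print("gain(balance):", round(gain_balance, 3))
print("gain(age):    ", round(gain_age, 3))
```

Balance gives the larger information gain, agreeing with the Gini comparison: both criteria pick Balance as the first split for this sample.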

134 Brute-force (exhaustive) search: at each node, examine splits over each of the attributes and select the attribute for which the maximum information gain is obtained (e.g., Balance: <50K vs. >=50K). When to stop the partitioning? Example stopping rules: when maximum purity is obtained (i.e., all examples reaching the node are of the same class); when additional splits obtain no information gain; if all attributes have been used.

135 Common issue: overfitting. When you have few data points, there are many possible splitting rules that perfectly classify the data but will not generalize to future data sets. E.g., the rule "Wears green?" (Yes/No) perfectly classifies the data into Female and Male.

136 Avoiding Overfitting in Classification. The generated tree may overfit the training data: too many branches, some of which may reflect anomalies due to noise or outliers. Symptom: good fit on the training data set, poor fit on the testing data. Two approaches to avoid overfitting. Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees.

137 Decision Tree Classification in a Nutshell. Decision tree: a flow-chart-like tree structure. An internal node denotes a test on an attribute; a branch represents an outcome of the test; leaf nodes represent class labels or class distributions. Decision tree generation consists of two phases. Tree construction: at the start, all the training examples are at the root; partition the examples recursively based on selected attributes. Tree pruning: identify and remove branches that reflect noise or outliers, to avoid overfitting. Use of a decision tree: classification; test the attribute values of the sample against the decision tree.

138 SAS EM Example. A non-profit organization tries to predict who is more likely to respond to a donation promotion (e.g., greeting cards).

139 SAS EM Example. What variables are needed? Target variable (output)? Inputs (features)? Level of observations?

140 SAS EM Example. Data set: PVA97NK.csv, 9,686 observations, 28 variables. Target variable: TargetB (Yes or No).

141 Edit Variables

142 Edit Variables

143 Gift Amount - Distribution


152 Prediction Overview. Classification (vs. Regression): Decision Trees; Regression (Linear Regression, Logistic Regression).


154 Form a group of 4 or 5. Write a summary report about the major machine learning methods, including both supervised and unsupervised learning. Compare the pros and cons of each method. Find a real-world business application of each method you discuss. Page limit: 5-10 pages, in English. Due: last day of class.



More information

Data mining techniques: decision trees

Data mining techniques: decision trees Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Chapter 20: Data Analysis

Chapter 20: Data Analysis Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Foundations of Artificial Intelligence. Introduction to Data Mining

Foundations of Artificial Intelligence. Introduction to Data Mining Foundations of Artificial Intelligence Introduction to Data Mining Objectives Data Mining Introduce a range of data mining techniques used in AI systems including : Neural networks Decision trees Present

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1. 15.7 Analytics and Data Mining 1 M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page 1 15.7 Analytics and Data Mining 15.7 Analytics and Data Mining 1 Section 1.5 noted that advances in computing processing during the past 40 years have

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Role of Social Networking in Marketing using Data Mining

Role of Social Networking in Marketing using Data Mining Role of Social Networking in Marketing using Data Mining Mrs. Saroj Junghare Astt. Professor, Department of Computer Science and Application St. Aloysius College, Jabalpur, Madhya Pradesh, India Abstract:

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining Introduction to Artificial Intelligence G51IAI An Introduction to Data Mining Learning Objectives Introduce a range of data mining techniques used in AI systems including : Neural networks Decision trees

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Paper AA-08-2015. Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Paper AA-08-2015 Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM Delali Agbenyegah, Alliance Data Systems, Columbus, Ohio 0.0 ABSTRACT Traditional

More information

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling MS4424 Data Mining & Modelling Lecturer : Dr Iris Yeung Room No : P7509 Tel No : 2788 8566 Email : [email protected] 1 Aims To introduce the basic concepts of data mining

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing Introduction to Data Mining and Machine Learning Techniques Iza Moise, Evangelos Pournaras, Dirk Helbing Iza Moise, Evangelos Pournaras, Dirk Helbing 1 Overview Main principles of data mining Definition

More information

Decision Trees What Are They?

Decision Trees What Are They? Decision Trees What Are They? Introduction...1 Using Decision Trees with Other Modeling Approaches...5 Why Are Decision Trees So Useful?...8 Level of Measurement... 11 Introduction Decision trees are a

More information

MBA 8473 - Data Mining & Knowledge Discovery

MBA 8473 - Data Mining & Knowledge Discovery MBA 8473 - Data Mining & Knowledge Discovery MBA 8473 1 Learning Objectives 55. Explain what is data mining? 56. Explain two basic types of applications of data mining. 55.1. Compare and contrast various

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Potential Value of Data Mining for Customer Relationship Marketing in the Banking Industry

Potential Value of Data Mining for Customer Relationship Marketing in the Banking Industry Advances in Natural and Applied Sciences, 3(1): 73-78, 2009 ISSN 1995-0772 2009, American Eurasian Network for Scientific Information This is a refereed journal and all articles are professionally screened

More information

White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics

White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics White Paper Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics Contents Self-service data discovery and interactive predictive analytics... 1 What does

More information

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,

More information

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information

Introduction to Learning & Decision Trees

Introduction to Learning & Decision Trees Artificial Intelligence: Representation and Problem Solving 5-38 April 0, 2007 Introduction to Learning & Decision Trees Learning and Decision Trees to learning What is learning? - more than just memorizing

More information

Decision-Tree Learning

Decision-Tree Learning Decision-Tree Learning Introduction ID3 Attribute selection Entropy, Information, Information Gain Gain Ratio C4.5 Decision Trees TDIDT: Top-Down Induction of Decision Trees Numeric Values Missing Values

More information

An Overview and Evaluation of Decision Tree Methodology

An Overview and Evaluation of Decision Tree Methodology An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX [email protected] Carole Jesse Cargill, Inc. Wayzata, MN [email protected]

More information

!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"

!!!#$$%&'()*+$(,%!#$%$&'()*%(+,'-*&./#-$&'(-&(0*.$#-$1(2&.3$'45 !"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

IBM SPSS Direct Marketing 19

IBM SPSS Direct Marketing 19 IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS

More information

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1 Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints

More information

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Fundamentals, robotics, recognition Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,

More information

2015 Workshops for Professors

2015 Workshops for Professors SAS Education Grow with us Offered by the SAS Global Academic Program Supporting teaching, learning and research in higher education 2015 Workshops for Professors 1 Workshops for Professors As the market

More information

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise

More information

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK

How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK How Organisations Are Using Data Mining Techniques To Gain a Competitive Advantage John Spooner SAS UK Agenda Analytics why now? The process around data and text mining Case Studies The Value of Information

More information

Why do statisticians "hate" us?

Why do statisticians hate us? Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data

More information

Data Mining + Business Intelligence. Integration, Design and Implementation

Data Mining + Business Intelligence. Integration, Design and Implementation Data Mining + Business Intelligence Integration, Design and Implementation ABOUT ME Vijay Kotu Data, Business, Technology, Statistics BUSINESS INTELLIGENCE - Result Making data accessible Wider distribution

More information

Data Mining Applications in Fund Raising

Data Mining Applications in Fund Raising Data Mining Applications in Fund Raising Nafisseh Heiat Data mining tools make it possible to apply mathematical models to the historical data to manipulate and discover new information. In this study,

More information

Efficient Integration of Data Mining Techniques in Database Management Systems

Efficient Integration of Data Mining Techniques in Database Management Systems Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France

More information

Data Mining for Fun and Profit

Data Mining for Fun and Profit Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL Paper SA01-2012 Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL ABSTRACT Analysts typically consider combinations

More information

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley)

Attribution. Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley) Machine Learning 1 Attribution Modified from Stuart Russell s slides (Berkeley) Parts of the slides are inspired by Dan Klein s lecture material for CS 188 (Berkeley) 2 Outline Inductive learning Decision

More information

Data Mining with SAS. Mathias Lanner [email protected]. Copyright 2010 SAS Institute Inc. All rights reserved.

Data Mining with SAS. Mathias Lanner mathias.lanner@swe.sas.com. Copyright 2010 SAS Institute Inc. All rights reserved. Data Mining with SAS Mathias Lanner [email protected] Copyright 2010 SAS Institute Inc. All rights reserved. Agenda Data mining Introduction Data mining applications Data mining techniques SEMMA

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT

More information

Easily Identify Your Best Customers

Easily Identify Your Best Customers IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do

More information

Interactive Data Mining and Design of Experiments: the JMP Partition and Custom Design Platforms

Interactive Data Mining and Design of Experiments: the JMP Partition and Custom Design Platforms : the JMP Partition and Custom Design Platforms Marie Gaudard, Ph. D., Philip Ramsey, Ph. D., Mia Stephens, MS North Haven Group March 2006 Table of Contents Abstract... 1 1. Data Mining... 1 1.1. What

More information

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

In this presentation, you will be introduced to data mining and the relationship with meaningful use. In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine

More information

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition Brochure More information from http://www.researchandmarkets.com/reports/2170926/ Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd

More information

Specific Usage of Visual Data Analysis Techniques

Specific Usage of Visual Data Analysis Techniques Specific Usage of Visual Data Analysis Techniques Snezana Savoska 1 and Suzana Loskovska 2 1 Faculty of Administration and Management of Information systems, Partizanska bb, 7000, Bitola, Republic of Macedonia

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains

More information

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d. EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

More information

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms Data Mining Techniques forcrm Data Mining The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets. Extremely large datasets Discovery of the non-obvious Useful knowledge

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Data Mining with SQL Server Data Tools

Data Mining with SQL Server Data Tools Data Mining with SQL Server Data Tools Data mining tasks include classification (directed/supervised) models as well as (undirected/unsupervised) models of association analysis and clustering. 1 Data Mining

More information

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA An Overview of SAS Enterprise Miner The following article is in regards to Enterprise Miner v.4.3 that is available in SAS v9.1.3.

More information

Framing Business Problems as Data Mining Problems

Framing Business Problems as Data Mining Problems Framing Business Problems as Data Mining Problems Asoka Diggs Data Scientist, Intel IT January 21, 2016 Legal Notices This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS

More information

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010. Title Introduction to Data Mining Dr Arulsivanathan Naidoo Statistics South Africa OECD Conference Cape Town 8-10 December 2010 1 Outline Introduction Statistics vs Knowledge Discovery Predictive Modeling

More information

Pentaho Data Mining Last Modified on January 22, 2007

Pentaho Data Mining Last Modified on January 22, 2007 Pentaho Data Mining Copyright 2007 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at www.pentaho.org

More information

How to Get More Value from Your Survey Data

How to Get More Value from Your Survey Data Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2

More information

Data Mining Techniques Chapter 9: Market Basket Analysis and Association Rules

Data Mining Techniques Chapter 9: Market Basket Analysis and Association Rules Data Mining Techniques Chapter 9: Market Basket Analysis and Association Rules Market basket analysis.................................................... 2 Market basket data I.....................................................

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Data Mining: An Introduction

Data Mining: An Introduction Data Mining: An Introduction Michael J. A. Berry and Gordon A. Linoff. Data Mining Techniques for Marketing, Sales and Customer Support, 2nd Edition, 2004 Data mining What promotions should be targeted

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Advanced Ensemble Strategies for Polynomial Models

Advanced Ensemble Strategies for Polynomial Models Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer

More information