Day 2: Machine Learning & Data Mining Basics Beibei Li Carnegie Mellon University 1
Getting a Job? It's not just what you know, but who you know. What exactly matters? Size? Quality (strength of tie)? How strong or weak is a social network contact? Frequency of contact, amount of time spent, emotional intensity, intimacy (mutual confiding). Who is more helpful: a close friend or an acquaintance?
Strength of Weak Ties 3
Why? The stronger the tie between two individuals, the larger the proportion of people to whom they are both tied: strong ties are therefore less likely to expand your exposure to new information. Weak ties are usually bridges - lines in a network that provide the only path between two points. Weak ties fill the structural holes in the network! Relevant for job hunting, innovation (new product adoption), rumor spreading, ...
not only what you know, but also who you know 5
What is BI and why is it important? BI cases. BI and tools.
Manager of Hilton: all the bookings this week. Unbalanced demand! Weekends (beach) vs. weekdays (highway). Traveler demographics - business vs. family/romance. Need a better pricing strategy! Optimal pricing to maximize expected profits (personalization: traveler, date, room type, ...). Expected profits for next week/month? Will a similar pattern appear next week?
A List of Leading BI Vendors 8
Typical architecture of a current BI tool - SQL Server 2008: Integrate (Data Warehousing / ETL / EAI) -> Report (OLAP/BI) -> Analyze (Data Mining) -> Decision (BPM/Action). BPM: business performance management (http://en.wikipedia.org/wiki/business_performance_management). OLAP: online analytical processing (http://en.wikipedia.org/wiki/online_analytical_processing). ETL: extract, transform, load. EAI: enterprise application integration. http://www.microsoft.com/sql/solutions/bi/default.mspx
SAS's BI Architecture
The BI market is trending toward analytics 11
SAS Enterprise Miner. Why? Easily handles vast amounts of data; intuitive interface; no need for SAS programming.
SAS EM Interface
SAS Business Analytics Suites SAS EnterpriseMiner Documents: http://support.sas.com/documentation/onlinedoc/miner/ SAS EnterpriseMiner link: http://www.sas.com/technologies/analytics/datamining/miner/ SAS Social media analytics (with a demo) http://www.sas.com/software/customer-intelligence/socialmedia-analytics/ SAS BI dashboard demo: http://www.sas.com/technologies/bi/entbiserver/index.html SAS analytics with a demo http://www.sas.com/technologies/analytics/index.html And more. 14
SAS EM Analytic Strengths Unsupervised Learning Supervised Learning
Day 1: BI & DA Overview, Business Cases - Individual Assignment Day 2: Machine Learning & Data Mining Basics - Group Assignment Day 3: Predictive Modeling vs. Causal Inferences - How to Interpret Regression Results - Causal Identification Strategies; - Economic Value of Online Word-of-Mouth; - Social Network Influence; - Multichannel Advertising Attribution; - Randomized Field Experiment of Mobile Recommendation. Day 4: Bridging Machine Learning with Social Science: - Case 1: Interplay Between Social Media & Search Engine; - Case 2: Understand and Predict Consumer Search and Purchase Behavior; - Case 3: Text Mining & Sponsored Search Advertising. 16
(Model-Free) Data Exploration & Visualization Unsupervised Learning (Pattern Discovery) (Market Basket Analysis, Association Rule, Clustering) Supervised Learning (Predictive Modeling) (Decision Tree, Linear Regression, Logistic Regression) 17
Why? What about using summary statistics tables? Several datasets (Anscombe's quartet is the classic example) can share the same average for X, the same variance for X, the same average for Y, the same variance for Y, the same correlation between X and Y, and even the same fitted regression line (y = 3 + x/2) - yet look completely different when plotted.
Your brain can efficiently process properly visualized data. 19
An approach to analyzing data sets to summarize their main characteristics in an easy-to-understand form, often with visual graphs, without using a statistical model or having formulated a hypothesis. Sometimes we call this model-free evidence. It helps to formulate hypotheses that can be tested on new data sets. What if we still used only summary statistics tables?
SEMMA Variable Plot
Histogram: shows the entire distribution of one particular variable. Each column's height is determined by the count of items that fall into the bin. Bin size is a parameter you can play with: wider bins are smoother, while narrower bins can yield erratic plots.
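A minimal matplotlib sketch of the bin-size trade-off, using made-up data (the variable name and values are hypothetical, not from the course data):

```python
# Minimal sketch: the same data plotted with wide vs. narrow bins.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ages = rng.normal(loc=40, scale=12, size=500)   # hypothetical customer ages

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(ages, bins=8)       # wide bins: smoother shape
axes[0].set_title("8 bins")
axes[1].hist(ages, bins=80)      # narrow bins: more erratic
axes[1].set_title("80 bins")
plt.show()
```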
Box plot: displays differences between subpopulations in your data. The furthest lines (whiskers) are the min/max. The box shows the 25th to 75th percentiles. The thick line shows the 50th percentile (the median).
Scatter plot: suggests correlation between two variables. Correlations may be positive (rising), negative (falling), or null (uncorrelated). A line of best fit (also called a 'trendline') can be drawn. Scatter plots can also reveal nonlinear relationships between variables.
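A minimal sketch of a scatter plot with a fitted trendline, again on made-up data (assuming NumPy and matplotlib are available):

```python
# Minimal sketch: scatter plot with a line of best fit (hypothetical data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3 + 0.5 * x + rng.normal(0, 1, 100)   # roughly linear, positive correlation

slope, intercept = np.polyfit(x, y, 1)    # degree-1 fit: trendline coefficients
plt.scatter(x, y, s=10)
plt.plot(np.sort(x), intercept + slope * np.sort(x), color="red")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```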
Shows individual components as well as cumulative total. 28
Shows a variable over time. Allows comparison between different variables. Can show trends or time-relationships between variables. 29
(Model-Free) Data Exploration & Visualization Unsupervised Learning (Pattern Discovery) (Market Basket Analysis, Association Rule, Clustering) Supervised Learning (Predictive Modeling) (Decision Tree, Linear Regression, Logistic Regression) 31
Market Basket Analysis - the most important question for a business: what merchandise are customers buying, and when? Association Rules: building association rules; how good are association rules? Clustering: grouping similar items; consumer segmentation.
Product Bundling. Why does Expedia bundle flights, car rentals, and hotels? Why do satellite TV providers start offering Internet connections? Xfinity: DSL, cable TV, phone service, wireless. Why does MS Office contain Word, Excel, PowerPoint, and Access? Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one of three types of candy bars [Forbes, Sept 8, 1997]. Customers who purchase maintenance agreements are very likely to purchase large appliances (Berry and Linoff's experience). Hospital visitors ask for more shopping outlets (HBR, May 2009, p. 21).
Customers tend to buy things together What can we learn from the basket? 35
Transaction No. | Item 1 | Item 2 | Item 3 | ... | Item N
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
Trans No. | Item 1 | Item 2 | Item 3 | Day | Time | Customer Info
100 | Beer | Diaper | Chocolate | Fri | 6:15pm | Male, 30, ...
101 | Milk | Chocolate | Shampoo | Sun | 10:10am | Female, 25, ...
102 | Beer | Wine | Vodka | Sat | 5:30pm | Male, 24, ...
103 | Beer | Cheese | Diaper | Fri | 6:30pm | Male, 32, ...
104 | Ice Cream | Diaper | Beer | Fri | 7:00pm | Male, 28, ...
Beer and diaper purchases are mainly made by men, Friday evenings, 6pm-7pm. Put the premium beer display next to the diapers (up-sell/cross-sell). Beer sales skyrocketed!
Basket data: a collection of transactions, each consisting of the set of items bought in that transaction. Association rules with basket data: learn which items are frequently bought together. Purpose? Product assortment for a supermarket, bundling, customer segmentation based on buying behavior, cross-selling/recommendation, catalog design, web site design, etc.
Learning Association Rules from Data: a descriptive approach for discovering interesting associations between items in a data set. E.g., If {buy diapers} Then {buy beer}; If {made an order last year & age > 25} Then {apply for VIP card}.
Rule format: If {set of conditions} Then {set of results}. The If part is the Body/LHS (condition); the Then part is the Head/RHS (result). E.g., If {Diapers} Then {Beer}: Body (condition) implies Head (result), where body and head are conjunctions of items. The direction of the rule matters!
What rules should be considered valid? E.g., If {Diapers} Then {Beer} vs. If {Diapers} Then {Ice Cream}, given the transactions:
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Milk | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
Two basic evaluation measures: Support and Confidence of the rule.
Support is used to measure the relevance of a rule: the frequency of transactions in which body and head co-occur. Support = (no. of transactions containing the items in both body and head) / (total no. of transactions in the data set). E.g., for the rule If {Diapers} Then {Beer}, support is 3/5: 60% of the transactions include both items. For If {Diapers} Then {Ice Cream}, support is only 1/5.
Confidence is used to measure the strength of a rule: the proportion of transactions containing the head, conditional on containing the body. Confidence = (no. of transactions containing both body and head) / (no. of transactions containing the body). E.g., the confidence of If {Diapers} Then {Beer} is 3/3: in 100% of the transactions in which diapers are bought, beer is also bought. For If {Diapers} Then {Ice Cream}, confidence is 1/3; for If {Beer} Then {Diapers}, confidence is 3/4.
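A minimal sketch (standard-library Python only) that recomputes these support and confidence numbers on the five toy transactions above, and also reports lift, which is recapped later:

```python
# Minimal sketch: support, confidence, and lift for a candidate rule
# on the toy transactions from the slides.
transactions = [
    {"Beer", "Diaper", "Chocolate"},     # 100
    {"Milk", "Chocolate", "Shampoo"},    # 101
    {"Beer", "Wine", "Vodka"},           # 102
    {"Beer", "Cheese", "Diaper"},        # 103
    {"Ice Cream", "Diaper", "Beer"},     # 104
]

def evaluate_rule(body, head, transactions):
    n = len(transactions)
    n_body = sum(1 for t in transactions if body <= t)
    n_both = sum(1 for t in transactions if (body | head) <= t)
    n_head = sum(1 for t in transactions if head <= t)
    support = n_both / n
    confidence = n_both / n_body if n_body else 0.0
    lift = confidence / (n_head / n) if n_head else 0.0
    return support, confidence, lift

print(evaluate_rule({"Diaper"}, {"Beer"}, transactions))       # (0.6, 1.0, 1.25)
print(evaluate_rule({"Diaper"}, {"Ice Cream"}, transactions))  # (0.2, 0.33..., 1.66...)
```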
Transaction No. | Item 1 | Item 2 | Item 3
100 | Beer | Diaper | Chocolate
101 | Beer | Chocolate | Shampoo
102 | Beer | Wine | Vodka
103 | Beer | Cheese | Diaper
104 | Ice Cream | Diaper | Beer
{Diapers} => {Beer}: support = 3/5, confidence = 3/3 = 100%!
{Beer} => {Diapers}: support = 3/5, confidence = 3/5 = 60%
{Shampoo} => {Chocolate}: confidence = 1/1 = 100%, but support = only 1/5
A rule has to satisfy a minimum support and a minimum confidence. Both lower-bound parameters are set by the decision maker.
Traditional methods, such as database queries, support hypothesis verification about a relationship, e.g., the co-occurrence of diapers & beer in the transactions above. Confirmatory.
Data Mining: explore the data for patterns. Data mining methods automatically discover significant/interesting association rules from the data - finding whatever patterns exist in the database, without the user having to specify in advance what to look for. This allows finding unexpected correlations. Exploratory.
The standard method (the Apriori algorithm) was developed by Agrawal et al. (1994). http://citeseer.ist.psu.edu/mostcited.html http://rakesh.agrawal-family.com/ The Association Rules problem was defined as: generate all association rules that have support >= minsup (minimum support) and confidence >= minconf (minimum confidence). The algorithm performs an efficient search over the data to find all such rules. (Define N = # of items in a rule.)
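The real Apriori algorithm prunes the search space using the frequent-itemset property; the sketch below is only a brute-force stand-in that enumerates small rules on the toy basket and applies the minsup/minconf thresholds, so their role is visible:

```python
# Brute-force sketch (NOT Apriori): enumerate all rules with a single-item
# head that meet minimum support and confidence thresholds.
from itertools import combinations

transactions = [
    {"Beer", "Diaper", "Chocolate"},
    {"Milk", "Chocolate", "Shampoo"},
    {"Beer", "Wine", "Vodka"},
    {"Beer", "Cheese", "Diaper"},
    {"Ice Cream", "Diaper", "Beer"},
]
items = set().union(*transactions)
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

minsup, minconf = 0.4, 0.8
for size in (1, 2):                              # bodies of one or two items
    for body in combinations(items, size):
        body = frozenset(body)
        for head_item in items - body:
            sup = support(body | {head_item})
            if sup < minsup:
                continue
            conf = sup / support(body)
            if conf >= minconf:
                print(f"{set(body)} => {head_item}  support={sup:.2f}  confidence={conf:.2f}")
# With these thresholds, only {'Diaper'} => Beer survives on this toy data.
```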
Example: Association Rules. Client: one of the largest retail corporations in the world. Business goal - better product recommendations: What products are more likely to be purchased together? What association patterns are in the sales? What is the right data set for your analysis?
Example: Association Rules Individual Transaction Level Data set TRANSACTIONS (60,000 Rows, 4 Columns): Transaction ID Store ID Product Quantity 52
Raw Transaction Data 53
Variable Plot - Product 54
Variable Plot Store ID 55
Variable Plot - Quantity 56
Q: Which Product Has the Largest Quantity within a Single Transaction? Two-Way Plot 57
Q: Which Product Has the Largest Quantity within a Single Transaction? Two-Way Plot 58
Example: Association Rules. Individual transaction-level data set (60,000 rows, 4 columns): Transaction ID, Store ID, Product, Quantity. Which variable(s) do we need? (Do we need them all?) Which variable is the target (outcome) variable? Which variable(s) should define the grain of the analysis?
Run Association Rule
Results Window 61
Output Report 62
Rule Descriptions Lift 63
Recap: Association Rule Evaluation Criteria. Consider the rule A => B: Support ("co-occurrence") = P(A,B); Confidence ("conditional occurrence") = P(B|A); Expected Confidence = P(B); Lift = P(B|A) / P(B) = P(A,B) / (P(A) P(B)).
Most methods for extracting association rules find too many trivial rules; many are obvious or uninteresting. Example: If {Maternity Ward} Then {patient is a woman} - confidence 100%, support 100%. We need to screen for rules that are of particular interest and significance. Use domain-specific conditions to filter out rules. "Interestingness": various measures of how surprising or unexpected a rule is. Example: a rule is interesting if it contradicts what is currently known (e.g., it contradicts a rule that was previously discovered).
Store planning: placing associated items together (milk & bread)? This may reduce basket total value (customers buy less unplanned merchandise). Fraud detection: finding in insurance data that a certain doctor always works with a certain lawyer may indicate potential fraudulent activity. Is it useful for web site design? Is dissociation important? Rules of the form If {A and NOT B} Then {C}, e.g., If {Database} and NOT {Systems Analysis} Then {Business Intelligence}.
Sequence analysis: instead of finding associations between items within a single transaction, find associations between items across related transactions over time.
Customer ID | Transaction Date | Item 1 | Item 2
AA | 2/2/2001 | Laptop | Case
AA | 1/13/2002 | Wireless network card | Router
BB | 4/5/2002 | Laptop | iPAQ
BB | 8/10/2002 | Wireless network card | Router
Sequence: {Laptop}, then {Wireless network card, Router}. A sequence has to satisfy some predetermined minimum support.
A restaurant menu with 100 items has 161,700 combinations of three items. A supermarket has >10,000 items in stock: about 50 million combinations of two items and well over 100 billion combinations of three items. Moreover, there are hundreds of millions of transactions.
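A quick standard-library check of these counts:

```python
# Quick check of the combinatorial explosion mentioned above.
from math import comb

print(comb(100, 3))      # 161700 three-item combinations on a 100-item menu
print(comb(10_000, 2))   # 49,995,000: ~50 million two-item combinations
print(comb(10_000, 3))   # ~166.6 billion: well over 100 billion three-item combinations
```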
Are large fries and small fries the same product? Is the brand of ice cream more relevant than the flavor? What about the size, style, pattern, or designer of clothing? The level (definition) of the product hierarchy matters!
Market Basket Analysis - the most important question for a business: what merchandise are customers buying, and when? Association Rules: building association rules; how good are association rules? Clustering: grouping similar items; consumer segmentation.
Marketing: customer segmentation (discovery of distinct groups of customers) for target marketing. Car insurance: identify customer groups with high average claim cost. Property management: identify houses in the same city with similar characteristics. (Medical) image analysis: discover regions of interest (ROI) in brain imaging to detect Alzheimer's disease (AD). Creating document collections, or grouping web pages.
Cluster: a group of data objects Similar to one another within the same cluster Dissimilar to objects in other clusters Cluster Analysis: Arrange objects into useful groups. Objects in each group share properties in common and have different properties from objects in other groups. 72
e.g., customers (28, $45K), (29, $100K), (55, $45K), (56, $120K), ... Cluster by age? Income? Both? What if we have even more features?
Good clusters are compact (instances within a cluster are very close or similar to each other) and separate (instances in different clusters are very different from each other): high within-class similarity, low between-class similarity. But what is similarity?
Key: What is Similarity? The quality or state of being similar; likeness; resemblance; as, a similarity of features. -- Webster's Dictionary 75
case | gender | glasses | moustache | smile | hat
1 | 0 | 1 | 0 | 1 | 0
2 | 1 | 0 | 0 | 1 | 0
3 | 0 | 1 | 0 | 0 | 0
4 | 0 | 0 | 0 | 0 | 0
5 | 0 | 0 | 0 | 1 | 0
6 | 0 | 0 | 1 | 0 | 1
7 | 0 | 1 | 0 | 1 | 0
8 | 0 | 0 | 0 | 1 | 0
9 | 0 | 1 | 1 | 1 | 0
10 | 1 | 0 | 0 | 0 | 0
11 | 0 | 0 | 1 | 0 | 0
12 | 1 | 0 | 0 | 0 | 0
Each user is represented as a feature vector.
We need a distance measure between cases (vectors), e.g., between cases 1 and 2 above. One example is the Euclidean distance: for X = [x1, x2, ..., xn] and Y = [y1, y2, ..., yn], D(X, Y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 ).
John: Age = 35, Income = 95K, no. of credit cards = 3. Rachel: Age = 41, Income = 215K, no. of credit cards = 2. Distance(John, Rachel) = sqrt[ (35-41)^2 + (95,000-215,000)^2 + (3-2)^2 ]. The income term dominates everything else - we need normalization!
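A minimal NumPy sketch of this point; the third customer is hypothetical, added only so that per-column z-scores are meaningful:

```python
# Euclidean distance before and after z-score normalization.
# Without scaling, the income column (in dollars) dominates age and card count.
import numpy as np

X = np.array([
    [35.0,  95_000.0, 3.0],   # John:  age, income, no. of credit cards
    [41.0, 215_000.0, 2.0],   # Rachel
    [38.0, 100_000.0, 5.0],   # hypothetical third customer, just for scale
])

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean(X[0], X[1]))              # ~120,000: driven almost entirely by income

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score each column
print(euclidean(Z[0], Z[1]))              # now all three attributes contribute
```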
Each cluster is represented by its cluster center (mean). Cluster center: the average object in the cluster. E.g., when clustering customers on (1) age and (2) income, the cluster center is a virtual "average customer" with the average age and average income of the customers in that cluster.
Example: Jane (Age = 43, Income = 59K), Mark (41, 51K), Rachel (45, 55K). The cluster center (mean) is Age = (41 + 43 + 45)/3 = 43, Income = (51K + 59K + 55K)/3 = 55K, i.e., (43, 55K).
K-means, step 1: arbitrarily select K objects from the data (e.g., K customers) to be the initial cluster centers. Step 2: for each of the remaining objects, assign the object to the cluster whose center it is closest to.
Then repeat the following steps until the clusters converge (no change in clusters): 1. Compute the new center of each current cluster.
2. Assign each object to the cluster whose center it is closest to. 3. Go back to Step 1, or stop if the centers do not change.
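A minimal NumPy sketch of these steps (k = 2, made-up 2-D points, no handling of empty clusters):

```python
# Minimal k-means sketch following the steps above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 1: pick k objects
    for _ in range(n_iter):
        # step 2: assign each object to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute centers; stop when they no longer change
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1, 1], [1.5, 2], [1, 0.5], [8, 8], [9, 9], [8.5, 7.5]])
labels, centers = kmeans(X, k=2)
print(labels)    # two well-separated groups
print(centers)
```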
Will k-means always converge to a solution? Yes! Will k-means always find the optimal solution? No! Caution: k-means may converge to a locally optimal solution.
Strengths? Relatively efficient; simple to implement. Weaknesses? Need to specify k, the number of clusters, in advance; can get stuck in local optima; does not handle noisy data and outliers well.
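In practice, libraries such as scikit-learn mitigate the local-optimum issue by rerunning k-means from several random starts (the n_init parameter) and keeping the best run. A minimal sketch, assuming scikit-learn is installed, using the toy customers from the earlier slide:

```python
# k-means with multiple restarts via scikit-learn (toy age/income data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[28, 45], [29, 100], [55, 45], [56, 120]], dtype=float)  # age, income (K)
X = (X - X.mean(axis=0)) / X.std(axis=0)       # normalize before clustering

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each customer
print(km.cluster_centers_)  # centers in normalized feature space
```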
Conventional data mining considers only "hard", factual demographic and behavioral data. Attitudinal data is used, at best, to provide a snapshot of the situation at a given point in time, not as a decision tool. We need to combine hard factual data with soft attitudinal data for better decisions.
Customers are multi-faceted. Demographics: who are they? Behavior: what, when, and where do they buy? Attitudes: why do they buy, and what do they feel (prestige, status, styling, image; satisfaction, loyalty, perceived value, perceived service, perceived quality; motivations, drivers; mood, state of mind, happiness, depression)? Only the combination of all three dimensions provides a full picture of the customer. Incorporating attitudinal data gives data mining a more "humane" focus rather than just an action focus: building customer relationships based also on what customers think and feel, rather than only through price-product relationships.
Market Basket Analysis - the most important question for a business: what merchandise are customers buying, and when? Association Rules: building association rules; how good are association rules? Clustering: grouping similar items; consumer segmentation.
(Model-Free) Data Exploration & Visualization Unsupervised Learning (Pattern Discovery) (Market Basket Analysis, Association Rule, Clustering) Supervised Learning (Predictive Modeling) (Decision Tree, Linear Regression, Logistic Regression)
Predictive Modeling: Training vs. Validation Data. Why do we need model validation?
Why Model Validation? Too few parameters (underfitting) vs. too many parameters (overfitting).
Overfitting Problem
Overfitting Problem: model training should stop here!
Prediction Overview: Classification (vs. Regression); Decision Trees; Regression: Linear Regression, Logistic Regression; Naive Bayes; SVM; K-Nearest Neighbor (KNN).
An open competition to develop an algorithm to predict Netflix users' tastes in movies and improve Netflix's current recommendation algorithm by 10%.
Anonymized data about 480K Netflix users: 100M ratings for 18K movies, in the form {userid, movie, date, rating}. Participants had to train their models on this data (training set) and then apply them to additional data (test set) to predict users' ratings: {userid, movie, date, ?}. Predictions were submitted to Netflix, which scored performance based on root mean squared error (RMSE).
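A quick sketch of the scoring metric on toy numbers (not Netflix data):

```python
# Root mean squared error between predicted and actual ratings.
import numpy as np

actual    = np.array([4, 3, 5, 2, 4])
predicted = np.array([3.8, 3.4, 4.5, 2.5, 3.9])
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)   # ~0.38: lower is better
```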
Collaborative filtering: a technique for recommendation systems. Make a recommendation for a given user by using the preferences of other users (hence "collaborative"). The basic intuition: if users A and B share items 3, 4, 5 in common, then (prediction) maybe user A will also like items 6 and 7 (or user B will also like items 1 and 2). Problem 1: popular items may not be as informative as less popular items. E.g., if everyone likes Lord of the Rings, then knowing two people like it may not indicate that they share similar preferences. One solution: weight the items users like by the inverse of their popularity (sketched below). Problem 2: cold start. What about users who haven't rated many items or do not share anything in common with other users? E.g., in Netflix, you have to rate many movies before it will give you recommendations.
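A toy sketch of user-based collaborative filtering with inverse-popularity weighting; all user and item names are made up for illustration, and this is not Netflix's actual method:

```python
# Toy user-based collaborative filtering with inverse-popularity weighting.
from collections import defaultdict
from math import log

likes = {
    "A": {"item3", "item4", "item5", "item6", "item7"},
    "B": {"item1", "item2", "item3", "item4", "item5"},
    "C": {"item1", "item7"},
}

n_users = len(likes)
popularity = defaultdict(int)
for items in likes.values():
    for item in items:
        popularity[item] += 1

def weight(item):
    # rarer items carry more information about shared taste
    return log(n_users / popularity[item]) + 1.0

def similarity(u, v):
    return sum(weight(i) for i in likes[u] & likes[v])

def recommend(user):
    scores = defaultdict(float)
    for other in likes:
        if other == user:
            continue
        sim = similarity(user, other)
        for item in likes[other] - likes[user]:
            scores[item] += sim
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("A"))   # items liked by similar users that A has not seen yet
```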
Other Data Mining Challenges: http://www.kaggle.com/competitions
What is the heart attack risk of a person? Goal: guess the missing values, given the known values.
Classification: if we are predicting a discrete value (e.g., heart attack risk high or low?), maximize the proportion of correct predictions. Regression: if we are predicting a real value (e.g., blood pressure, or a probability), minimize the mean squared error.
Predictive task -> mapping instances onto a predefined set of classes. Difference from clustering?
Supervised learning (e.g., classification): Supervision means the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation; new data are classified based on the training set. Unsupervised learning (e.g., clustering): the class labels of the training data are unknown; given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Inputs = Predictors = Independent Variables = Attributes. Outputs = Responses = Dependent Variables = Class Labels = Target Variables. Models = Classifiers. With classification, we want to use a model to predict what output will be obtained from given inputs. How does a classifier work?
A classifier maps an input to an output (class label): class 1, class 2, ..., class k. E.g., is an email spam or not spam? What genre of movie is this?
Step 1: Building the model - training data with known attribute values and known class labels; train the model. Step 2: Validating the model - validation data with known attribute values and known class labels; check the model's predictions. Step 3: Applying/using the model - new data with known attribute values but unknown class labels; output the predicted class label.
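A minimal sketch of this three-step workflow, assuming scikit-learn and using made-up data (a decision tree classifier is just a convenient stand-in here):

```python
# Train on labeled data, validate on held-out labeled data, apply to new data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # known attribute values
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # known class labels

# Steps 1 & 2: split the labeled data, train on one part, validate on the other.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(accuracy_score(y_valid, model.predict(X_valid)))

# Step 3: apply the validated model to new data whose labels are unknown.
X_new = rng.normal(size=(5, 5))
print(model.predict(X_new))
```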
The essence of classification is to find rules that separate objects. Rules are like the criteria for separation. Sometimes rules are not so obvious; sometimes there can be more than one rule. How do we find rules?
A decision tree - an upside-down tree.
A decision tree is a series of nested tests. Each node represents a test on one attribute. For tests on a nominal attribute, the number of splits (branches) is the number of possible values; numeric attributes are discretized. Leaves carry a class label assignment (e.g., Default / No Default). Example tree: the root tests Employed (Yes -> No Default); if No, test Balance (< 50K -> Default; >= 50K -> test Age: < 45 -> No Default, >= 45 -> Default).
Q: Mark - no job, balance 60K, 40 years old. Default or not? Follow the tree: Employed? No -> Balance < 50K? No (60K) -> Age < 45? Yes -> No Default.
The example is routed down the tree according to the values of the attributes tested successively. At each node a test is applied to one or more attributes. When a leaf is reached, the example is assigned to that leaf's class.
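Written as code, the toy credit tree above is just nested tests (function and argument names are illustrative):

```python
# The toy credit tree from the slides, expressed as nested if/else tests.
def predict_default(employed: bool, balance: float, age: int) -> str:
    if employed:
        return "No Default"
    if balance < 50_000:
        return "Default"
    return "No Default" if age < 45 else "Default"

# Mark: no job, balance 60K, 40 years old
print(predict_default(employed=False, balance=60_000, age=40))   # "No Default"
```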
Most popular methods: ID3, C4.5, C5.0; CART; CHAID. They share the same structure but differ in how they grow the tree (splitting criteria), how they prune the tree, and their termination rules.
Example: entire sample = 16 bad + 14 good (bad credit = Default, good credit = Not Default). Split on Balance: Balance >= 50K -> 4 bad + 13 good; Balance < 50K -> 12 bad + 1 good. Splitting the Balance >= 50K branch on Age: Age < 45 -> Default: 0, Not Default: 10; Age >= 45 -> Default: 4, Not Default: 3. Splitting the Balance < 50K branch on Age: Age >= 45 -> Default: 4, Not Default: 1; Age < 45 -> Default: 8, Not Default: 0.
Starting from the entire sample (16 bad + 14 good), there are two candidate splits. Split on Balance: Balance >= 50K -> 4 bad + 13 good; Balance < 50K -> 12 bad + 1 good. Split on Age: Age >= 45 -> 3 bad + 11 good; Age < 45 -> 13 bad + 3 good. Which attribute to choose?
How do you make decisions? Apply for school? Apply for a job? Look for a girlfriend/boyfriend?
Decision Tree Construction: a tree is constructed by recursively partitioning the training examples into purer subgroups. "Pure" means homogeneous: (most) examples at a leaf belong to the same class, giving higher confidence for prediction. At each decision node, choose the most informative attribute - the one that best partitions the population into increasingly pure subgroups.
Recall the two candidate splits of the 16 bad + 14 good sample: Balance (>= 50K: 4 bad + 13 good; < 50K: 12 bad + 1 good) vs. Age (>= 45: 3 bad + 11 good; < 45: 13 bad + 3 good).
Purity Measures: Gini (population diversity); Entropy (information gain); Chi-square test.
The Gini measure of a node is the sum of the squares of the proportions of the classes. E.g., 0.5^2 + 0.5^2 = 0.5 (even balance); 0.1^2 + 0.9^2 = 0.82 (close to pure). Gini(Split) = weighted average of Gini over all child sets. Gini - the higher the better!
Split on Balance: Balance >= 50K: (4/17)^2 + (13/17)^2 = 0.64; Balance < 50K: (12/13)^2 + (1/13)^2 = 0.86; Gini(Split) = (17/30)*0.64 + (13/30)*0.86 = 0.74. Split on Age: Age >= 45: (3/14)^2 + (11/14)^2 = 0.66; Age < 45: (13/16)^2 + (3/16)^2 = 0.70; Gini(Split) = (14/30)*0.66 + (16/30)*0.70 = 0.68. Root node: (14/30)^2 + (16/30)^2 = 0.50. The Balance split has the higher Gini, so choose Balance.
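A short sketch that reproduces these Gini numbers (using the slide's convention, where higher means purer); each child set is written as (n_bad, n_good):

```python
# Reproducing the Gini numbers for the Balance vs. Age splits.
def gini(counts):
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)   # higher = purer (slide convention)

def gini_split(children):
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

print(gini((16, 14)))                      # root node: ~0.50
print(gini_split([(4, 13), (12, 1)]))      # split on Balance: ~0.74
print(gini_split([(3, 11), (13, 3)]))      # split on Age:     ~0.68  -> choose Balance
```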
Purity Measures: Information Gain - the most common, rooted in information theory, developed by Claude Shannon (1948). Information theory, quantum theory, and relativity were among the most influential breakthroughs of the last century.
Entropy and Information Gain. Entropy: how mixed/noisy is a set? ("uncertainty") - originally defined to account for the flow of energy through a thermodynamic process. Assume there are two classes, Pink and Green, and the set of examples S contains p elements of class Pink and g elements of class Green. Then E(S) = -(p/(p+g)) log2(p/(p+g)) - (g/(p+g)) log2(g/(p+g)). Information Gain: the expected reduction in entropy, e.g., how much closer to purity a split gets us.
Information Gain as a tree-splitting criterion: assume that by testing attribute A (e.g., A > 1? vs. A <= 1?), the current set is partitioned into some number of child sets. The information gained by splitting on A is InformationGain(Split) = E(Parent set) - WeightedAvg[E(all child sets)] ("variable worth"). The higher the better!
Root: -(14/30)*log2(14/30) - (16/30)*log2(16/30) = 0.99. Split on Balance: Balance >= 50K: -(4/17)*log2(4/17) - (13/17)*log2(13/17) = 0.79; Balance < 50K: -(12/13)*log2(12/13) - (1/13)*log2(1/13) = 0.39; Information Gain = 0.99 - [(17/30)*0.79 + (13/30)*0.39] = 0.37. Split on Age: Age >= 45: -(3/14)*log2(3/14) - (11/14)*log2(11/14) = 0.75; Age < 45: -(13/16)*log2(13/16) - (3/16)*log2(3/16) = 0.70; Information Gain = 0.99 - [(14/30)*0.75 + (16/30)*0.70] = 0.27. Again, the Balance split wins.
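The same entropy and information-gain numbers, recomputed with a short standard-library sketch:

```python
# Reproducing the entropy / information-gain numbers above.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    total = sum(sum(c) for c in children)
    weighted = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = (16, 14)                                  # 16 bad, 14 good
print(entropy(parent))                             # ~0.99
print(info_gain(parent, [(4, 13), (12, 1)]))       # split on Balance: ~0.37
print(info_gain(parent, [(3, 11), (13, 3)]))       # split on Age:     ~0.27  -> choose Balance
```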
Brute-force (exhaustive) search: at each node, examine splits over each of the attributes (e.g., Balance < 50K vs. >= 50K) and select the attribute for which the maximum information gain is obtained. Example stopping rules (when to stop partitioning): when maximum purity is obtained (i.e., all examples reaching the node are of the same class); when additional splits yield no information gain; when all attributes have been used.
Common issue: overfitting. When you have few data points, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets. E.g., in a tiny sample, the rule "Wears green?" might perfectly separate females from males, yet it is unlikely to generalize.
Avoiding overfitting in classification: the generated tree may overfit the training data - too many branches, some of which may reflect anomalies due to noise or outliers. Symptom: good fit on the training dataset, poor fit on the testing data. Two approaches to avoid overfitting: Pre-pruning - halt tree construction early; do not split a node if this would cause the goodness measure to fall below a threshold. Post-pruning - remove branches from a fully grown tree, producing a sequence of progressively pruned trees.
Decision Tree Classification in a Nutshell. A decision tree is a flow-chart-like tree structure: an internal node denotes a test on an attribute, a branch represents an outcome of the test, and leaf nodes represent class labels or class distributions. Decision tree generation consists of two phases: tree construction (at the start, all training examples are at the root; partition the examples recursively based on selected attributes) and tree pruning (identify and remove branches that reflect noise or outliers, to avoid overfitting). Use of a decision tree for classification: test the attribute values of the sample against the decision tree.
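A minimal sketch of growing and pruning a tree with scikit-learn (assuming it is installed; the data comes from make_classification and is purely synthetic):

```python
# Pre-pruning vs. post-pruning with scikit-learn's DecisionTreeClassifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Pre-pruning: cap depth / minimum leaf size while growing the tree.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20).fit(X_train, y_train)

# Post-pruning: grow fully, then prune via cost-complexity parameter ccp_alpha.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("post-pruned", post)]:
    print(name, model.score(X_train, y_train), model.score(X_valid, y_valid))
```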
SAS EM Example: a non-profit organization tries to predict who is more likely to respond to a donation promotion (e.g., greeting cards).
SAS EM Example: a non-profit organization tries to predict who is more likely to respond to a donation promotion (e.g., greeting cards). What variables are needed? Target (output) variable? Inputs (features)? Level of observations?
SAS EM Example: a non-profit organization tries to predict who is more likely to respond to a donation promotion (e.g., greeting cards). Data set: PVA97NK.csv - 9,686 observations, 28 variables. Target variable: TargetB (Yes or No).
Edit Variables
Edit Variables
Gift Amount - Distribution
Prediction Overview: Classification (vs. Regression); Decision Trees; Regression: Linear Regression, Logistic Regression.
(Model-Free) Data Exploration & Visualization Unsupervised Learning (Pattern Discovery) (Market Basket Analysis, Association Rule, Clustering) Supervised Learning (Predictive Modeling) (Decision Tree, Linear Regression, Logistic Regression)
Group Assignment: Form a group of 4 or 5. Write a summary report about the major machine learning methodologies, covering both supervised and unsupervised learning. Compare the pros and cons of each method. Find a real-world business application for each method you discuss. Page limit: 5-10 pages, in English. Due: last day of class.
Day 1: BI & DA Overview, Business Cases - Individual Assignment Day 2: Machine Learning & Data Mining Basics - Group Assignment Day 3: Predictive Modeling vs. Causal Inferences - How to Interpret Regression Results - Causal Identification Strategies; - Economic Value of Online Word-of-Mouth; - Social Network Influence; - Multichannel Advertising Attribution; - Randomized Field Experiment of Mobile Recommendation. Day 4: Bridging Machine Learning with Social Science: - Case 1: Interplay Between Social Media & Search Engine; - Case 2: Understand and Predict Consumer Search and Purchase Behavior; - Case 3: Text Mining & Sponsored Search Advertising. 15 7