Hexaware Webinar Series Presents: Data Mining - Seeing the future and knowing the patterns of your business using your organization data. Rajesh Natarajan, Senior Consultant, Hexaware Technologies. Dec 5th, 12 pm Eastern Time. The presentation will begin momentarily.
A Global IT and BPO Service Provider. India's fastest growing mid-sized company.
32 offices worldwide; 55 Global 500 clients; 18 global locations; 17 years of technology outsourcing expertise; 166 clients served worldwide; USD 187 million revenues ('06); 6900 employees worldwide.
Business areas: Transportation, BFS, ERP/HRIT, Insurance.
Our mission: To build value for clients through innovative use of technology and talent.
Strategies and Strengths
LEADERSHIP THROUGH FOCUS
  Leading BFSI service provider with proprietary products (Operational Risk, Collections, Leasing, Wealth Management)
  #1 airlines services provider in India; 8 of top 10 airlines are our clients
  #1 provider of HR-IT services in India; 500+ projects, 750+ resources
  Specialized insurance service provider: content management, fraud management, workflow, SOX, BPO
ENHANCING VALUE
  Core Competency: management of business-critical applications offshore
  Organization Traits: consultative approach, responsive and result-oriented
  Robust Backbone: world-class infrastructure, flexible delivery models, SEI CMMi Level 5, BS7799
  Track Record: 88% repeat business; offshore transition expertise; global delivery
Data Mining - Seeing the future and knowing the patterns of your business using your organization data
Agenda
  Introduction: Data Mining and its necessity
  Data Mining Vs OLAP/Statistics
  A Perspective on Data Mining: Functionalities and Tasks
  A Process-oriented view of Data Mining
  Important Data Mining Techniques: Regression, Neural Networks, Cluster Analysis
  Data Mining Applications across different domains: Banking Analytics, Insurance Analytics, Airlines Analytics, Retail Analytics
  Data Mining Applications across different business functions: CRM Analytics, HR Analytics
  Hexaware's Analytics Offerings
Current Business Landscape
  Increased competition due to globalization
  Barriers to entry reduced due to factors such as the Internet, outsourcing and other innovative trends
  Advances in technology creating a level playing field, resulting in smaller margins
  Shorter life-span of product and service models due to dynamic environmental conditions
  Increasing pressure to reduce the time to market
  Higher customer expectations: quality, cost and customization
  Products catering to individual customers: marketing to one in a population of six billion
  Advances in data capture, processing, storage and retrieval technologies
  Data flood and information overload
Changing factors of competitive advantage: Size; Economies of scale; Economies of scope; Technology; Process efficiencies; New Services; Knowledge; Process and Product Innovation
Need for Business Intelligence and Analytics Data capture and storage has become cheap and convenient Transactional information captured as part of the business process Valuable customer, business and stakeholder information hidden in captured data Development of new disciplines that enable extraction of information from captured data Extracted information can be used as a competitive advantage to drive business innovations As the trend to gather more data from all kinds of processes will increase - so will the competitive pressure to derive value out of it. Analytics have increased in importance as enterprises recognize their potential for alleviating the paralyzing condition known as "info glut" an overwhelming information and data overload. Enterprises may pay for their failure to invest in analytics with decreased productivity and inferior decision making. -Gartner
Evolution of Database Technology
  1960s: Data collection, database creation, IMS and network DBMS
  1970s: Relational data model, relational DBMS implementation
  1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
  1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
Information Dimensions and Data Sources
Information dimensions: Customer Information; Business Information; Enterprise Operational Information; Economic Indicators
Data sources: Relational databases; Data warehouses; Transactional databases; Advanced DB and information repositories (object-oriented and object-relational databases, spatial databases, time-series and temporal data, text and multimedia databases, heterogeneous and legacy databases, WWW)
What is Data Mining? (Knowledge Discovery in Databases) Data Mining or Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data Validity: The discovered patterns must be valid on new data with some degree of certainty. Novelty: The patterns are novel (at least to the system). Novelty is measured with respect to changes in data Comparison of current values to previous or expected values Potentially Useful: The discovered patterns should lead to some useful actions Ultimately Understandable Enable better understanding of the underlying data and hence the domain
Data Mining is a multi-disciplinary field. It draws on: Database Management Systems, Visualization, Machine Learning, Statistics, Artificial Intelligence, Pattern Recognition, Expert Systems
What is Data Mining (Advanced Analytics)? Some experts look at Data Mining as a stage in the KDD (Knowledge Discovery in Databases) process Secondary analysis of data collected for some other purpose Customer segmentation using customer transactional and demographic data Traditional statistical techniques may not be applicable due to large sizes of the datasets Data Mining is an iterative and interactive process Emphasis is on automated/semi-automated techniques End-user (business-user) involvement is an important goal Descriptive and Predictive techniques answer the why, how, who, and what will happen type of business questions.
Moving from Data To Insights
Gartner's Hype Cycle
OLAP Vs Data Mining
On-Line Analytical Processing (OLAP): A presentation tool that reports on data. Designed for faster response. Metrics can be viewed at any level of detail. Hypothesis driven: the user makes hypotheses about the data and looks at the data for confirmation. Problems arise when there are a large number of variables.
Data Mining: Not a presentation tool. Discovers hidden, implicit, significant and actionable patterns in the data. Brings out all the hypotheses fitting the data (exploratory analysis). No bias: lets the data talk. Designed to produce generalizations about the data.
Statistical Analysis and Data Mining
Statistics and Data Mining Statistical Analysis Analysis of primary data Data collected to test specific hypothesis Experimental data also collected Data Mining Secondary data that is collected for other reasons Unbiased, but important data could be missing Typically, observational data Data Mining deals with large data bases Many databases do not lead to classical form of data organization, example, data that comes from the Internet There should be a link between the results of data mining and business actions
Statistics and Data Mining Two main criticisms of Data Mining In Data Mining, there is not just one theoretical model but several models in competition with each other Model is chosen depending on the data available Criticism 1: It is always possible to find a model, however complex, which will adapt well to the data Criticism 2: Great amount of data might lead to non-existent relationships being found in the data While choosing models great attention is paid to the possibility of generalizing results. Predictive performance is considered and more complex models are penalized
The Primary Tasks of Data Mining
High-level primary goals of data mining: Prediction and Description.
Prediction: using some variables or fields in a database to predict unknown or future values of a variable of interest. Techniques: Regression, Classification.
Description: the focus is on finding human-interpretable patterns describing the data. Techniques: Clustering, Summarization.
In the context of data mining, description tends to be more important than prediction.
Data Mining Functionalities or Tasks (1)
Concept description: characterization and discrimination. Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions.
Association (correlation and causality): multi-dimensional vs. single-dimensional association.
  Multi-dimensional: age(x, "20..29") ^ income(x, "20..29K") → buys(x, "PC") [support = 2%, confidence = 60%]
  Single-dimensional: contains(t, "computer") → contains(t, "software") [support = 1%, confidence = 75%]
Data Mining Functionalities or Tasks (2) Classification and Prediction Finding models (functions) that describe and distinguish classes or concepts for future prediction E.g., classify countries based on climate, or classify cars based on gas mileage Presentation: decision-tree, classification rule, neural network Prediction: Predict some unknown or missing numerical values Cluster analysis Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
Data Mining Functionalities or Tasks (3) Outlier analysis Outlier: a data object that does not comply with the general behavior of the data It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis Trend and evolution analysis Trend and deviation: regression analysis Sequential pattern mining, periodicity analysis Similarity-based analysis
A Process-oriented view of Data Mining
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Data Mining Process
Important Data Mining Techniques Regression Neural Networks Cluster Analysis
1. Regression: A Prediction Technique
Linear Regression: the target is continuous or interval-based.
Logistic Regression: the target is discrete.
Regression uses the following variable selection techniques:
  Forward Selection: first selects the best one-variable model, then the best two-variable model, and so on.
  Backward Selection: begins with the full model and drops, one at a time, the variable that is least significant (the cutoff is the stay p-value).
  Stepwise Selection: a modification of forward selection in which variables already in the model may be removed, depending on their level of significance.
For example: from a set of variables such as Age, Income, Gender, Location, Occupation, Years of experience, Education and many others, which technique can one use for variable selection?
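The forward selection idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stopping rule (a minimum relative drop in the residual sum of squares, `min_gain`) stands in for the significance test mentioned on the slide, and the variable names and data values are made up.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes the system is non-singular (no perfectly collinear variables)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, row))) ** 2
               for i, row in enumerate(design))


def forward_selection(data, y, min_gain=0.05):
    """Greedily add the variable that lowers RSS most; stop when the relative gain is small.
    min_gain is a hypothetical stand-in for a p-value entry cutoff."""
    selected, remaining = [], list(data)
    current = rss([], y)
    while remaining:
        scores = {name: rss([data[v] for v in selected + [name]], y) for name in remaining}
        best = min(scores, key=scores.get)
        if current <= 0 or (current - scores[best]) / current < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected


# Made-up data: the target depends mostly on income.
data = {
    "age": [25, 52, 31, 47, 38, 60, 29, 44],
    "income": [20, 35, 50, 65, 80, 95, 110, 125],
    "experience": [3, 20, 8, 15, 10, 30, 5, 12],
}
noise = [1.5, -2.0, 0.5, 2.5, -1.0, -0.5, 1.0, -2.0]
target = [2 * inc + 0.1 * age + e for inc, age, e in zip(data["income"], data["age"], noise)]
chosen = forward_selection(data, target)
print(chosen)
```

Backward and stepwise selection follow the same pattern, starting from the full model or allowing a removal step after each addition.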
2. Neural Networks
1. Neural networks are a class of flexible non-linear models used for supervised prediction problems.
2. They are based on the functioning of the human brain.
3. They enable us to construct, train and validate multilayer feed-forward neural networks.
4. Each hidden unit is a non-linear transformation of a linear combination of its inputs.
5. They have excellent predictive ability, but the results are difficult to interpret.
6. Learning mechanism: the network trains itself over time and on fresh data.
For example, neural networks are an industry standard for detecting fraud in the credit card industry.
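To make the hidden-unit description concrete, here is a minimal sketch of one forward pass through a feed-forward network with a single hidden layer. The choice of tanh for the hidden units, a logistic output, and all weight values are assumptions made for illustration; training is not shown.

```python
import math


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Return a score in (0, 1) for one input record."""
    # Each hidden unit: non-linear transformation (tanh) of a linear combination of the inputs.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Output unit: logistic transformation of a linear combination of the hidden units.
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for a network with two inputs and two hidden units.
score = forward(
    inputs=[0.4, -1.2],
    hidden_weights=[[0.5, -0.3], [0.8, 0.1]],
    hidden_biases=[0.0, -0.2],
    output_weights=[1.1, -0.7],
    output_bias=0.05,
)
print(score)
```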
Neural Networks: Examples
1. A neural network model can be used to study the effectiveness of a particular campaign by analyzing whether or not customers responded to a recent promotion.
2. The model can then be used to predict responses to future campaigns.
3. Target: Response (Yes = 1, No = 0).
4. Inputs: Age, Income, Married, FICO, Gender, Own Home, No. of purchases, Value of purchases and others.
5. Weights and graphs of variables are shown as output, to study the weight of the input variables.
6. The response rate of the campaign can be increased by using the results from these models to target the right customers.
3. Cluster Analysis
1. Cluster analysis is an unsupervised classification method that divides the data into classes that are homogeneous with respect to the inputs.
2. Clustering is based on the principle: maximize the intra-class similarity and minimize the inter-class similarity at the same time.
3. It helps in segmenting existing customers into groups and associating a distinct profile with each group, to help in future marketing strategies.
4. Common techniques: K-Means clustering, hierarchical clustering, etc.
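As an illustration of K-Means, the sketch below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The starting centroids, the fixed number of iterations, and the sample points are assumptions chosen to keep the example small.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-Means on tuples of numbers, using squared Euclidean distance."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up two-dimensional customer data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

In practice the inputs are usually standardized first, so that variables on large scales (such as income) do not dominate the distance.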
Cluster Analysis: Examples
Not all customers can be fitted into a single model. Clustering helps organize data into clusters, after which an individual model can be fitted for each cluster. For customer profile analysis in marketing, up-selling or default studies, clustering allows the dataset to be divided into groups.
Clusters can give results such as the following:
  Cluster 1: Married persons living in Climate Zone 10
  Cluster 2: Married persons living in Climate Zone 30
  Cluster 3: Married persons living in Climate Zone 20
  Cluster 4: Younger unmarried persons with lower FICO scores living in Climate Zone 10
  Cluster 5: Younger unmarried persons with higher incomes living in Climate Zone 20
  Cluster 6: Younger unmarried women living in Climate Zone 20 or 30
These clusters can be used for marketing, depending on the line of business and the campaign. For example, Cluster 1 may have a very high percentage of people who have taken home loans, so customers with a similar profile can be targeted for home loans.
Clustering Results
Application Areas
Applications: Prospect targeting; Call planning; Marketing optimization; Sales force optimization; Propensity to churn; Customer segmentation; Performance attribution; Funds/fees analysis; Fund benchmarking; Revenue forecasting; Demand forecasting; Probability of default; Loss given default; Probability of claim; Underwriting; Scoring; Fraud identification; ALM/FTP models (core segregation, attrition, matched maturity); Economic capital modeling; Clinical intelligence; Resource utilization analysis
Mining Techniques:
  Classification / Categorical Analysis: Logistic Regression, Support Vector Machine, Naïve Bayes, Adaptive Bayes
  Clustering / Association / Neighborhood Analysis: K Means, K Nearest Neighbor, O Cluster, Association
  Decision Trees: CART / CHAID
  Parameter Selection & Improvement: Factor Selection / Non-Negative Matrix Factorization / Attribute Importance
  Forecasting
  Optimization
  Others: Regression, ANOVA
Data Mining applications across different domains
Banking Analytics
Use of Decision Trees in Banking 1. Decision trees are used to select the best course of action in situations where you face uncertainty. 2. A decision tree is a predictive model; that is, a mapping of observations about an item to conclusions about the item's target value. For example, What rules should a bank follow so that the response rate for a population to a Personal Loan marketing is 80%?
Case Study for Decision Tree
Root: N = 5000, 10% BAD
  Debt_Inc < 45?
    Yes: N = 3350, 5% BAD
      No. of Delinquent Trade lines < 2?
        Yes: N = 2500, 1% BAD
        No: N = 1000, 15% BAD
    No: N = 1650, 21% BAD
      Income > 500000 p.a.?
        Yes: N = 650, 4% BAD
        No: N = 1000, 30% BAD
Note: The probability of a bad home loan is highest (30%) when Debt_Inc is not below 45 and Income does not exceed the income split value.
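A decision tree of this kind translates directly into nested rules. The sketch below assumes the tree on this slide reads as: split first on Debt_Inc < 45, then on the number of delinquent trade lines (< 2) on the "yes" side and on Income > 500000 p.a. on the "no" side. The function name and argument names are made up for illustration.

```python
def bad_rate(debt_inc, delinquent_trade_lines, income):
    """Return the share of bad loans in the leaf that a record falls into."""
    if debt_inc < 45:
        # Low debt-to-income: the delinquency history decides the leaf.
        return 0.01 if delinquent_trade_lines < 2 else 0.15
    # High debt-to-income: the income level decides the leaf.
    return 0.04 if income > 500000 else 0.30


# A hypothetical applicant with high debt-to-income and income below the split value.
print(bad_rate(debt_inc=50, delinquent_trade_lines=0, income=300000))
```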
Logistic Regression in Banking
Logistic regression can be used to plan a marketing campaign that up-sells a credit card to customers who already hold accounts with the bank.
Predicted variable: a 1/0 target, where Possessing a credit card = 1 and Not possessing a credit card = 0.
Input variables: Income; Age; Home Owner; Service; Account Age; Account Balance; Other credit cards.
Note: This helps detect the most important predictor variables for effective targeting of credit cards.
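Once a logistic regression has been fitted, scoring a customer is a matter of passing a weighted sum of the inputs through the logistic function. The sketch below shows that step only; the coefficient values, the intercept and the customer record are hypothetical and are not results from the webinar.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Probability that the target equals 1 under a fitted logistic regression."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratios(coefficients):
    """exp(coefficient): multiplicative change in the odds for a one-unit increase in an input."""
    return {name: math.exp(value) for name, value in coefficients.items()}


# Hypothetical fitted coefficients for three of the input variables.
coefficients = {"income_thousands": 0.02, "home_owner": 0.8, "other_credit_cards": -0.5}
customer = {"income_thousands": 60, "home_owner": 1, "other_credit_cards": 2}
print(predict_probability(customer, coefficients, intercept=-2.0))
print(odds_ratios(coefficients))
```

Inputs with odds ratios far from 1 are the ones that move the predicted probability most per unit change, which is one way to read "most important predictor" here.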
Multiple Linear Regression: Example A payment behavior modeling of customers can be done through multiple regression model. Input variables: Income Age Gender Payment Pattern No. of times delinquent Homeowner No. of dependents Total amount of loans Target Variable- Ordinal variables(1-10) can be used to signify bad- excellent payment patterns. Prediction of future payment can be done based on this multiple regression model.
Fitting a Regression Model
Regression Results
Insurance Analytics
Analytics in Insurance Claims Servicing Underwriting Customer Management Modeling Probability of Claim Optimize Claims Processing Detect and Prevent Claims Fraud Pure Premium Modeling Claim Frequency Modeling Claim Severity Modeling Claims Estimation Analysis Pricing of insurance policies Risk assessment and Pricing Portfolio Risk Management Business Performance Analysis Customer Segmentation Attrition / Prediction Analytics Cross-Sell, Up-sell of Products Development of New Products Customer Retention Strategy Data Mining Techniques Clustering Association Analysis Linear Regression Logistic Regression Variable Selection Decision Trees Multivariate Analysis Neural Networks Outlier Detection Support Vector Machine
Regression Outcome Shows the effect of each independent variable on probability of Claim.
Airlines Analytics
Customer Analytics Predicting Campaign Effectiveness and targeting the correct customer segments Business Goal Optimize marketing effectiveness by: minimizing the cost per campaign per customer maximizing revenue per campaign per customer Problem Definition To achieve the goal of optimizing marketing effectiveness the following parameters that yield the most optimal results have to be identified: customer segments campaign medium offer type Solution Approach Use historic campaign data to statistically determine customer segments Analyze response data from past campaigns with respect to revenue achieved from the campaign for a given customer Use the above analysis to come up with a model that predicts probability of revenue generation given a combination of the parameters customer demographics, campaign medium, offer type For all future campaigns use the model to effectively use the right medium to the appropriate customer segment
Campaign Analysis & Segmentation Customer Segmentation based on Campaign Channel in each Campaign Type a. Retail Fare Sale i. E-mail ii. Direct Mail iii. Advertisement iv. Partner b. Competitive Response sale i. E-mail ii. Direct Mail iii. Advertisement iv. Partner c. Frequent Flyer Acquisition Drive i. E-mail ii. Direct Mail iii. Advertisement iv. Partner d. Partner Offer i. E-mail ii. Direct Mail iii. Advertisement iv. Partner Maximum Response Rate & Profitability In which Channel for which campaign type
Defect Cost Analysis Defect wise cost analysis across different part types and quarters
Retail Analytics
Market Basket Analysis: Association Rule Mining
An association rule is a statement of the form (Beer) → (Diapers).
Support of the rule Beer → Diapers is estimated by:
  (number of transactions that contain both Beer and Diapers) / (total number of transactions in the database)
Confidence of the rule Beer → Diapers is the conditional probability that a transaction contains item set B (Diapers) given that it contains item set A (Beer):
  (number of transactions that contain both Beer and Diapers) / (number of transactions that contain Beer)
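Case Study for Association Analysis
                 Checking: No   Checking: Yes   Total
Savings: No           500           3500         4000
Savings: Yes         1000           5000         6000
Total                                           10000
Support (SVG → CK) = 5000/10000 = 50%
Confidence (SVG → CK) = 5000/6000 = 83%
Lift (SVG → CK) = 0.83/0.85 < 1, where 0.85 = (3500 + 5000)/10000 is the overall share of customers with a checking account.
Judged by support and confidence alone, this looks like a strong rule. However, those without a savings account are even more likely to have a checking account (3500/4000 = 87.5%), so savings and checking are negatively correlated.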
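The three measures in this case study can be reproduced with a few lines of Python. The counts come from the slide (5000 customers with both accounts, 6000 with savings, 3500 + 5000 = 8500 with checking, 10000 in total); the function and argument names are made up for illustration.

```python
def rule_metrics(n_both, n_antecedent, n_consequent, n_total):
    """Support, confidence and lift of the rule antecedent -> consequent, from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    # Lift compares the rule's confidence with the baseline frequency of the consequent.
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


# Rule: Savings -> Checking.
support, confidence, lift = rule_metrics(
    n_both=5000, n_antecedent=6000, n_consequent=8500, n_total=10000
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift below 1 means the antecedent makes the consequent less likely than its baseline rate, which is why a rule with high support and confidence can still be misleading.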
Association Analysis - Results
Data Mining applications across different business functions
CRM Analytics
CRM Analytics New Business Analysis Understanding Customer Profitability Understanding Customer Retention/Attrition Deepening Customer Relationships Cross selling / Upselling Cross Product Holding Acquisition Pattern Service Request Analysis Customer Behavior Targeted Campaign Response analysis Customer Profiling Analytic Channel Behavior Process Maps Analytic and Mining Models Off the Shelf CRM Tools
CRM Process Flow Cross Sell Upsell Churn Retention Segmentation Customer transition to higher segment Campaign Objective Campaign Effectiveness Evaluation Model (Response & Returns to Cost) Reporting & Analysis (use SAS BI/MicroStrategy/MS Excel) Owner of Campaign Budget, Costs and Segmentation Other constraints Campaign Description Customer Selection Assignment Model Right Product/Right Customer/Right Channel using SAS linear programming Tele Marketing Direct Mail Customer Attributes Age/ Gender/ Relationship details Data Mining Techniques Model Selection Customer Exclusions Criteria Customer Selection Based on other attributes Filtered Customer List Select Customers who did not Respond to past campaign and are poor on other parameters Response Propensity Purchase Score Credit Score Relationship value Filtered Customer List based on how old the relationship is Other Business Criteria Split Customer List Response To Multiple Campaigns Stored as history Siebel Schedules & Executes Campaign Walk in to Advisors Other Medium Data Mining Techniques Response Propensity Score for maximum likelihood of response Model Selection
HR Analytics
HR Analytics - The potential Demand Vs Supply Growth Vs Capability Aspiration Vs Actual Required Vs Skill Separation Vs Retention Benefits Vs Cost Rewards Vs Budgets Measurement Vs Measurable Hire to Retire HR Function Challenges Demographics Productivity Health & Safety Relations & Satisfaction Turnover & Mobility Planning Staffing & Recruiting Compensation & Benefits Training & Development
Hexaware HR Analytic Domain Coverage Recruitment & Attrition Succession Planning Learning & Skill dev Workforce HR Functions Compensation & Rewards Potential & Performance Derive Hindsight, Insight and Foresight into your HR to derive Strategic imperatives DW, OLAP & Reporting Hindsight Statistical & Data Mining Techniques Foresight & Insight Data models/hierarchy/metrics Reports DB / KPIs Compensation Analysis Rewards & Benefits Administration Salary / Compensation information including overtime costs Cost per Hire Employee Analysis Work Force Profile & Compliance Analysis Headcount/Turnover HR Performance Analysis Impact of absenteeism and tardiness Training cost per employee Impact on performance Recruiting & Talent Management
Workforce Analysis: Hexaware Jumpstart Analytic Pack
1. Work Force Analysis: a. Headcount Analysis b. Turnover Analysis c. Hiring Trend Analysis d. Overtime Usage Analysis e. Affirmative Action Analysis
2. Compensation Analysis: a. Variable Compensation Analysis b. Merit Distribution Analysis c. Compa-Ratio Analysis d. Total Compensation Analysis e. Year Over Year Analysis
3. Benefits Analysis: a. Plan Participation Analysis c. Benefits Cost Analysis d. Disability Fraud Analysis
Hexaware's Analytics Offerings
Analytics & Data Mining
Q & A You can also reach us at biinnovations@hexaware.com
Thank You For Attending For a recording of this webinar please visit: http://www.hexaware.com/webcastarchive1.html Download Case Studies & Whitepapers at: www.hexaware.com Upcoming Webinars Hexaware Webinar Series How can banks leverage analytics across various perspectives protecting their current investment in technology? Extend PeopleSoft Enterprise Applications with Oracle Fusion Register Today!