Nordic GSE 2013, S506: Exploitation of Predictive Analytics on System z, an End-to-End Walk-Through
Wang Enzhong (wangec@cn.ibm.com), Technical and Technology Enablement, System z Brand, IBM Systems and Technology Group
Wei Kewei (weikewei@cn.ibm.com), DB2 for z/OS Optimizer Development and Tuning, IBM
Agenda
- Overview of business analytics on System z
- Cross-sell end-to-end solution
- Real-time anti-fraud detection for credit cards (case study and best practices)
- Q&A
Business Analytics Landscape (based on Competing on Analytics, Davenport and Harris, 2007)
The degree of complexity, and the competitive advantage, rises from descriptive reporting through predictive analytics to prescriptive analytics, each stage answering a different question:
- Standard reporting (descriptive): What happened?
- Ad hoc reporting: How many, how often, where?
- Query/drill down: What exactly is the problem?
- Alerts: What actions are needed?
- Forecasting: What if these trends continue?
- Simulation: What will happen if?
- Predictive modeling (predictive): What will happen next?
- Optimization (prescriptive): How can we achieve the best outcome?
- Stochastic optimization: How can we achieve the best outcome, including the effects of variability?
Compute- and data-intensive parallel algorithms are increasingly prevalent in commercial workloads, driven by real-time decision-making requirements and industry-wide limits on increasing thread speed.
Critical Components in Contemporary BA Solutions (architecture diagram)
- Operational source systems holding structured/unstructured data feed, via ELT and batch copy, a data warehouse with data marts and a DW/BA application server (data intensive)
- A BA modeler engine trains models (numerically intensive)
- The transactional system computes scores by calling a BA scoring engine
DB2 Analytics Accelerator V3.1: capitalizing on the best of both worlds, System z and Netezza
What is it? The IBM DB2 Analytics Accelerator is a workload-optimized appliance add-on that enables the integration of business insights into operational processes to drive winning strategies. It accelerates select queries with unprecedented response times.
How is it different?
- Performance: unprecedented response times enable 'train of thought' analyses frequently blocked by poor query performance.
- Integration: deep integration with DB2 provides transparency to all applications.
- Self-managed workloads: queries are executed in the most efficient location.
- Transparency: applications connected to DB2 are entirely unaware of the Accelerator.
- Simplified administration: hands-free appliance operation eliminates most database tuning tasks.
Breakthrough technology enabling new opportunities
DB2 Analytics Accelerator V3.1: lowering the costs of trusted analytics
What's new?
- High Performance Storage Saver: store a DB2 table or partition solely on the Accelerator, removing the requirement for the data to be replicated on both DB2 and the Accelerator.
- Incremental update: tables within the Accelerator can be continually updated throughout the day.
- zEnterprise EC12 support: Version 3 supports the zEnterprise EC12, z196, and z114 System z platforms.
- Query prioritization: brings System z workload management down to the individual query being routed to the Accelerator.
- High capacity: support has been extended to the entire Netezza 1000 line (1.28 PB).
- UNLOAD Lite: reduces z/OS MIPS consumption by moving unload preparation off System z.
Deep DB2 Integration within zEnterprise (architecture diagram)
Applications reach DB2 for z/OS through standard SQL application interfaces; DBA tools and the z/OS console use operational interfaces (e.g., DB2 commands). Inside DB2, the data manager, buffer manager, IRLM, and log manager are complemented by the IBM DB2 Analytics Accelerator. z/OS on System z provides superior availability, reliability, security, and workload management; the Accelerator provides superior performance on analytic queries.
Bringing Netezza AMPP(TM) Architecture to DB2 (AMPP = Asymmetric Massively Parallel Processing)
DB2 for z/OS fronts advanced analytics, BI, legacy reporting, and DBA workloads. Inside the IBM DB2 Analytics Accelerator, an SMP host connects over a network fabric to S-Blades, each pairing CPUs and FPGAs with memory, backed by disk enclosures.
Large Insurance Company Business Reporting: "we had this up and running in days, with queries that ran over 1000 times faster"

Query results, DB2 only vs. DB2 Analytics Accelerator (Netezza 1000-12):

Query     Rows Reviewed  Rows Returned  DB2 Only         Accelerator  Times Faster
Query 1   2,813,571      853,320        2:39 (9,540 s)   5 s          1,908
Query 2   2,813,571      585,780        2:16 (8,220 s)   5 s          1,644
Query 3   8,260,214      274            1:16 (4,560 s)   6 s          760
Query 4   2,813,571      601,197        1:08 (4,080 s)   5 s          816
Query 5   3,422,765      508            0:57 (4,080 s)   70 s         58
Query 6   4,290,648      165            0:53 (3,180 s)   6 s          530
Query 7   361,521        58,236         0:51 (3,120 s)   4 s          780
Query 8   3,425.29       724            0:44 (2,640 s)   2 s          1,320
Query 9   4,130,107      137            0:42 (2,520 s)   193 s        13

Production ready with 1 person in 2 days; table acceleration set up in 2 hours: add the Accelerator, choose a table for acceleration, load the table (DB2 loads data to the Accelerator), knowledge transfer, query comparisons.
Initial load performance: 400 GB (570 million rows) loaded in 29 minutes; 800 GB to 1.3 TB per hour.
Extreme query acceleration: 1,908x faster (2 hours 39 minutes down to 5 seconds); CPU utilization reduced to 35%.
IBM SPSS Technology
IBM SPSS technology drives the widespread use of data in decision making through statistics-based analysis of data and the deployment of predictive analytics into the decision-making process.
What is predictive analytics?
- Predictive analytics is a business intelligence technology that predicts what is likely to happen in the future by analyzing patterns in past data.
- Predictions are delivered in the form of scores generated by a predictive model that has been trained on historical data.
- With predictive analytics, the enterprise learns from its cumulative experience (data) and takes action to apply what has been learned.
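The train-on-history, score-new-records pattern described above can be sketched in a few lines. This is a minimal illustration using a toy nearest-centroid "model" and made-up data; it is an assumption for clarity, not the SPSS algorithms used in the deck.

```python
def train(history):
    """Learn one centroid (per-feature mean) per outcome label."""
    sums, counts = {}, {}
    for features, label in history:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def score(model, features):
    """Return a {label: score} dict; a closer centroid scores higher."""
    def dist(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid)) ** 0.5
    return {label: 1.0 / (1.0 + dist(c)) for label, c in model.items()}

# Historical data: (feature vector, observed outcome)
history = [([1.0, 0.0], "churn"), ([0.9, 0.1], "churn"),
           ([0.0, 1.0], "stay"), ([0.1, 0.9], "stay")]
model = train(history)
scores = score(model, [0.95, 0.05])   # a new, unseen record
print(max(scores, key=scores.get))    # → churn
```

The point is the split of roles: training happens once over historical data, while scoring is a cheap per-record lookup that can run inside a transaction.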
Business Analytics: Real-Time Predictive Analytics
Support for both in-transaction and in-database scoring on the same platform, as an end-to-end solution:
- Customer interaction data flows into the business system/OLTP (DB2 for z/OS) and, via ETL copy (real time, minutes, hours, weekly, or monthly), into a historical data store (DB2 for z/OS).
- SPSS Modeler for Linux on System z trains the scoring algorithm against the latest data, with automated model updates.
- Real-time scores/decisions are returned to the application.
Benefits: reduced networking, consolidated resources, and the ability to meet and exceed SLAs.
The Three Pillars of Predictive Analytics

Predictive Customer Analytics: Acquire, Grow, Retain
- Acquire customers: understand who your best customers are, connect with them in the right ways, and take the best action to maximize what you sell to them.
- Grow customers: understand the best mix of things needed by your customers and channels, maximize the revenue received from them, and take the best action at every interaction.
- Retain customers: understand what makes your customers leave and what makes them stay, keep your best customers happy, and take action to prevent them from leaving.

Predictive Operational Analytics: Manage, Maintain, Maximize
- Manage operations: maximize the usage of your assets, make sure inventory and resources are in the right place at the right time, and identify the impact of investment.
- Maintain infrastructure: understand what causes failure in your assets, maximize their uptime, and reduce the costs of upkeep.
- Maximize capital efficiency: improve the efficiency and effectiveness of your assets, reduce operational costs, and drive operational excellence in all phases: procurement, development, availability, and distribution.

Predictive Threat & Fraud Analytics: Monitor, Detect, Control
- Monitor environments: identify leaks, increase compliance, and leverage insights in critical business functions.
- Detect suspicious activity: identify fraudulent patterns, reduce false positives, identify collusive and fraudulent merchants and employees, and identify unanticipated transaction patterns.
- Control outcomes: take action in real time to prevent abuse, reduce claims handling time, and alert clients of transaction fraud.
Data Warehousing and Business Analytics with zEnterprise and IBM DB2 Analytics Accelerator
On zEnterprise EC12/z196 and z114: operational source systems with structured/unstructured data feed, via ELT (InfoSphere Warehouse 9.7.3, Information Server) and batch copy, the data warehouse (DB2 for z/OS) with data marts and the DW/BA application server (Cognos 10.1 BI). The transactional system (DB2 for z/OS) computes scores through DB2 UDFs backed by the BA scoring engine (DB2 10 Accessory Suite), with models built in the BA modeler engine (SPSS Modeler 15). The IBM DB2 Analytics Accelerator attaches to the data warehouse.
Complex analytical queries requiring extensive table scans of large historical data sets run on the IBM DB2 Analytics Accelerator. Results returned from the analysis can be joined with current or near-real-time data in the data warehouse on System z to deliver immediate recommendations, creating, in effect, a high-performance operational BI service.
Online Transactional and Analytics Processing (OLTAP)
- Operational systems (OLTP): a DB2 for z/OS data sharing group handles OLTP plus real-time transaction scoring as an additional dimension; data flows via ELT/ETL into the enterprise data warehouse.
- Enterprise data warehouse (z/OS LPAR): DB2 for z/OS EDWH performs batch scoring and model refresh; batch scoring over static data is offloaded to the IBM DB2 Analytics Accelerator (Netezza technology).
- Linux on System z: SPSS Statistics and Modeler for model construction; InfoSphere Warehouse on System z (SQW and Cubing Services); Cognos BI and reporting.
The solution spans z/OS and Linux on System z.
Agenda
- Overview of business analytics on System z
- Cross-sell end-to-end solution
- Real-time anti-fraud detection for credit cards (case study and best practices)
- Q&A
User Scenario: Real-Time Product Recommendation
"Predictive analytics helps connect data to effective action by drawing reliable conclusions about current conditions and future events." (Gareth Herschel, Research Director, Gartner Group)
By learning from shopping history data, recommend in real time what the customer is likely to buy, to increase cross-sell.
Data Mining Methodology: CRISP-DM (CRoss-Industry Standard Process for Data Mining)
- Business understanding: determining business objectives, assessing the situation, determining data mining goals, and producing a project plan.
- Data understanding: this phase addresses the need to understand your data resources and their characteristics. It includes collecting initial data, describing data, exploring data, and verifying data quality.
- Data preparation: selecting, cleaning, constructing, integrating, and formatting data.
- Modeling: sophisticated analysis methods are used to extract information from the data. This phase involves selecting modeling techniques, generating test designs, and building and assessing models.
- Evaluation: evaluate how the data mining results can help you achieve your business objectives. Elements of this phase include evaluating results, reviewing the data mining process, and determining next steps.
- Deployment: integrating your new knowledge into everyday business processes to solve the original business problem. This phase includes deployment planning, monitoring and maintenance, producing a final report, and reviewing the project.
Predictive Analytics Process
1. Select a model.
2. Train the model using historical data from the data warehouse (e.g., order line items with orderkey, partkey, suppkey, linenumber, and quantity).
3. Deploy the model for real-time scoring of OLTP input data.
Cross-Sell Solution Overview
- Business understanding: recommend related products based on historical data to grow profit.
- Data understanding: distorted TPC-H data (e-commerce domain) to simulate a real situation.
- Data preparation: transform order details into vectors, grouped by order ID.
- Modeling: the Apriori association algorithm discovers which products are usually purchased together.
- Evaluation: use another dataset to evaluate whether the discovered patterns are correct.
- Deployment: deploy the model into the OLTP environment as a UDF in DB2 for z/OS and as a web service on Linux on System z.
Data Understanding (TPC-H-style schema)
- PART (product information): PARTKEY, NAME, MFGR, BRAND, TYPE, SIZE, CONTAINER, RETAILPRICE, COMMENT
- SUPPLIER (supplier information): SUPPKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, COMMENT
- PARTSUPP (product availability at a given supplier): PARTKEY, SUPPKEY, AVAILQTY, SUPPLYCOST, COMMENT
- CUSTOMER (customer information): CUSTKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, MKTSEGMENT, COMMENT
- ORDERS (order information): ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDERPRIORITY, CLERK, SHIPPRIORITY, COMMENT
- LINEITEM (products purchased in one order): ORDERKEY, PARTKEY, SUPPKEY, LINENUMBER, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT
One PART and one SUPPLIER relate to many PARTSUPP rows; one ORDERS row relates to many LINEITEM rows.
Data Preparation
Source: shopping history data, one row per order line:

orderkey   partkey  suppkey  linenumber  quantity
15711815   27       12636    1           26
15711815   40       16799    2           47
15711815   39       17390    3           41
15711815   45       7496     4           19
15711815   32       21483    5           37
15711815   11       18212    6           48
15711815   28       22274    7           13

Output: one binary vector per order, indexed by partkey, with a 1 for each part purchased in that order. For example, order 15711815 has 1s at partkeys 11, 27, 28, 32, 39, 40, and 45, and 0s elsewhere.
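The pivot above, from line-item rows to one binary basket vector per order, can be sketched as follows. The tiny partkey universe and the second order are illustrative assumptions; only the first order's partkeys come from the slide.

```python
rows = [  # (orderkey, partkey) pairs from the shopping history
    (15711815, 27), (15711815, 40), (15711815, 39), (15711815, 45),
    (15711815, 32), (15711815, 11), (15711815, 28),
    (15711816, 11), (15711816, 27),   # a second, hypothetical order
]

all_parts = sorted({p for _, p in rows})   # the partkey "columns"

def to_vectors(rows):
    """Group rows by order and emit one 0/1 vector per orderkey."""
    baskets = {}
    for orderkey, partkey in rows:
        baskets.setdefault(orderkey, set()).add(partkey)
    return {o: [1 if p in parts else 0 for p in all_parts]
            for o, parts in baskets.items()}

vectors = to_vectors(rows)
print(all_parts)            # → [11, 27, 28, 32, 39, 40, 45]
print(vectors[15711816])    # → [1, 1, 0, 0, 0, 0, 0]
```

In production this grouping would run as SQL against LINEITEM; the shape of the result, a fixed-width 0/1 row per order, is what the association model consumes.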
Modeling
Model input: the per-order binary vectors produced in data preparation.
Model setup: given the antecedent items (the products already in the basket), the top 3 consequents with the highest confidence are returned as the products the customer is most likely to buy.
Deployment
- In-database: the trained model is published from SPSS Modeler through the in-database SPSS scoring adapter into DB2 for z/OS, where it is invoked as a UDF.
- Ex-database: the model is deployed to the Collaboration and Deployment Services repository, and application servers call its scoring service.
Scoring
Scoring can be invoked either in-database (the UDF inside DB2 for z/OS) or ex-database (the remote scoring service).
Scoring Application
The top 3 products with the highest scores, i.e., those the customer is most likely to buy, are recommended.
Agenda
- Overview of business analytics on System z
- Cross-sell end-to-end solution
- Real-time anti-fraud detection for credit cards (case study and best practices)
- Q&A
User Scenario: Real-Time Credit Card Fraud Detection
1. Customer action: credit card payment, producing an authorization request message.
2. Authorization system: validation and scoring of the authorization request.
3. Fraud detection engine: scoring and case opening.
4. Case management system: the case manager assigns tasks to anti-fraud investigators.
5. Investigators contact the customer and investigate the case.
Real-Time Anti-Fraud Detection Overview
- Business understanding: identify fraudulent credit card transactions based on historical data to mitigate risk.
- Data understanding: N years of historical data from a real banking customer.
- Data preparation: transform transaction details into vectors, grouped by account ID.
- Modeling: a neural network algorithm assigns a fraud score to each credit card transaction.
- Evaluation: use N months of historical data to evaluate whether the patterns are correct.
- Deployment: deploy the model into the OLTP environment as a UDF in DB2 for z/OS.
Data Understanding
- Transaction message (format of the incoming transaction): plastic number, transaction number, transaction amount, transaction currency, transaction region number, transaction business number, available balance, control balance, transaction type (CR/DR), post amount, post currency.
- High-risk region table: region numbers marked as high risk.
- High-risk business table: business numbers marked as high risk.
- TXN history table (transaction history): unique transaction ID, issuer branch zone, account no., card no., card flag, card type, expiration date, credit limit, transaction amount, transaction currency.
- Customer table (customer information): customer ID, name, gender.
- Account table (account information): account no., account type, balance.
- Feature table (new): account number, timestamp of last transaction, accumulated attributes (36), derived attributes (24).
The timestamp of last transaction records when the last transaction on this card occurred; it is used to calculate the time difference between the current and last transaction, so that transactions more than 7 days apart can be omitted. The accumulated/derived attributes are the variables fed into the model for scoring (more detail on the next slides).
Feature Table
Columns: account number, timestamp of last transaction, accumulation attributes, and derived attributes (ratios or differences between accumulation attributes, e.g., 7D-1D and 3H/7D), the latter calculated in real time.
Accumulation attributes are kept over 3-hour (3H), 1-day (1D), and 7-day (7D) windows. The 7-day base is calculated in batch and then increased by the current transaction in the online process: the accumulation over the last 7 days is built in batch, and the current transaction is added on top in real time. For each window, the table accumulates the amount and number of:
- transactions in high-risk regions
- transactions in the early morning
- transactions at high-risk merchants
- failed transactions
- high-amount transactions
- all transactions
Derived attributes are calculated in real time from the accumulated attributes.
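The feature-table mechanics above, windowed accumulations plus a batch base incremented online and real-time derived ratios, can be sketched as follows. The field names, window math, and sample amounts are illustrative assumptions, not the bank's actual 36 accumulated and 24 derived attributes.

```python
H3, D1, D7 = 3 * 3600, 24 * 3600, 7 * 24 * 3600   # window sizes in seconds

def accumulate(history, now):
    """Batch step: windowed sums/counts from one account's history,
    where history is a list of (timestamp, amount) pairs."""
    feats = {"amt_3h": 0.0, "amt_1d": 0.0, "amt_7d": 0.0,
             "cnt_3h": 0, "cnt_1d": 0, "cnt_7d": 0}
    for ts, amount in history:
        age = now - ts
        if age <= D7:
            feats["amt_7d"] += amount; feats["cnt_7d"] += 1
        if age <= D1:
            feats["amt_1d"] += amount; feats["cnt_1d"] += 1
        if age <= H3:
            feats["amt_3h"] += amount; feats["cnt_3h"] += 1
    return feats

def online_features(batch_feats, txn_amount):
    """Online step: add the current transaction to the batch base,
    then derive ratio/difference attributes for the scoring model."""
    f = dict(batch_feats)
    for key in ("amt_3h", "amt_1d", "amt_7d"):
        f[key] += txn_amount
    for key in ("cnt_3h", "cnt_1d", "cnt_7d"):
        f[key] += 1
    f["amt_7d_minus_1d"] = f["amt_7d"] - f["amt_1d"]          # 7D-1D
    f["amt_3h_over_7d"] = (f["amt_3h"] / f["amt_7d"]          # 3H/7D
                           if f["amt_7d"] else 0.0)
    return f

now = 1_000_000
base = accumulate([(now - 2 * 3600, 50.0), (now - 2 * 86400, 200.0)], now)
feats = online_features(base, 100.0)   # the incoming transaction
print(feats["amt_7d"], round(feats["amt_3h_over_7d"], 3))   # → 350.0 0.429
```

A spike in the 3H/7D ratio flags an account whose recent activity is out of line with its weekly pattern, which is exactly the kind of signal the neural network consumes.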
End-to-End Flow
Online transaction task (real-time feature calculation, for the 3-hour/1-day windows):
1. Get the profile (SQL SELECT against the profile table).
2. Calculate features, combining the batch base with the current transaction.
3. Update the profile (SQL UPDATE).
4. Prepare SPSS adapter parameters and call the SPSS adapter.
5. Apply business rules to the returned score.
6. Update the history and sample database.
A separate batch feature-calculation task computes the base for the 7-day statistics from the history data. The whole flow runs on the mainframe.
Real-Time Scoring Performance Comparison: remote scoring vs. in-database (UDF) scoring
DB2 on z/OS: z196 LPAR with 2 CPs; SPSS on Linux on z: z196 LPAR with 2 IFLs.

Configuration                  Trx/sec  Response time (ms)
Remote DB, remote score        414      20.2
Remote DB, in-DB score         1,755    11.6
Local access, remote score     578      6.9
Local access, in-DB score      4,050    1

Measurements optimized for maximum throughput on a fully utilized system. Response times include the full transaction with multiple DB accesses.
Pure SQL vs. Scoring Adapter (UDFs) for Model Scoring
Pure SQL:
- Difficult to support some model scoring algorithms
- Requires a SQL mapping to be constructed for each model type
- Resulting SQL will run on many database systems
- No database extensions required
- Performance/reliability harder to predict
- Harder to generate SQL to score ensemble models
Scoring adapter (UDFs):
- Easily supports a large class of scoring algorithms
- Reuses the existing scoring component to score each model type
- Needs to be adapted for each database system requiring support
- Requires database extensions to be installed
- Performance/reliability easier to predict
- Easier to score ensemble models
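To make "a SQL mapping constructed for each model type" concrete, here is a sketch that compiles a trained model into a portable CASE expression. The toy two-split decision tree and the column names (TXN_AMOUNT, CNT_3H) are illustrative assumptions, not an SPSS export format; real model-to-SQL translators follow the same recursive idea.

```python
# A trained decision tree, represented as nested dicts (hypothetical).
tree = {"feature": "TXN_AMOUNT", "threshold": 500,
        "low": {"score": 0.05},
        "high": {"feature": "CNT_3H", "threshold": 5,
                 "low": {"score": 0.40}, "high": {"score": 0.90}}}

def to_sql(node):
    """Recursively translate a tree node into a SQL CASE expression."""
    if "score" in node:            # leaf: emit the score literal
        return str(node["score"])
    return ("CASE WHEN {f} <= {t} THEN {lo} ELSE {hi} END"
            .format(f=node["feature"], t=node["threshold"],
                    lo=to_sql(node["low"]), hi=to_sql(node["high"])))

print("SELECT " + to_sql(tree) + " AS FRAUD_SCORE FROM TXN")
```

This also illustrates the slide's trade-off: the generated SQL runs on any database with no extensions, but every model type needs its own translator, and ensembles of many trees quickly produce unwieldy statements, which is where the UDF adapter wins.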
Questions?
Controlling Outcomes with Predictive Analytics
- Capture: demographic data, transaction data, external data, and domain expertise.
- Predict: analyses (segments, profiles, scoring models, anomaly detection, ...); reports, KPIs, and KPPs; define the indicator list, assign a weight (points) to each indicator, and score.
- Act: define thresholds and determine the level of risk.
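The define-weights, score, threshold-to-risk-level pattern above can be sketched in a few lines. All indicator names, point values, and thresholds here are illustrative assumptions a domain expert would replace.

```python
# Expert-assigned points per fraud indicator (hypothetical values).
WEIGHTS = {"high_risk_region": 30, "early_morning_txn": 10,
           "high_amount": 25, "failed_txn_burst": 35}

# (minimum points, risk level), checked in descending order.
THRESHOLDS = [(70, "high"), (40, "medium"), (0, "low")]

def risk_level(indicators):
    """Sum the points of the triggered indicators for one case and
    map the total to a risk level via the thresholds."""
    points = sum(WEIGHTS.get(name, 0) for name in indicators)
    for floor, level in THRESHOLDS:
        if points >= floor:
            return points, level

print(risk_level({"high_risk_region", "failed_txn_burst"}))  # → (65, 'medium')
```

The Act step then keys off the level: for example, a "high" case opens an investigation in the case management system, while "medium" may only alert the client.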