ANALYTICS MODERNIZATION TRENDS, APPROACHES, AND USE CASES
STUNNING FACT Making the Modern World: Materials and Dematerialization - Vaclav Smil
Trends in Platforms
  COST PER TERABYTE (chart, axis $0 to $100 thousand): Hadoop, Microsoft PDW, Oracle, Greenplum, Teradata, Vertica
  COST PER GIGABYTE (chart, axis $0 to $20): 2009 vs. today
COST OF STORAGE, MEMORY, COMPUTING
  In 2000 a GB of disk cost $17; today < $0.07
  In 2000 a GB of RAM cost $1,800; today < $1
  In 2009 a TB of RDBMS cost $70K; today < $20K
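The price points above can be turned into decline factors with simple division. The sketch below is a minimal illustration in Python (not part of the original deck); the dictionary keys and the function name are hypothetical, and because the "today" prices are quoted as upper bounds, each factor is only a lower bound.

```python
# Prices quoted on the slide, in US dollars. The "today" values are upper
# bounds ("< $0.07"), so each computed factor is a lower bound on the decline.
PRICES = {
    "GB of disk (2000 vs. today)": (17.0, 0.07),
    "GB of RAM (2000 vs. today)": (1800.0, 1.0),
    "TB of RDBMS (2009 vs. today)": (70_000.0, 20_000.0),
}


def decline_factor(then, now):
    """Return how many times cheaper the item became."""
    return then / now


for item, (then, now) in PRICES.items():
    print(f"{item}: at least {decline_factor(then, now):,.1f}x cheaper")
```

Disk and RAM fell by two to three orders of magnitude, while the RDBMS price per terabyte fell by a factor of about 3.5 over a shorter period.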
Shift in Mindset
  Scarcity
    Technology constrained
    Process-centric
    Focus on cost control
    Everything is forbidden unless it is permitted
  Abundance
    Focus on value
    Discovery-centric
    Technology empowered
    Everything is permitted unless it is forbidden
ADVANCED ANALYTICS
  TEXT ANALYTICS: Finding treasures in unstructured data, such as social media or survey tools, that could uncover insights about consumer sentiment
  FORECASTING: Leveraging historical data to drive better insight into decision-making for the future
  INFORMATION MANAGEMENT
  OPTIMIZATION: Analyzing massive amounts of data to accurately identify the areas likely to produce the most profitable results
  DATA MINING: Mining transaction databases for spending patterns that indicate a stolen card
  STATISTICS
Copyright 2011, SAS Institute Inc. All rights reserved.
CURRENT TRENDS IN ANALYTICS
  Complex Business Problems Are Driving Analytics Innovation
  Speed Will Be Of The Essence
  Leverage Analytics To Unlock The Information Contained In Unstructured Data
  Operationalizing Analytics
VALUE OF BIG DATA
  Quadrant axes: Value of Data (Unknown / Known) and Questions (We Know / We Don't Know)
  Value of Data Unknown: "How should we adjust?" "What's the cause?"
  Value of Data Known: "Sales are down!" "Why?"
VALUE OF BIG DATA
  Quadrant axes: Value of Data (Unknown / Known) and Questions (We Know / We Don't Know)
  Labels on the quadrant: Data Scientist, Statistician, Manager, Business Analyst, EDW
Copyright 2013, SAS Institute Inc. All rights reserved.
SAS ON HADOOP
SAS BIG DATA STRATEGY SAS AREAS
SAS WITHIN THE HADOOP ECOSYSTEM
  User Interface: SAS Enterprise Guide, SAS Data Integration, SAS Enterprise Miner, SAS Visual Analytics, SAS In-Memory Statistics for Hadoop
  SAS User / Next-Gen SAS User
  Metadata: SAS Metadata
  Data Access: Base SAS & SAS/ACCESS to Hadoop, SAS Access to Impala, In-Memory Data Access
  Data Processing: Pig, Hive, MapReduce, SAS Embedded Process Accelerators, Impala, SAS LASR Analytic Server, SAS High-Performance Analytic Procedures, MPI Based
  File System: HDFS
SAS ACCESS TO HADOOP SAS SERVER Hive QL HADOOP
SAS/ACCESS TO CLOUDERA IMPALA
  General-purpose SQL query engine:
    should work both for analytical and transactional workloads
    will support queries that take from milliseconds to hours
    low latency response, 10-100x faster than Hive
  Runs directly within Hadoop:
    deploy on existing Hadoop clusters
    reads widely used Hadoop file formats
    talks to widely used Hadoop storage managers
    runs on same nodes that run Hadoop processes
  High performance:
    C++ instead of Java
    runtime code generation
    completely new execution engine that does not build on MapReduce
SAS / EMBEDDED PROCESS
  SAS SERVER: SAS Data Step & DS2
  HADOOP:
    SAS/Scoring Accelerator for Hadoop
    SAS/Code Accelerator for Hadoop (2014 Q3)
    SAS/Data Quality Accelerator for Hadoop (2014 Q3)
    SAS/Data Director* (Name TBD 2014 Q3)

    proc ds2;
      /* thread ~ equivalent to a mapper */
      thread map_program;
        method run();
          set dbmslib.intab;
          /* program statements */
        end;
      endthread;
      run;

      /* program wrapper */
      data hdf.data_reduced;
        dcl thread map_program map_pgm;
        method run();
          set from map_pgm threads=n;
          /* reduce steps */
        end;
      enddata;
      run;
    quit;
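The DS2 program above follows a map/reduce shape: the thread program runs the same logic over portions of the input, and the data program reads from the threads and combines their results. A rough Python analogy of that shape is sketched below. It is not SAS code and does not model how the Embedded Process distributes work inside Hadoop; the function names, the round-robin split, and the mean calculation are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def map_program(rows):
    """Per-thread work: a partial sum and row count for one slice of the input."""
    return sum(rows), len(rows)


def run_reduced(table, threads=4):
    """Run map_program on `threads` slices, then combine the partial results."""
    # Split the input into one slice per thread (round-robin).
    slices = [table[i::threads] for i in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        partials = list(pool.map(map_program, slices))
    # Reduce step: merge the partial sums and counts into a single mean.
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count if count else None


print(run_reduced(list(range(1, 101)), threads=4))
```

The `threads` argument plays the role of `threads=n` in the DS2 wrapper: it controls how many copies of the per-slice logic run before the results are merged.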
SAS DATA LOADER
What directive do you want to perform?
  Saved Directives: Open a previously created directive to run, view, or edit.
  Schedule a Directive to Run: Schedule a directive to run at specified dates and times.
  Chain Directives Together: Run a number of directives in a specific order.
  Copy Data for Visualization: Copy data from Hadoop and load it into LASR for visualization. Existing data in the target table will be replaced.
  Copy Data to Hadoop: Copy data from a source and load it into Hadoop. Existing data in the target file will be replaced.
  Join Tables in Hadoop: Create a table in Hadoop from multiple tables.
  Pivot a Table in Hadoop: Transpose the columns of a table in Hadoop.
  Transform Data in Hadoop: Transform the data in a Hadoop data file.
  Verify Mailing Address: Check the validity of the mailing address data in a table.
  Profile Data: Create a report profiling the data in a table.
  Generate Business Rules: Analyze data in a table and generate business rules.
  Send Data for Remediation: Select data to send to the remediation queue for further action.
SAS / HIGH PERFORMANCE ANALYTICS
  SAS SERVER, SAS HPA Procedures, HADOOP
  SAS High-Performance Statistics
  SAS High-Performance Data Mining
  SAS High-Performance Text Mining
  SAS High-Performance Econometrics
  SAS High-Performance Forecasting
  SAS High-Performance Optimization
SAS / HIGH PERFORMANCE ANALYTICS
  Phases: Prepare, Explore / Transform, Model
  Procedures: HPDS2, HPDMDB, HPSAMPLE, HPSUMMARY, HPCORR, HPREDUCE, HPIMPUTE, HPBIN, HPLOGISTIC, HPREG, HPNEURAL, HPNLIN, HPCOUNTREG, HPMIXED, HPSEVERITY, HPFOREST, HPSVM, HPDECIDE, HPQLIM, HPLSO, HPSPLIT, HPTMINE, HPTMSCORE
IN-MEMORY (LASR BASED) SOLUTIONS ON HADOOP
SAS ANALYTIC HADOOP ENVIRONMENT
  WEB CLIENTS / APPLICATIONS (SAS IN-MEMORY): Data Director*, Visual Analytics, Visual Statistics, In-Memory Statistics, Visual Scenario Designer
  SAS LASR ANALYTIC SERVER
  HADOOP
  Data shown: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social
SAS VISUAL ANALYTICS
  Interactive exploration, dashboards and reporting
  Auto-charting automatically picks the best graph
  Forecasting, scenario analysis, decision trees and other analytic visualizations
  Text analysis and content categorization
  Feature-rich mobile apps for iPad and Android
SAS VISUAL STATISTICS
  Interactive, visual application for statistical modeling and classification
  Multiple methods: Logistic, Regression, GLM, Trees, Forest, Clustering and more
  Model comparison and assessment
  Group-by processing
SUMMARY
SAS ON HADOOP OPTIONS
  SAS Access for Hadoop
  SAS Accelerators (Scoring, Code, Data Quality)
  High Performance Analytics
  Visual Analytics
  Visual Statistics
  In Memory Statistics for Hadoop (coding for Data Scientist)
USE CASES
A LARGE CANADIAN BANK
  ~20 million customers
  ~50 countries
  ~85,000 employees
  Customer Pain: Building good models takes too long!
ITERATIVE APPROACH TUNING MULTIPLE CUSTOMER STATE DATA MARTS Multiple Iterations Segmented Models Opportunity Acquisition Baseline models
MODELING RESULTS
  Projected profit increase to Client (Cumulative): $6 million
  POC objective was to build 3 models; SAS modelers built 10 models
  POC validated and monetized the business impact of High Performance Data Mining and SAS Data Management
    Better results
    Increased productivity
  Customer proceeds with HPA and Data Management
LARGE TELECOMMUNICATIONS COMPANY (AP REGION)
  Wireless Group
    70+ million subscribers, of which 50+ million are active
    98% are pre-paid; the rest are post-paid
    With at least 20 SMS per day per active subscriber, more than 1B SMS are processed daily
  Wireline and Broadband
    ~2 million subscribers for residential/individual lines
    ~200,000 for commercial business
  Customer Pain Point
    Volume of data and complexity of requirements have outgrown the legacy infrastructure
    Processing time limits creativity and fine-grained campaigns
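The daily SMS volume follows directly from the subscriber figures. A quick arithmetic check in Python (an illustration, not part of the original deck) uses the lower bounds quoted on the slide; the constant and function names are hypothetical, and the per-second rate assumes traffic is spread evenly over the day, which real traffic is not.

```python
ACTIVE_SUBSCRIBERS = 50_000_000      # "50+ mil are active" (lower bound)
MIN_SMS_PER_SUBSCRIBER_PER_DAY = 20  # "at least 20 SMS/Day/Active Subscriber"
SECONDS_PER_DAY = 24 * 60 * 60


def daily_sms(subscribers, sms_per_subscriber):
    """Lower bound on the number of SMS processed per day."""
    return subscribers * sms_per_subscriber


def average_rate_per_second(messages_per_day):
    """Average rate, assuming an even spread across the day."""
    return messages_per_day / SECONDS_PER_DAY


volume = daily_sms(ACTIVE_SUBSCRIBERS, MIN_SMS_PER_SUBSCRIBER_PER_DAY)
print(f"{volume:,} SMS per day, about {average_rate_per_second(volume):,.0f} per second")
```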
BAU TO SAS HPA PLANNING
HIGH PERFORMANCE ANALYTICS
  Diagram labels: Data Sources, Grid Computing, In-Memory Analytics; SAS 9.3, SAS 9.4, SAS 9.4; SAS DI-EG, SAS CM, SAS HPAS/EM, SAS VA; EXADATA, Other sources, High Performance Switch, Legacy sources; SAS Datasets, RDBMS, SAS Datasets; SAN STORAGE
32 X SERVERS
  Configuration                        Workflow Step                CPU Runtime    Ratio
  Client, 24 cores                     Explore (100K)               00:01:07:17      4.2
                                       Partition                    00:07:54:04     19.5
                                       Impute                       00:01:19:84      7.7
                                       Transform                    00:09:45:01     13.2
                                       Logistic Regression (Step)   04:09:21:61    131.5
                                       Total                        04:29:27:67    106.1
  HPA Appliance, 32 x 24 = 768 cores   Explore                      00:00:15:81
                                       Partition                    00:00:21:52
                                       Impute                       00:00:21:47
                                       Transform                    00:00:44:28
                                       Logistic Regression          00:01:37:99
                                       Total                        00:02:21:07
  Acceleration by factor 106!
32 X SERVERS
  Configuration                        Workflow Step   CPU Runtime    Ratio
  Client, 24 cores                     Explore         00:01:07:17      4.2
                                       Partition       01:01:09:31    170.5
                                       Impute          00:02:45:81      7.7
                                       Transform       01:26:06:22    116.7
                                       Neural Net      18:21:28:54    478.9
                                       Total           20:52:37:05    313
  HPA Appliance, 32 x 24 = 768 cores   Explore         00:00:15:81
                                       Partition       00:00:21:52
                                       Impute          00:00:21:47
                                       Transform       00:00:44:28
                                       Neural Net      00:02:17:40
                                       Total           00:04:00:48
  Acceleration by factor 322!
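The Ratio column can be recomputed from the runtimes. The Python sketch below (an illustration, not part of the original deck) assumes the runtime format is hours:minutes:seconds:hundredths, which the slide does not state; the function names are hypothetical. It divides each client runtime by the matching HPA appliance runtime and derives the overall ratio from the summed runtimes.

```python
def to_centiseconds(stamp):
    """Parse 'hh:mm:ss:cc' (assumed format) into hundredths of a second."""
    hours, minutes, seconds, hundredths = (int(part) for part in stamp.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 100 + hundredths


# (step, client runtime, HPA appliance runtime), copied from the neural net table
RUNS = [
    ("Explore", "00:01:07:17", "00:00:15:81"),
    ("Partition", "01:01:09:31", "00:00:21:52"),
    ("Impute", "00:02:45:81", "00:00:21:47"),
    ("Transform", "01:26:06:22", "00:00:44:28"),
    ("Neural Net", "18:21:28:54", "00:02:17:40"),
]


def speedups(runs):
    """Return per-step ratios plus an overall ratio from the summed runtimes."""
    ratios = {
        step: to_centiseconds(client) / to_centiseconds(hpa)
        for step, client, hpa in runs
    }
    total_client = sum(to_centiseconds(client) for _, client, _ in runs)
    total_hpa = sum(to_centiseconds(hpa) for _, _, hpa in runs)
    ratios["Total"] = total_client / total_hpa
    return ratios


for step, ratio in speedups(RUNS).items():
    print(f"{step:<11}{ratio:8.1f}")
```

Under that format assumption, the summed runtimes give an overall ratio of about 312.5, in line with the 313 shown in the Total row; the per-step values can be compared against the Ratio column in the same way.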
THANK YOU sas.com