Big Data, Start Small! Dr. Frank Säuberlich, Director Advanced Analytics (Teradata International), 26th May 2015
Agenda: Introduction; Big Data and the Emergence of the Logical Data Warehouse Architecture; Starting Small: Example Deployments; Lessons Learned; Summary & Conclusions
Image source: Steven Dyer on www.photoshopcreative.co.uk
It's a Data Revolution: Human Generated, Business Generated, Machine Generated, Interaction Generated
The data-driven economy is emerging: the Internet Economy of the 1990s is giving way to today's Data Economy
The data-driven business puts data at the center
Data rich, insight poor
Required Capabilities: the Logical Data Warehouse, a.k.a. the UDA (Unified Data Architecture)
Analysts agree the Logical Data Warehouse is the future of Enterprise Analytical Architecture, even if they can't agree what to call it. Gartner calls it the "Logical Data Warehouse"; Forrester calls it the "Enterprise Data Hub": "We will abandon the old models based on the desire to implement for high-value analytic applications. Raw data in an affordable distributed data hub. Firms that get this concept realise all data does not need first-class seating."
Big Data, Starting Small: Example Deployments. Advanced Telco Churn Analysis; Portfolio Optimization; Predictive Maintenance. A common misconception is that you can't start a Big Data project until you have invested tens of millions of dollars in a fully-integrated Logical Data Warehouse, complete with a petabyte-scale Hadoop cluster. In the remainder of this presentation we will demonstrate that this is not the case, and identify key lessons from customers who have started small with Big Data.
Big Data: small. Telecommunications: Advanced Churn Analysis. An Asian mobile telecommunications operator with more than 6M subscribers. Network and customer service problems had tarnished the company's reputation. There was an urgent requirement to supplement the analytics provided by the existing CDR Data Warehouse with analytics that directly measured network performance and its impact on customer churn.
Big Data: small. The path to churn is traced through call drop-outs; data drop-outs in web (PDP) sessions; level of call quality (voice and data speed); 3G-to-2G drop-down and length of time on 2G; and sentiment analysis from call-centre records (Didata). These signals feed a customer experience score and a propensity-to-churn model. Key points: the network data schema is evolving rapidly, so a flexible information model is critical. Sessionization is a pre-packaged SQL-MapReduce function that identifies sessions from time-series data in a single pass over the data; nPath is a pre-packaged SQL-MapReduce function for finding sequences of events. Network data is first correlated in Aster and then stored in Hadoop, to optimise retention costs; integration with the EDW provides customer reference / profitability data.
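Sessionization here is an Aster SQL-MapReduce function; purely as an illustration of the underlying single-pass idea, a minimal Python sketch (hypothetical subscriber/timestamp fields and an assumed 30-minute inactivity timeout) might look like:

```python
from datetime import datetime, timedelta

def sessionize(events, timeout=timedelta(minutes=30)):
    """Assign a session id to each (user, timestamp) event in a single
    pass, starting a new session whenever the user changes or the gap
    since the previous event exceeds the timeout.
    Events must already be sorted by (user, timestamp)."""
    session_id = 0
    last_user, last_ts = None, None
    out = []
    for user, ts in events:
        if user != last_user or (ts - last_ts) > timeout:
            session_id += 1          # new session boundary detected
        out.append((user, ts, session_id))
        last_user, last_ts = user, ts
    return out

# Hypothetical event stream (subscriber id, event timestamp)
events = [
    ("A", datetime(2015, 5, 26, 9, 0)),
    ("A", datetime(2015, 5, 26, 9, 10)),   # 10 min gap -> same session
    ("A", datetime(2015, 5, 26, 11, 0)),   # > 30 min gap -> new session
    ("B", datetime(2015, 5, 26, 9, 5)),    # new user -> new session
]
print(sessionize(events))
```

The single pass matters at CDR scale: no self-joins or window re-scans, just one ordered sweep, which is exactly what the pre-packaged function exploits.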
Big Data: small. Project Delivery: 3 phases, each of circa 3 months. Project Effort/Investment: H/W & S/W, plus 8 PS resources. Business Value: the operator is able to understand network performance from a customer perspective for the first time, leading to improved customer service, reduced churn, and reduced revenue leakage from false complaints.
Big Data: smaller. Banking: Portfolio Optimization. A large retail bank had been forced to foreclose on a great many residential mortgages, and as a result had a very large property portfolio to dispose of. It needed to understand market conditions and competitor pricing much better in order to ensure a rapid, but orderly and efficient, disposal of these assets.
Big Data: smaller. Project Delivery: an initial discovery PoC demonstrating the key concepts was delivered in 3 weeks by a small team of key Teradata and bank staff. Project Effort/Investment: H/W & S/W, plus 3 PS resources. Business Value: the first use case alone has an estimated $2M impact on bottom-line profitability.
Big Data: smallest. Manufacturing: Predictive Maintenance for Trains. A large European train operator wanted to leverage engine sensor data to predict train failures. The project started with a small training set: roughly one million sensor-log observations and several thousand engineers' reports describing each failure and its fix, together with the preparation of these data.
Exploring the data using path and graph analytics. Affinity graph: which components fail in combination (within the same train)? This identifies candidates for failure prediction. Sankey diagram: exploring the path to failure (testing different categorizations of sensor readings as events).
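The affinity graph amounts to counting, for every pair of components, the number of trains on which both failed; those counts become the edge weights. A toy Python sketch (component names and data are hypothetical, not the operator's):

```python
from itertools import combinations
from collections import Counter

def co_failure_counts(failures_by_train):
    """Edge weights for a component-affinity graph: for each unordered
    pair of components, the number of trains on which both failed."""
    edges = Counter()
    for components in failures_by_train.values():
        # de-duplicate repeat failures of the same component on one train
        for a, b in combinations(sorted(set(components)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical failure logs keyed by train
failures = {
    "train_1": ["gearbox", "coolant_pump", "gearbox"],
    "train_2": ["gearbox", "coolant_pump"],
    "train_3": ["coolant_pump", "injector"],
}
print(co_failure_counts(failures))
```

Pairs with unusually high weights are the "fail in combination" candidates the slide refers to, worth prioritising for failure prediction.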
Using our understanding to build a predictive model: having profiled the predictive variables in this way, we built a decision tree to predict engine failures. Splits include gear power output low (daily percentage), coolant temperature high (daily percentage), gear oil temperature high (daily percentage) and engine temperature high (daily percentage); leaf failure rates range from 0.00% to 100.00% against a root-node rate of 3.55%.
Confusion matrices (rows: actual; columns: predicted):
Training data set: no failure -> 99% no failure, 1% failure; failure -> 13% no failure, 87% failure.
Test (holdout) data set: no failure -> 99% no failure, 1% failure; failure -> 16% no failure, 84% failure.
The model is of high quality: a high degree of predictive accuracy, and very similar results on the training and test (holdout) data sets (no overfitting).
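The row-normalised confusion matrices on this slide can be computed from parallel lists of actual and predicted labels; a small Python sketch with made-up labels (not the operator's data, and much smaller than the real holdout set):

```python
def confusion_rates(actual, predicted, classes=("no failure", "failure")):
    """Row-normalised confusion matrix: for each actual class, the
    fraction of its observations predicted as each class."""
    rates = {}
    for a in classes:
        n = sum(1 for x in actual if x == a)
        rates[a] = {
            p: sum(1 for x, y in zip(actual, predicted)
                   if x == a and y == p) / n
            for p in classes
        }
    return rates

# Hypothetical labels: 8 healthy observations, 4 failures
actual    = ["no failure"] * 8 + ["failure"] * 4
predicted = ["no failure"] * 7 + ["failure"] + ["failure"] * 3 + ["no failure"]

m = confusion_rates(actual, predicted)
print(m["failure"]["failure"])     # recall on the failure class
```

Comparing the "failure" row between training and holdout data, as the slide does, is the quick check for overfitting: a large drop from training to holdout recall would indicate the tree had memorised the training set.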
Big Data: smallest. Project Delivery: a first-cut model was delivered on a PoC basis in only 2 weeks. Project Effort/Investment: no up-front investment in H/W & S/W (PoC); 2 PS resources. Business Value: improved availability through a significant reduction in unplanned downtime; reduced labour costs (quicker root-cause analysis, improved first-time-fix rate, etc.); improved utilisation (more mileage from the same trains).
Lessons learned from early deployments
#1 By themselves, a big bucket of data and some fancy analytic technology add no value; start with a business problem, not with a technology (ours or anybody else's).
#2 New Big Analytics is often additive: in many cases, Big Analytics extends and enhances existing analyses and business processes rather than replacing them.
#3 Old business process + expensive new technology = expensive old business process. The objective is not merely to gain insight; the objective is to operationalise that insight so that we change the way we do business.
#4 The time-consuming and expensive part of a traditional Business Intelligence & Analytics project is data integration; maybe we just shouldn't bother?
#5 The failure rate for analytic Exploration & Discovery is high, so cycle times are critical.
Summary & Conclusions
The Logical Data Warehouse is the industry's adaptation to Big Data. How will you deploy? How many, and which, platforms will you need? How will you integrate them? And which data need to be centralised and integrated? Five forces drive the shift from the Enterprise Data Warehouse era ("give me integrated, high-quality data") to the Logical Data Warehouse (a.k.a. Unified Data Architecture) era: (1) multi-structured data; (2) interaction / observation analytics; (3) flat or falling IT budgets against exploding data volumes; (4) agile Exploration & Discovery; (5) operationalisation. Centralise and integrate the data that are widely reused and shared, but integrate all of the analytics.
But equally, don't wait until you have deployed a full Logical Data Warehouse to start your Big Data journey: Exploration & Discovery technology and processes can deliver value for you now, and inform how you build out your Logical Data Warehouse.
Thank you very much! Frank Säuberlich frank.saeuberlich@teradata.com