Delivering Value with Big Data
Copyright 2014 World Wide Technology, Inc. All rights reserved.
WWT Big Data Leadership Team

James Bigger, Principal Consultant: 20 years of management consulting and entrepreneurial experience. Expertise in financial services, insurance and telecom. Prior consulting experience with Opera Solutions and A. T. Kearney. Ph.D. in Physics from Oxford University.

Brian Vaughan, Principal Consultant: 15 years of management consulting, analytics and software experience. Expertise in healthcare and insurance. Prior experience with Opera Solutions, Mitchell Madison Group and Broadlane. Ph.D. in Physics from Stanford University.

Chris Ward, Principal Consultant: 20 years in management consulting and executive leadership. Expertise in retail, marketing, hospitality and financial services. Prior consulting experience with Opera Solutions and The Boston Consulting Group. BA from Princeton University, MBA from the University of Virginia.

Jason Lu, Chief Scientist: 18 years of analytics and software development experience. Expertise in financial services, healthcare, insurance, retail and marketing science. Prior analytics development experience at Opera Solutions, FICO and J.D. Power and Associates. Ph.D. in Physics from Stanford University.

Matt DuBell, Principal Systems Engineer: Over 20 years of experience in a range of IT and security disciplines. Responsible for deploying large, secure, Hadoop-based platforms for the U.S. Government. 10 years of international experience implementing networking and virtual data center environments. Undergraduate degree from AIU.

Prem Jain, Principal Architect: Over 20 years of experience in the enterprise datacenter, building innovative solutions in Big Data, storage, HPC, virtualization, data migration and enterprise applications. Formerly lead architect for NetApp's Big Data solutions, and led the development of the FlexPod Select solutions. B.S. in Electrical Engineering.

Mike McGlynn, VP Emerging Technologies: 25 years of government service at the National Security Agency. At the NSA Mike led the design and development of next-generation cyber systems: real-time systems, situational awareness tools, and command and control capabilities. M.S. in Computer Science from Johns Hopkins. B.S. in Mathematics.

Yoni Malchi, Engagement Manager: Over 7 years of experience in management and analytics consulting. Led engagements in telecom at Opera Solutions. Previous experience performing predictive analytics for NASA and USAF at The Aerospace Corporation. Ph.D. in Mechanical Engineering from Pennsylvania State University.

Chris Infanti, Engagement Manager: Over 8 years of experience in analytics consulting and delivery management. Ran engagements in wealth management, corporate security, marketing, education and transportation at Opera Solutions and IBM Global Business Services. BS in Mathematics from Georgetown University.

Jamie Milne, Engagement Manager: Over 7 years of management consulting and entrepreneurial experience. Expertise in financial services, travel, and retail sectors across the US and Europe. Led Big Data strategy and analytical engagements at Opera Solutions. MSci in Astrophysics from the University of Cambridge.
Big Data Capabilities
Big Data projects operate at the intersection of business, science, and technology.

Business: highlights areas of high opportunity; drives focus on value creation.
Data science: solves business problems; proves solutions based on empirical evidence. (Slide graphic: the Fourier series $f(x) = a_0 + \sum_{n=1}^{\infty} \left( a_n \cos\frac{n\pi x}{L} + b_n \sin\frac{n\pi x}{L} \right)$.)
Technology: captures and stores data on the business; facilitates the operation of data science.
The Big Data Software Stack
The big data ecosystem includes open source and proprietary distributions that span the stack from ingest through analytics.

User/machine workflow: Acquire, Organize, Analyze, Decide.
Stack layers: Ingest; File system/database; Management; Transform; Analytics database; Access/queries; Data analytics.
For each layer the slide lists properties, options, examples of products, and integrated offerings.

Properties and options: real time and batch; optimized for high-volume reads; flexible, compressed, fast read; fast, scalable; OLAP; natural language; custom analytics; custom APIs; SQL; columnar; in-memory; parallel RDBMS; provisioning; maintenance; HDFS; parallel, NoSQL; distributed (document, key-value, wide column); flexible interfaces to accept batch and streaming data.
Examples of products: MicroStrategy, Business Objects, Cognos, Oracle OBIEE, SAS, SPSS, Splunk, R, Python, SQL, Pig, Hive, MapReduce, Teradata, Netezza, Greenplum, Vertica, ZooKeeper, Oozie (job flow), Hadoop, Cassandra, HBase, MongoDB, Cloudera, Hortonworks, MapR, PivotalHD, Talend, Sqoop, Flume.
Integrated offerings: EMC/Pivotal HD/Greenplum; HP/Vertica/Cloudera; Oracle Big Data Exadata/Exalytics; IBM InfoSphere BigInsights; SAP HANA; Terracotta BigMemory; plus open source solutions.
Product legend: commercial, open source.
Data source categories: enterprise structured, enterprise unstructured, third party, web/unstructured. Examples: ODS, data warehouse, call center, server logs, financial, demographic.
Dual Approach to Delivering Big Data Solutions
WWT offers customers both strategic and tactical approaches to derive value from the application of Big Data analytics and technology.

Business impact: extract value from data to drive multiple use cases.
  Consulting services: Big Data Strategy; Big Data POCs; Big Data Sustainment.
Technology optimization: accomplish data tasks faster, cheaper, better.
  Offerings: Data Warehouse Optimization; SAP HANA Implementation.
Defining the Opportunity Is the Starting Point
The power of Big Data lies in bringing together data, structured and unstructured, from sources within and external to the enterprise in a timely fashion to create a complete view of critical issues, enabling advanced analytics to unlock key insights that drive significant value.

Outcome: clearly defined use cases with the potential to deliver significant value by distilling vast data into new, previously unknowable intelligence.
Analytics: advanced machine learning techniques to analyze data and mine for insights to drive critical decisions.
Data: structured or unstructured, internal or external, requiring new methods of storage and integration.
Technology: emerging and new technology stacks using scalable, distributed architectures.
Case Study: City of Dallas
Big Data Initiative - Overview

Objective: Formulate a Big Data strategy for the City of Dallas, assessing potential opportunities and creating an implementation roadmap to capture them.

Scope: Dallas Police Department; Court and Detention Services; Dallas Water Utilities; Code Compliance; Office of Financial Services; Human Resources; Department of Public Works; Sustainable Development & Construction Services; Dallas Fire-Rescue; Equipment & Building Services; Streets.

Key deliverables:

Data Environment Assessment
  View of the current data environment at the source level, including volume of data.
  Summary of current data challenges, both organization-wide and by department where applicable.
  High-level summary of external data sources.

Big Data Needs Document
  Definition of Big Data in the City of Dallas context.
  Detailed documentation of 30+ use cases, outlining required data sources, data sharing needs, and a complexity/value breakdown.
  Key dependencies and considerations for each use case.

Data Management Strategy
  Documentation of key strategic short-term and long-term objectives in the areas of infrastructure technology and data management.
  High-level roadmap that addresses data quality, governance, operations, and security, taking key objectives into account.
  Proposed Big Data roles and responsibilities.

Big Data Roadmap
  Use cases organized in a roadmap timeline with clearly outlined data and technology architecture dependencies, and documentation of the criteria used to prioritize use cases.
  Data management strategy timeline that shows key milestones for both data management policy enactment and organizational changes.
  Approach to deploying the capabilities needed to implement the roadmap.
Case Study: City of Dallas
Property 360

Description: Create a 360 degree view of a property using data from multiple departments, raising the effectiveness and awareness of Code Compliance inspectors.

Approach

Integrate data sources. Join information on a property from multiple data sources across departments, including:
  SDC Posse for building permit and owner information
  DWU SAP Billing for current tenant information
  Code CRMS for Code inspection history
  DFR for fire inspection and incident history
  DPD RMS for police incident history
  Third-party information on area demographics

Make information available to inspectors. Make data available to inspectors in the field in ways that will impact their operational effectiveness:
  Create a simple, mobile device-accessible visualization of data for a given address.
  Basic information on building owner and occupancy history, to decrease time spent looking up tenant information.
  Timeline of building inspection history, to avoid repeat inspections and gain intelligence from other departments.
  DPD and demographics data incorporated to increase safety.

Evaluation

Strategic alignment: useful for many other departments, including DPD, DFR, DWU, SDC.
Considerations: field mobility will increase the effectiveness of the program; security is important, especially with DPD data.
Risks: ability to identify keys to join data across multiple data sets.
Dependencies: consolidation of data from multiple departments; visualization (preferably mobile).
Impact and complexity: increases the efficiency and safety of Code Compliance inspectors by making all property data available. The data set created here will be useful to other departments, and will be a foundation for other use cases.
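The risk named on this slide is finding a key that joins records across departments. The following is a minimal sketch of one possible approach, assuming that a street address is the only field the sources share: normalize each address to a canonical string and group records under it. The sample records, field names, source names, and normalization rules are all hypothetical and are not taken from the City of Dallas systems; a production system would more likely rely on a parcel identifier or a geocoding service.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; real address data needs far more rules.
_ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd",
                  "north": "n", "south": "s"}


def normalize_address(raw):
    """Lower-case, replace punctuation with spaces, abbreviate street words."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(_ABBREVIATIONS.get(tok, tok) for tok in tokens)


def build_property_view(sources):
    """Group records from several departments under one normalized address.

    sources maps a source name to a list of dicts, each with an "address" key.
    Returns {normalized_address: {source_name: [records, ...]}}.
    """
    view = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            key = normalize_address(record["address"])
            view[key][source_name].append(record)
    return view


# Hypothetical sample data, invented for illustration only.
sources = {
    "permits": [{"address": "123 North Main Street", "owner": "Owner A"}],
    "code_inspections": [{"address": "123 N. Main St", "result": "passed"}],
    "fire_inspections": [{"address": "123 n main st.", "result": "no violations"}],
}
view = build_property_view(sources)
record_360 = view["123 n main st"]
```

Here `record_360` holds the permit, code inspection, and fire inspection records for one property even though each source spelled the address differently. Rule-based normalization of this kind is brittle, which is why the slide lists key identification as a risk.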
Case Study: Casino
Visibility Into a Customer's Journey
The ability to combine and analyse multiple data sources very rapidly to understand the hidden drivers of individual customer behaviour enables the changing of long-term behaviour through personalized curricula and aspirational treatments.

Internal data
  Casino: ratings by game type (tables, slots, poker)
  Hotel: hotel reservations and transactions
  Marketing: offers and mails
  Customer: behavioral and profile indicators
External data
  Demographics: demographic appended data and customer geocoding

Longitudinal 360° customer view

Customer profile
  Customer: XXXXXXX
  Male, 52
  Resides in Orlando, FL
  2 trips in 2010 (table only, no slot play)
  Zip code annual household income: $100,000

Customer activity (Feb. 2011 to Jun. 2011)
  Feb. 1: $75 free slot play offer received by email. No response.
  Mar. 3: 2 free nights offer (Apr. 6-8) for 2 received by email.
  Apr. 6: Checked in at 2:15pm with his wife; ordered room service at 6pm; played tables and poker for 3 hrs 20 mins (ADT $345).
  Apr. 7: No play; ate at restaurant at 12:30pm.
  Apr. 8: Played tables for 2 hrs (ADT $180); joint account created; checked out at 11:40am.
Case Study: Consumer Electronics
Social Media Analytics
Social media tools typically focus on monitoring past and present activity. Predictive analytics allows users to identify important threads and intervene early, shifting the focus to future activity.

Word cloud shows ongoing buzz and sentiment.
Tabular view shows emerging themes and sentiment, virality score, and recommended time window for action.
Detail view shows particular themes or attributes.
Forecast view shows the trend and provides a mechanism to intervene in attributes that are going viral.
Case Study: Credit Card
First-Party Fraud
A Fortune 50 financial credit card issuer transformed its current approach to detecting bust-out fraud.

Background
  Bust-outs drove $350MM+ in losses annually.
  Over 90% of accounts were identified too late in the process to stop fraud, so scoring accounts in near-real time is an analytic and business necessity.
  [Figure: Current bust-out detection timeliness. Frequency distribution of detection timing in days relative to bust-out, from < -14 to > 5, split into "detection before bust-out" and "bust-out before detection" (91%).]

Approach & results
  Customer activity patterns were monitored on a daily basis to identify patterns predictive of bust-outs.
  A multitude of new metrics (e.g. transaction activity, payment activity and other variables) were defined and used in the detection algorithm.
  A new, neural-net-based predictive model significantly improved detection accuracy and detected bust-outs 5 days earlier.
  [Figure: Model lift curve. Bust-out capture rate versus population capture, both from 0% to 100%, for the neural network, the logistic model, the existing score, and random selection.]

Benefits from predictive model
  Reduce bust-out losses through:
  Predicting bust-out accounts earlier.
  Prioritizing predicted cases to increase the manual review hit rate and the number of bust-outs detected.

Impact              Old    New
Lead time (days)    -      5
Action rate (%)     7      25
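The lift curve on this slide plots the share of bust-outs captured against the share of the population reviewed, with accounts ranked by model score. The following is a minimal sketch of that calculation only. The scores and outcomes are made-up toy values, and the function name `capture_rate` is hypothetical; nothing here reproduces the issuer's model or data.

```python
def capture_rate(scores, is_bustout, review_fraction):
    """Share of all bust-outs found in the top review_fraction of accounts.

    scores: model scores, where higher means more likely to bust out.
    is_bustout: parallel list of booleans with the true outcome.
    review_fraction: fraction of the population reviewed (0 < f <= 1).
    """
    if len(scores) != len(is_bustout):
        raise ValueError("scores and is_bustout must have the same length")
    total_bustouts = sum(is_bustout)
    if total_bustouts == 0:
        return 0.0
    # Rank accounts from highest to lowest score.
    ranked = sorted(zip(scores, is_bustout), key=lambda pair: pair[0], reverse=True)
    n_reviewed = round(review_fraction * len(ranked))
    captured = sum(label for _, label in ranked[:n_reviewed])
    return captured / total_bustouts


# Toy data: 10 accounts, 2 of which are bust-outs.
scores = [0.95, 0.10, 0.80, 0.30, 0.20, 0.60, 0.05, 0.40, 0.15, 0.25]
labels = [True, False, False, False, False, True, False, False, False, False]

# One (population fraction, capture rate) point per review fraction.
lift_points = [(f, capture_rate(scores, labels, f)) for f in (0.1, 0.3, 0.5, 1.0)]
```

Plotting `lift_points` for several models on the same axes gives a chart of the kind shown on the slide. A model that ranks accounts no better than chance captures roughly the same fraction of bust-outs as the fraction of the population reviewed, which is the "Random" diagonal.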
Current Data Warehouse Environment

Components: source systems (operational systems such as ERP and CRM, third-party data, unstructured data); data preparation (ETL, e.g. Informatica); data warehouse (e.g. Teradata, Netezza; $17K/TB), holding hot data and cold data and running further data preparation (ELT); access layer (reporting & analytics).

1. Large amounts of unstructured data do not make it into the DW due to rigid schema.
2. Preparation of data for the warehouse discards potentially valuable data.
3. Additional preparation runs on the DW, increasing storage and decreasing performance.
4. Cold data dominates DW storage and is rarely accessed by end users.
Data Warehouse Optimization (DWO)

Components: source systems (operational systems such as ERP and CRM, third-party data, unstructured data); data preparation; Hadoop (~$2K/TB), holding all data; data warehouse (e.g. Teradata, Netezza; ~$17K/TB), holding hot data; access layer (reporting & analytics).

1. Unstructured data can now be loaded into Hadoop in native format.
2. Low-cost Hadoop environment enables retention of all source data for analysis.
3. Data warehouse performance increases and storage cost decreases.
4. Users can access Hadoop directly for some analytics and reports, further decreasing DW storage and processing requirements.
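The storage economics behind this slide can be illustrated with a small calculation. The sketch below uses the slide's approximate per-terabyte costs (~$17K/TB for the data warehouse, ~$2K/TB for Hadoop). The 100 TB warehouse size and the 70% cold-data share are hypothetical inputs chosen only for illustration; they are not figures from the deck, and the sketch ignores migration, licensing, and operating costs.

```python
DW_COST_PER_TB = 17_000      # approximate figure from the slide
HADOOP_COST_PER_TB = 2_000   # approximate figure from the slide


def storage_cost(total_tb, cold_fraction, offload_cold):
    """Return storage cost in dollars for total_tb terabytes of data.

    cold_fraction is the share of data that is rarely accessed (0.0 to 1.0).
    If offload_cold is True, cold data is priced at the Hadoop rate and only
    hot data stays in the warehouse; otherwise all data is priced at the
    warehouse rate.
    """
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    cold_tb = total_tb * cold_fraction
    hot_tb = total_tb - cold_tb
    if offload_cold:
        return hot_tb * DW_COST_PER_TB + cold_tb * HADOOP_COST_PER_TB
    return total_tb * DW_COST_PER_TB


# Hypothetical inputs: 100 TB warehouse, 70% of it cold data.
before = storage_cost(100, 0.7, offload_cold=False)
after = storage_cost(100, 0.7, offload_cold=True)
savings = before - after
```

Under these assumed inputs, storage cost falls from about $1.7M to about $0.65M. The result is sensitive to the cold-data share, so that fraction should be measured rather than assumed before sizing a real migration.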
Advanced Technology Center

Services: Demonstrations; Workshops; Hands-on Labs; Proofs of Concept; Advisory Services; Benchmarking.
Stages: Explore; Evaluate; Architect; Implement.

Network: Next Generation Networking; Nexus (7K, 5K, 3K & 2K); Virtual Networking (Nexus 1000v); OTV, LISP, Fabric Path; Layer 2 Extension; DR/BC Networking.
Security: Cybersecurity Solutions; BYOD (Bring Your Own Device); Secure Mobility Jukebox; ISE & RSA; ASA 1000v; VSG (Virtual Security Gateway).
Collaboration: Unified Communications (also on UCS); Tandberg Video; VXI (View and XenDesktop); WebEx, Call Center and Collaboration Solutions; Phones, Backpacks and Soft Phone Clients; TelePresence and Business Video Solutions.
Data center: Vblock, FlexPod and CloudSystem Matrix; EMC and NetApp Storage; vSphere / XenServer; vCloud Director; VDI (View / XenDesktop); Cisco CIAC and BMC CLM; EMC's UIM and Cloupia; FAST; MDC (Mobile Data Center) Solutions.
Big Data: Cisco UCS C220, C240; HP DL380, Nexus 2200, UCS 6296; FlexPod Select, Isilon storage; Cloudera, MapR, PivotalHD; Cloud Foundry; Velocidata Appliance; Next Generation provisioning tools.
Local, DAS, and NAS Infrastructures in the ATC

Reference Architecture 1: HP, internal local storage
Reference Architecture 2: UCS with NetApp direct attached storage
Reference Architecture 3: UCS with Isilon network storage
Reference Architecture 4: SAP HANA

Components by layer, across the reference architectures:
Visualization: Tableau
Analytics tools: Spark, R, MADlib, Python, SAP HANA
Streaming tools: Kafka, Spark, Storm, Trident
Analytics databases: Impala, Hive, HBase, HAWQ
File system/databases: Cloudera, Hortonworks, PivotalHD, MapR
Network: Nexus 2200, Nexus 2232PP, UCS 6296, UCS 6296UP
Compute: HP DL380, UCS-C220M3, UCS-C240, UCS-B440M2, UCS B blades
Storage: JBOD SATA, NetApp E5460, Isilon, Hitachi
Data source categories: enterprise structured, enterprise unstructured, third party, web/unstructured. Examples: ODS, data warehouse, call center, server logs, financial, demographic.
First Step: Big Data Workshop