Big Data 101 Webinar
A Functional Introduction

Today's Presenters:
  Paul S. Barth, PhD, Managing Partner
  Prithwi Thakuria, Big Data Practice Lead
NewVantage Partners
March 6, 2013

An Introduction

What is Big Data?
It is the next big frontier for innovation, productivity and competition.

[Diagram labels: Structured, Semi-Structured, Unstructured; Batch, Near Real Time, Streaming; Volume (Terabytes, Petabytes, Exabytes), Variety, Velocity, Value; Insights, Predictive, Machine Learning]

What problems will it solve?
  - Store and process large amounts of data for enterprise batch and near-real-time needs in a timely and acceptable manner.
  - Ingest and rapidly combine a variety of data arriving from different sources at different velocities, to create value such as enhanced analytics, deep insights, machine learning and prediction.
  - Significantly reduce the time, money and resources spent on enterprise infrastructure and business value generation.
What led us here?

[Chart. Volume scale: megabytes, gigabytes, terabytes, exabytes. Data types: structured, semi-structured, unstructured. Eras: the ERP Era, the CRM Era, the Web Era, the Big Data Era. Data labels: Mobile Web, Social Interactions, Social Media, A/B Testing, Blogs, Wikis, Behavioral Targeting, Data Warehouses, User Content, Speech to Text, Sensors & RFID, Spatial & GPS, Sentiment Analysis, SMS/MMS, Partner Feeds, Financials, Inventory, Trading, Web Logs, Segmentation, Dynamic Pricing, Offer Details, Click Stream Analysis, Search Marketing, Demographics]

Why is it possible now?
Big Data leverages the cost/performance of large server grids and open source software.

Cost per TB
  Database    $37,000
  Appliance    $5,000
  Hadoop       $2,000
Problems to Opportunities

The perspective has changed from a problem of storing, processing, retrieving and analyzing data to one of business opportunity. Firms are embracing new ideas and technologies that co-exist with existing investments.

[Diagram: OPPORTUNITY at the center, surrounded by five activities]
  STORE:        SAN, HDFS, HBase, Cassandra, Hadoop DB Store, Cloud
  COLLABORATE:  Jive, Chorus, Enterprise Portals
  PROCESS:      SQL, MapReduce, Pig, Hive, CloudBase
  ANALYZE:      OLAP, Text/Data Mining, Social/Semantic Analysis, Visualization, Reporting
  RETRIEVE:     SQL, MapReduce, Key-Value, RESTful

Creating Value in Enterprise

  - Accessibility to Data: Enhanced visibility of relevant information and better transparency into massive amounts of data. Improved reporting to stakeholders.
  - Decision Making: Next-generation analytics can enable automated decision making (inventory management, financial risk assessment, sensor data management, machinery tuning).
  - Marketing Trends: Segmentation of the population to customize offerings and marketing campaigns (consumer goods, retail, social, clinical data, etc.).
  - Performance Improvement: Exploring for and discovering new needs can drive organizations to fine-tune for optimal performance and efficiency (employee data).
  - Innovation: Discovery of trends will lead organizations to form new business models and adapt by creating new service offerings for their customers. Intermediary companies with big data expertise will provide analytics to third parties.
What are Businesses Doing with Big Data?

NVP Big Data Survey
  - Over 50 Fortune 500 executives and leaders
  - 70% are large financial services companies
  - 65 in-depth questions benchmarking investment levels, applications, organizational structure, and skills

Key Findings
  - Big Data Investments
      - 85% have big data programs planned or underway
      - 25% are spending over $10MM annually on big data
      - 36% expect to spend over $10MM in 3 years
  - The primary drivers are better analytics about Customers and Risk
  - Most companies are using big data to integrate existing corporate data from diverse sources, not external data or advanced analytics
  - 85% of the initiatives are sponsored jointly by the business and IT
  - All companies are struggling to attract, grow, and retain data scientists

Survey Participants
  Banking: Bank of America, CitiGroup, JP Morgan, RBS Citizens Financial, US Bank, Wells Fargo Bank
  Insurance / Health Care: Aetna, Broad Institute, Cigna, CVS/Caremark, The Hartford, Harvard Pilgrim Health Care, SunLife Financial, Travelers, United Healthcare
  Government: Department of Defense, General Services Administration, Department of Health and Human Services, Social Security Administration
  Investments: Bank of New York Mellon, Charles Schwab, Conning Asset Mgmt, Fidelity Investments, ING, Putnam Investments, State Street Bank, TD Ameritrade, TIAA-CREF, Wellington Financial
  Financial Services / Other: American Express, Freddie Mac, General Electric (GE), MasterCard, Pitney Bowes, Thomson Reuters, VISA, Wright Express
  Media and Technology: Avid Technology, Time Warner Cable

Survey results available at www.newvantage.com

Known Industry Adopters (Hadoop World 09)

  Organization             Use Case
  Visa                     Large scale transaction analysis
  JP Morgan Chase          Data processing for financial services
  China Mobile             Data mining platform for telecom industry
  Rackspace                Cross data center log processing
  eharmony                 Matchmaking in the Hadoop cloud
  General Sentiment        Understanding natural language
  Yahoo!                   Social graph analysis
  Visible Technologies     Real-time business intelligence
  Facebook                 Data warehouse with Hadoop and Hive
  Sears                    Mainframe Migration
  Crédit Mutuel Arkéa      Mainframe Migration
Case I: Customer Cross-channel Path Analysis

  - Load all customer activities on big data platform
      - Web, call, branch, marketing, service, transactions
  - Develop an event series for each customer over 6 months
  - Identify most common paths to sale, attrition, and outliers

  Platform     Cost      Data Loading   Analytics              Processing Time
  Big Data     < $1MM    2 days         25 lines of code       40 hours
  Relational   > $5MM    1 month        25,000 lines of SQL    2 months

Big data benefits
  - Organize the data as needed, after it is loaded
  - Event series data is non-relational but simple to program
  - Query and analysis are run in a single pass of the data

Case II: Operational Data for Risk Analytics

  - Load operational mainframe data files on big data platform
      - Nightly snapshots from 20+ systems: all fields, all records
  - Use standard ETL tools to select data for ad-hoc requests
  - Organize data into relational format for future reuse

  Platform         Cost       Data Loading   Data Delivery   Processing Time
  Big Data         < $1MM     2 days         1 week          2 hours
  Data Warehouse   > $10MM    12 months      1 week          16 hours

Big data benefits
  - All data is available for ad-hoc requests
  - Data is delivered to the business while the relational database is being built
  - Integration with ETL and relational tools leverages existing skills
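The event-series idea in Case I can be illustrated with a minimal Python sketch. This is not the code from the case study: the `(customer_id, timestamp, activity)` record layout, the function name `common_paths` and the outcome labels `"sale"` and `"attrition"` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def common_paths(events, outcomes=("sale", "attrition"), top=3):
    """Count the most common activity paths that end in an outcome.

    events: iterable of (customer_id, timestamp, activity) tuples
            (a hypothetical layout, not taken from the case study).
    """
    # One pass over the raw events builds an event series per customer.
    series = defaultdict(list)
    for customer, timestamp, activity in events:
        series[customer].append((timestamp, activity))

    paths = Counter()
    for items in series.values():
        path = []
        # Order each customer's events by time, then walk to the first outcome.
        for _, activity in sorted(items):
            path.append(activity)
            if activity in outcomes:
                paths[tuple(path)] += 1
                break
    return paths.most_common(top)


if __name__ == "__main__":
    sample = [
        ("c1", 3, "sale"), ("c1", 1, "web"), ("c1", 2, "call"),
        ("c2", 1, "web"), ("c2", 5, "call"), ("c2", 9, "sale"),
        ("c3", 2, "branch"), ("c3", 4, "attrition"),
    ]
    for path, count in common_paths(sample):
        print(count, " -> ".join(path))
```

Customers who never reach an outcome are simply left out of the counts in this sketch; a fuller analysis would also need a rule for outliers.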
So What is Hadoop?

Hadoop is a free, Java-based data management framework from Apache that supports the processing and computation of large data sets in a distributed computing environment. It allows organizations to capture, process and share data in any format and at any scale.

  - Operate: Manage, operate, gain insights and create analytics for innovation, productivity and competition.
  - Integrate: Open and promote the exchange and integration of data with new and existing enterprise applications.
  - Process: Process and manage data of any size with tools to sort, filter, summarize and apply basic functions to the data.
  - Capture: Ingest and store a variety of data in real time and batch.
  - Programming Model: Natively redundant and distributed programming model for large data sets.
  - Storage: Distributed, scalable, self-healing, high-bandwidth and portable file system that splits tasks across processors near the data.

Hadoop Capabilities

[Diagram. Capability groups: Capture, Integrate, Process, Core, Operate. Components: Templeton, WebHDFS, Sqoop, Flume, HCatalog, Pig, Hive, HBase, MapReduce, HDFS, Ambari, Oozie, Mahout, Zookeeper]

Hadoop is a collection of many open source and commercial packages that create a data and analytics ecosystem for applications. Thousands of programmers continue to develop new capabilities.
Understanding MapReduce

MapReduce is Google's programming paradigm, or framework, which represents an approach to handling data-intensive problems in a distributed manner.

Basic notion: A computation is applied against a large number of records or partitions, and intermediate results are generated (map function). Next, the intermediate results are aggregated in some fashion to produce the final outcome (reduce function).

[Diagram: Partitions 1-5 feed Map Tasks 1-3, which produce Intermediate results 1-3; these feed Reduce Tasks 1-2, which produce Results 1-2]

MapReduce can be successfully applied to the problem of scaling a software application across multicore processors and multiple-machine cluster infrastructure (Cloud).

MapReduce Example: Face Recognition

  - Spread 1 million image records across 100 servers
  - Map the matching program on 100 servers, each returning the top 10 matches
  - Reduce the 1,000 results to return the top 10 best matches

  Step (secs)     Server    Big Data Grid: 100 Servers
  Load Image           1                             1
  Scan Images      1,000                            10
  Match Images     1,000                            10
  Sort Top 10          1                             1
  Total            2,002                            22
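The face recognition example above can be sketched in a few lines of plain Python. This is a single-process illustration of the map and reduce steps, not Hadoop code: the function names, the round-robin partitioning and the toy scoring function are all assumptions. The `total_seconds` helper reproduces the timing table under the assumption that scanning and matching parallelize evenly across servers while loading the image and the final sort do not.

```python
import heapq


def map_top_matches(partition, score, k=10):
    """Map step: score every record in one partition and keep its k best."""
    return heapq.nlargest(k, ((score(rec), rec) for rec in partition))


def reduce_top_matches(intermediate, k=10):
    """Reduce step: merge the per-partition winners into one global top k."""
    merged = [pair for part in intermediate for pair in part]
    return heapq.nlargest(k, merged)


def run_job(records, score, servers=100, k=10):
    """Run the map step once per simulated server, then reduce the results."""
    # Round-robin slicing stands in for spreading records across servers.
    partitions = [records[i::servers] for i in range(servers)]
    intermediate = [map_top_matches(p, score, k) for p in partitions]
    return reduce_top_matches(intermediate, k)


def total_seconds(load, scan, match, sort, servers=1):
    """Elapsed time if only scan and match are divided across servers."""
    return load + scan / servers + match / servers + sort


if __name__ == "__main__":
    # Toy stand-in for image matching: records are integers and the
    # best match is the one closest to a target value.
    records = list(range(10_000))
    target = 4321
    best = run_job(records, lambda rec: -abs(rec - target))
    print([rec for _, rec in best])
    print(total_seconds(1, 1000, 1000, 1, servers=1))
    print(total_seconds(1, 1000, 1000, 1, servers=100))
```

Because each server returns only its own top 10, the reduce step handles at most 1,000 candidates no matter how many records were scanned, which is why the final sort stays cheap.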
Enterprises are Integrating Big Data with Existing Systems

[Architecture diagram. Big Data Sources: TRANSACTIONS, OBSERVATIONS, INTERACTIONS, INFORMATION. Other labels: ERM/CRM, ERP, Core Applications, Financials, Social Media, Exhaust Data, Web Logs, Public Domain, Paid, Demographics; Business Infrastructure, DevTools, Apps / Spreadsheets, BI / Visualization, CEP, BPM, Customer Facing Web Apps, Mobile Apps, Deep Analytics, Next Generation Collaboration, Self-Service, Un-constrained, Discovery Tools, ODS, EDW, Marts, Low Latency / NoSQL. Hadoop components: Templeton, WebHDFS, Sqoop, Flume, HCatalog, Pig, Hive, HBase, MapReduce, HDFS, Ambari, Oozie, Mahout, Zookeeper. Legend: New / Custom, Hadoop, Operations, Existing]

Challenges

To capture the full potential of Big Data, several key challenges have to be overcome.

Challenge areas: Governance; Regulation and Security; Organizational Change and Talent; IT Delivery and Industry Structure; Supporting Technology

  - Ownership of, and access to, data.
  - Traditional compliance and security tools might not fit.
  - Integrating enterprise and 3rd party data has legal restrictions.
  - Big Data is still in its early stages.
  - There might be organizational changes required.
  - Shortage of specialized analytical skills.
  - New business models are needed to take advantage of accelerated analytics.
  - Traditional SDLC models will limit business agility.
  - Legacy infrastructure in some industry sectors limits integration.
  - Political resistance and leadership buy-in.
  - Technology is still evolving.
  - Analytical theory to support big data is not mature.
Big Data Usage Patterns

  Exploration & machine learning
    Description: Iterating on large data sets, looking for patterns and new ways to predict future trends
    Example: AML and fraud patterns, counterparty risk analysis, e-mail and social media analytics

  Operational prediction
    Description: Big data feeds operational predictive models with new data upon which to base predictions
    Example: Online fraud detection, market alerts, trade analytics

  Accelerating access to operational data
    Description: Store and distribute raw, semi-structured operational data for expert analysis
    Example: Rapid response to management and regulatory questions

  Bulk data operations & extreme ETL
    Description: Batch operations on data at massive scale are conducted using parallel processing techniques
    Example: Making data warehouse operations faster and cheaper with massive scale bulk data movements

  Stream & event analytics
    Description: Rapidly changing data are processed in parallel using complex events or more sophisticated stream filtering and mining algorithms
    Example: Trade analytics, fraud detection, online customization, next-best product, business alerts

Big Data Vendors

  Hadoop distributors
    - Contribute to Hadoop OSI or extend Apache and distribute their own flavor.
    - Offer consulting and training services.
    - Create additional tools to make Hadoop enterprise class.
    Vendors: Cloudera, Hortonworks

  Hadoop integrators
    - Provide Hadoop integrations to their existing tools to access and analyze big data.
    - Provide tools that make developing big data solutions easier.
    - Provide solution frameworks and packages that use Hadoop under the hood.
    Vendors: IBM BigInsights, Karmasphere, Informatica, EMC/Greenplum

  Proprietary solutions
    - Have created their own data storage and analytic platforms.
    - Generally meet the characteristics of big data: MPP on top of commodity hardware.
    Vendors: Teradata/Asterdata, LexisNexis HPCC, Microsoft Linq2HPC

  Strategy, Execution and Delivery
    - Experience with traditional solutions for BI, analytics, databases and data governance
    - Consider big data as part of overall business strategy and technical architecture
    - Established design patterns to solve use cases across industries
    - Hands-on expertise and proven methodologies
    Vendors: NewVantage Partners
THANK YOU
www.newvantage.com