A holistic approach to Big Data. Raul F. Chong, Senior Big Data and Cloud Program Manager, Big Data University Community Leader. rfchong@ca.ibm.com. 2013 BigDataUniversity.com
Agenda: The state of Big Data adoption; Big Data: A holistic approach; The 5 high value Big Data use cases; Technical details of key Big Data components; The future of Big Data and Cloud; Demos; Resources
Big Data Adoption Phases
What is your Big Data source? What type of data/records are you planning to analyze using big data technologies? Multiple responses accepted
What do you want to do with the Big Data collected? What kind of analytics do you want to perform on this big data? Multiple responses accepted
Use of Big Data globally and in the financial sector Multiple responses accepted
KTH Swedish Royal Institute of Technology: Reducing Traffic Congestion. Deployed a real-time Smarter Traffic system to predict and improve traffic flow. Analyzes streaming real-time data gathered from cameras at entries and exits to the city, GPS data from taxis and trucks, and weather information. Predicts the best time and method to travel, such as when to leave to catch a flight at the airport. Results: makes it possible to analyze and predict traffic faster and more accurately than before; provides new insight into mechanisms that affect a complex traffic system; smarter, more efficient, and more environmentally friendly traffic.
University of Southern California Innovation Lab: Monitors Political Debates. Benefits: real-time display of public sentiment as candidates respond to questions; debate winner prediction based on public opinion instead of solely on political analysts.
Big Data: A holistic approach. Big Data is Not Only Hadoop! Examples where Hadoop is not entirely applicable: cyber security, stock market, traffic control, sensor information, monitoring trends in social media. What if your company has many silos of information that are difficult to move to HDFS? What about governance? Can we trust the source of this data?
Big data holistic approach: A platform. The IBM Big Data Platform. Layers shown: Solutions; Analytics and Decision Management; Big Data Platform; Big Data Infrastructure. Components of the Big Data Platform layer:
Data Warehouse: delivers deep insight with advanced in-database analytics and operational analytics.
Stream Computing: analyze streaming data and large data bursts for real-time insights.
Hadoop System: cost-effectively analyze petabytes of unstructured and structured data.
Information Integration & Governance: govern data quality and manage the information lifecycle.
Accelerators: speed time to value with analytic and application accelerators.
Visualization & Discovery: discover, understand, search, and navigate federated sources of big data.
Also shown in the platform layer: Application Development; Systems Management.
Big data holistic approach: A platform. Process any type of data: structured, unstructured, in-motion, at-rest, in-place. Built-for-purpose engines: designed to handle different requirements. Manage and govern data in the ecosystem: enterprise data integration. Grow and evolve on current infrastructure. The whole is greater than the sum of parts: integrated components; out-of-the-box, standards-based services; start small (value is additive). (Platform diagram: Solutions; Analytics and Decision Management; Big Data Platform with Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance; Big Data Infrastructure.)
An example of the big data platform in practice. Zones: Ingestion and Real-time Analytic Zone; Landing and Analytics Sandbox Zone; Warehousing Zone; Analytics and Reporting Zone; Metadata and Governance Zone. Components shown: Streams; Connectors; Hadoop (MapReduce, Hive/HBase); documents in a variety of formats; Enterprise Warehouse; Col Stores; Data Marts; BI & Reporting; Predictive Analytics; Visualization & Discovery; ETL, MDM, Data Governance.
The 5 High Value Big Data Use Cases. Big Data Exploration: find, visualize, and understand all big data to improve business knowledge. Enhanced 360º View of the Customer: achieve a true unified view, incorporating internal and external sources. Security/Intelligence Extension: lower risk, detect fraud, and monitor cyber security in real time. Operations Analysis: analyze a variety of machine data for improved business results. Data Warehouse Augmentation: integrate big data and data warehouse capabilities to increase operational efficiency.
Big Data Exploration: Illustrated. Find, visualize, and understand all big data to improve business knowledge. Greater efficiencies in business processes. New insights from combining and analyzing data types in new ways. Develop new business models with resulting increased market presence and revenue. (Diagram: Data Explorer, with UI / User, App Builder, Connector Framework, and Integration & Governance, alongside Streams, Hadoop, and Warehouse; data sources: CM, RM, DM; RDBMS; Feeds; Web 2.0; Email; Web; CRM, ERP; File Systems.)
Big Data Exploration: Example in Practice. Airplane manufacturer (blinded for confidentiality): exploring 4 TB to drive point business solutions (supplier portal, call center, etc.); single point of data fusion for all employees to use; reduced costs and improved operational performance for the business. Is Big Data Exploration Right for You? How do you separate the noise from useful content? How do you perform data exploration on large and complex data? How do you find insights in new or unstructured data types (e.g. social media and email)? How do you enable employees to navigate and explore enterprise and external content? Can you present this in a single user interface? How do you identify areas of data risk before they become a problem? What is the starting point for your big data initiatives? Big Data Platform Component Starting Point: Data Explorer.
Enhanced 360º View of the Customer: Illustrated.
Source systems:
CRM: Name: J Robertson; Address: 35 West 15th, Pittsburgh, PA 15213
ERP: Name: Janet Robertson; Address: 35 West 15th St., Pittsburgh, PA 15213
Legacy: Name: Jan Robertson; Address: 36 West 15th St., Pittsburgh, PA 15213
Master Data Management, 360 View of Party Identity: First: Janet; Last: Robertson; Address: 35 West 15th St; City: Pittsburgh; State/Zip: PA / 15213; Gender: F; Age: 48; DOB: 1/4/64
Hadoop, Streams, Warehouse: Unified View of Party's Information.
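The slide shows three source records for one person being resolved into a single party identity. A minimal Python sketch of that idea follows. The matching key (last name plus city/zip) and the survivorship rules (longest first name, most common street) are assumptions made for illustration only; they are not the rules used by any Master Data Management product.

```python
from collections import Counter

# Source records as shown on the slide, one per source system.
records = [
    {"source": "CRM", "name": "J Robertson",
     "street": "35 West 15th", "city_zip": "Pittsburgh, PA 15213"},
    {"source": "ERP", "name": "Janet Robertson",
     "street": "35 West 15th St.", "city_zip": "Pittsburgh, PA 15213"},
    {"source": "Legacy", "name": "Jan Robertson",
     "street": "36 West 15th St.", "city_zip": "Pittsburgh, PA 15213"},
]


def match_key(record):
    """Hypothetical matching key: last name plus city/zip, lower-cased."""
    last_name = record["name"].split()[-1].lower()
    return (last_name, record["city_zip"].lower())


def normalize_street(street):
    """Drop a trailing 'St' or 'St.' so spelling variants compare equal."""
    words = street.replace(".", "").split()
    if words and words[-1].lower() == "st":
        words = words[:-1]
    return " ".join(words)


def merge(group):
    """Assumed survivorship rules: longest first name, most common street."""
    first_name = max((r["name"].split()[0] for r in group), key=len)
    streets = Counter(normalize_street(r["street"]) for r in group)
    return {
        "first": first_name,
        "last": group[0]["name"].split()[-1],
        "street": streets.most_common(1)[0][0],
        "city_zip": group[0]["city_zip"],
        "sources": [r["source"] for r in group],
    }


# Group records that share a matching key, then merge each group.
groups = {}
for record in records:
    groups.setdefault(match_key(record), []).append(record)

golden_records = [merge(group) for group in groups.values()]
print(golden_records)
```

With these assumed rules the three records collapse into one party with first name Janet and street 35 West 15th, which matches the unified view on the slide. Real entity resolution typically uses probabilistic or fuzzy matching rather than an exact key.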
Enhanced 360º View of the Customer: Insight from user's photos. (Diagram labels: Pins / Re-pins; Likes / Dislikes; Tweets; Favorites; Photo Albums and Pinboards; Photo Semantic Analysis; User Segmentation; Style; Kitchen; Gallery; Consumer; Dream Home; Wedding; Computer; Advertisements; Promotions; Campaigns; Planning; Retailers, Marketers and Planners; Preferred Styles; Designs; Products; Interests.)
Enhanced 360º Customer View: Customer Example. Leading medical equipment supplier (blinded for confidentiality): increase revenue and decrease cost in the call center; increase customer and employee satisfaction; leverage new data types in customer analysis. Is the Enhanced 360º Customer View Right for You? How do you identify and deliver all data as it relates to a customer, product, or competitor to those who need it? How do you gather insights about your customers from social data, surveys, support emails, etc.? How do you combine your structured and unstructured data to run analytics? How are you driving consistency across your information assets when representing your customers, clients, partners, etc.? How do you deliver a complete, enhanced view of the customer to your line-of-business users to ensure better business outcomes? Big Data Platform Component Starting Point: Data Explorer, Hadoop.
Security/Intelligence Extension: Illustrated. Traditional Security Operations and Technology, extended with Big Data Analytics. Data sources: logs; events; alerts; configuration information; system audit trails; network flows and anomalies; external threat intelligence feeds; web page text; e-mail and social activity; identity context; video/audio surveillance; business process data; customer transactions. New considerations: Collection, Storage and Processing (collection and integration; size and speed; enrichment and correlation); Analytics and Workflow (visualization; unstructured analysis; learning and prediction; customization; sharing and export).
Reconstructing Events: Integrating Multimedia from Diverse Sources. Security cameras: 100K security cameras (static cameras, slowly changing topology). Mobile cameras: 10M mobile photos/day (limited knowledge about locations). Social media: 50M social media photos/video (uncertain geo-temporal context). Overhead and other sources: moving vehicles (patrol cars), overhead drones, broadcast, retail, 311, etc. Correlate multimedia content across a wide diversity of sources and a dynamic topology of cameras. Exploit partial overlaps in field of view, re-identification of objects/people, and contextual information. Obtain a real-time operational picture across diverse content.
Security/Intelligence Extension: Customer Example. Captured and analyzed 42 TB of daily traffic in real time for tracking persons of interest, to take suitable action and reduce risk. Would the Security/Intelligence Extension benefit you? What are your plans to enrich your security or intel system with unused or underleveraged data sources (video, audio, smart devices, network, Telco, social media)? How will you address the need for sub-second detection, identification, and resolution of physical or cyber threats? How do you intend to follow activities of criminals, terrorists, or persons on a blacklist? How do you plan to enhance your surveillance system with real-time data from video, acoustic, thermal, or other security sensors? Do you want to correlate lots of technical or human intel data and sources looking for associations or patterns (big data forensics)? How are you going to deal with unstructured data (email, social, etc.) in your Security Information & Event Management (SIEM) solution to improve cyber threat detection and remediation? Big Data Platform Component Starting Point: Streams, Hadoop.
Operations Analysis: Illustrated. (Diagram labels: Raw Logs and Machine Data; Indexing, Search; Only store what is needed; Machine Data Accelerator; Statistical Modeling; Root Cause Analysis; Real-time Analysis; Federated Navigation & Discovery.)
Operations Analysis is a Business Imperative. Cost of system downtime: 49 percent of Fortune 500 companies experience more than 80 hours of system downtime per year [1]. The cost of downtime varies from $90,000/hr to $6.48 million/hr; at the upper bound, 80 hours * $6.48M = approx. $500M per year. System downtime costs North American businesses $26.5 billion a year in lost revenue [2]. [1] http://www.information-management.com/infodirect/2009_133/downtime_cost-10015855-1.html [2] http://www.itchannelplanet.com/business_news/article.php/3916786/it-system-downtime-costs-265-billion-a-year-study-finds.htm
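The arithmetic on the slide can be checked directly. The short Python sketch below uses only the figures quoted above (80 hours per year, $90,000 to $6.48 million per hour); the function and variable names are made up for illustration.

```python
HOURS_DOWN_PER_YEAR = 80          # "> 80 hours" on the slide, taken as 80
COST_PER_HOUR_LOW = 90_000        # lower bound quoted on the slide
COST_PER_HOUR_HIGH = 6_480_000    # upper bound quoted on the slide


def annual_downtime_cost(hours, cost_per_hour):
    """Yearly cost of downtime for a given hourly cost."""
    return hours * cost_per_hour


low = annual_downtime_cost(HOURS_DOWN_PER_YEAR, COST_PER_HOUR_LOW)
high = annual_downtime_cost(HOURS_DOWN_PER_YEAR, COST_PER_HOUR_HIGH)

print(f"Low estimate:  ${low:,}")
print(f"High estimate: ${high:,}")
```

The upper bound works out to $518.4 million, which the slide rounds to approximately $500M; the lower bound is $7.2 million per year.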
Operations Analysis: Customer Example. Intelligent Infrastructure Management: log analytics, energy bill forecasting, energy consumption optimization, anomalous energy usage detection, presence-aware energy management. Optimized building energy consumption with centralized monitoring; automated preventive and corrective maintenance. Utilized InfoSphere Streams, InfoSphere BigInsights, IBM Cognos. Would Operations Analysis benefit you? Do you deal with large volumes of machine data? How do you access and search that data? How do you perform root cause analysis? How do you perform complex real-time analysis to correlate across different data sets? How do you monitor and visualize streaming data in real time and generate alerts? Big Data Platform Component Starting Point: Hadoop, Streams.
Data Warehouse Augmentation: Needs. Integrate big data and data warehouse capabilities to increase operational efficiency. Need to leverage a variety of data: structured, unstructured, and streaming data sources required for deep analysis; low latency requirements (hours, not weeks or months); required query access to data. Extend warehouse infrastructure: optimized storage, maintenance, and licensing costs by migrating rarely used data to Hadoop; reduced storage costs through smart processing of streaming data; improved warehouse performance by determining what data to feed into it.
Data Warehouse Augmentation: Illustrated. Hadoop: filter and summarize big data for the warehouse.
Data Warehouse Augmentation: Illustrated. Hadoop as a query-ready archive for a data warehouse.
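The "filter and summarize" pattern can be sketched in a few lines of Python. This is an illustration of the idea only: the event fields, the choice to keep only purchases, and the per-customer summary are all hypothetical, and in practice this step would run as a Hadoop job over far larger data.

```python
from collections import defaultdict

# Hypothetical raw events, standing in for detailed data kept on Hadoop.
raw_events = [
    {"customer": "c1", "type": "purchase", "amount": 20.0},
    {"customer": "c1", "type": "page_view", "amount": 0.0},
    {"customer": "c2", "type": "purchase", "amount": 5.0},
    {"customer": "c1", "type": "purchase", "amount": 10.0},
]


def summarize_for_warehouse(events):
    """Filter out non-purchase events and summarize the rest per customer."""
    totals = defaultdict(lambda: {"purchases": 0, "revenue": 0.0})
    for event in events:
        if event["type"] != "purchase":
            continue  # filtered out: never reaches the warehouse
        row = totals[event["customer"]]
        row["purchases"] += 1
        row["revenue"] += event["amount"]
    # One compact row per customer is what gets loaded into the warehouse.
    return [{"customer": c, **row} for c, row in sorted(totals.items())]


warehouse_rows = summarize_for_warehouse(raw_events)
print(warehouse_rows)
```

Four raw events become two warehouse rows here; the detailed events stay in the cheaper store, which is the cost argument the slides make.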
Data Warehouse Augmentation: Customer Example. Improved analysis performance by over 40 times, reduced wait time from hours to seconds, and increased campaign effectiveness by 20+%. Could Data Warehouse Augmentation benefit you? Are you drowning in very large data sets (TBs to PBs) that are difficult and costly to store? Are you able to utilize and store new data types? Are you facing rising maintenance/licensing costs? Do you use your warehouse environment as a repository for all data? Do you have a lot of cold, or low-touch, data driving up costs or slowing performance? Do you want to perform analysis of data in motion to determine what should be stored in the warehouse? Do you want to perform data exploration on all data? Are you using your data for new types of analytics? Big Data Platform Component Starting Point: Hadoop, Streams.
Sentiment Analysis using IBM Text Analytics (Basic example)
Sentiments for movie Ra.One :-(
Sentiments for movie Swades :-)
Architecture Diagram. AQL: rule language with familiar SQL-like syntax; specify annotator semantics declaratively. Text Analytics Optimizer: choose an efficient execution plan that implements the semantics; produces the Compiled Operator Graph (.aog). Text Analytics Runtime: highly scalable, embeddable Java runtime; reads the Input Document Stream and emits the Annotated Document Stream.
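To make the flow from input document stream to annotated document stream concrete, here is a minimal dictionary-based sentiment annotator in plain Python. It is not AQL and does not use the IBM Text Analytics runtime; the word lists, function names, and sample sentences are invented for illustration.

```python
import re

# Hypothetical sentiment dictionaries; a real annotator would use far
# richer rules than simple word lists.
POSITIVE = {"good", "great", "loved", "brilliant"}
NEGATIVE = {"bad", "boring", "terrible", "awful"}


def annotate(document):
    """Attach matched sentiment words and an overall label to one document."""
    tokens = re.findall(r"[a-z']+", document.lower())
    positive = [t for t in tokens if t in POSITIVE]
    negative = [t for t in tokens if t in NEGATIVE]
    score = len(positive) - len(negative)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return {"text": document, "positive": positive,
            "negative": negative, "label": label}


def annotate_stream(documents):
    """Input document stream in, annotated document stream out."""
    for document in documents:
        yield annotate(document)


sample = ["Loved it, great songs", "Boring and awful plot",
          "Watched it yesterday"]
for annotated in annotate_stream(sample):
    print(annotated["label"], "-", annotated["text"])
```

The declarative approach on the slide separates what to extract (the rules) from how to execute it (the optimizer's plan). This sketch hard-codes both together, which is exactly what that design avoids.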
How Streams Works. Continuous ingestion, continuous analysis. Operations shown: Filter / Sample, Transform, Annotate, Correlate, Classify. Infrastructure provides services for scheduling analytics across hardware hosts and establishing streaming connectivity. Achieve scale by partitioning applications into software components and by distributing across stream-connected hardware hosts. Where appropriate, elements can be fused together for lower communication latency.
Scalable Stream Processing. Streams programming model: construct a graph. A mathematical concept, not a line, bar, or pie chart! Also called a network. Familiar: for example, a tree structure is a graph. Consisting of operators (OP) and the streams that connect them: the vertices (or nodes) and edges of the mathematical graph. A directed graph: the edges have a direction (arrows). Streams runtime model: distributed processes. Single or multiple operators form a Processing Element (PE). Compiler and runtime services make it easy to deploy PEs on one machine, or across multiple hosts in a cluster when scaled-up processing is required. All links and data transport are handled by runtime services: automatically, with manual placement directives where required.
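The operator-graph idea can be illustrated with Python generators, where each function is an operator and each generator is a stream connecting two operators. This is only an analogy that runs in a single process: it is not the Streams Processing Language, and the operator names and sensor data are hypothetical.

```python
def source(readings):
    """Source operator: emits tuples one at a time."""
    for reading in readings:
        yield reading


def filter_op(stream, predicate):
    """Filter operator: passes on only the tuples that match."""
    for item in stream:
        if predicate(item):
            yield item


def transform_op(stream, fn):
    """Transform operator: maps each tuple to a new value."""
    for item in stream:
        yield fn(item)


def sink(stream):
    """Sink operator: consumes the stream and collects the results."""
    return list(stream)


# Hypothetical sensor tuples.
readings = [
    {"sensor": "a", "celsius": 18.0},
    {"sensor": "b", "celsius": 41.5},
    {"sensor": "a", "celsius": 39.0},
]

# Wire the graph: source -> filter -> transform -> sink.
hot = filter_op(source(readings), lambda t: t["celsius"] > 38.0)
alerts = transform_op(hot, lambda t: f"{t['sensor']} is hot: {t['celsius']}")
results = sink(alerts)
print(results)
```

Each tuple flows through the whole graph before the next one is read, so nothing is stored in between. In the runtime model described above, the same graph would instead be split into processing elements and distributed across hosts.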
From Essential Elements to Running Jobs. Streams application graph: a directed, possibly cyclic, graph; a collection of operators, connected by streams. Each complete application is a potentially deployable job. Jobs are deployed to a Streams runtime environment, known as a Streams Instance (or simply, an instance). An instance can include a single processing node (hardware) or multiple processing nodes. (Diagram: Src operators feed OP operators over streams into Sink operators; a Streams instance spans several hardware nodes.)
Streams Runtime Illustrated. Optimizing scheduler assigns jobs to hosts, and continually manages resource allocation. Commodity hardware: laptop, blades, or high-performance clusters. (Diagram: operators labeled Meters, Company Filter, Usage Model, Temp Action, Usage Contract, Text Extract, Season Adjust, Daily Adjust, running on four x86 hosts.)
Streams Runtime Illustrated. Optimizing scheduler assigns PEs to hosts, and continually manages resource allocation. Commodity hardware: laptop, blades, or high-performance clusters. Dynamically add hosts and jobs. New jobs work with existing jobs. (Diagram: the earlier operators plus a second Meters, a second Text Extract, Degree History, Compare History, and Store History, running on five x86 hosts.)
Streams Runtime Includes High Availability. PEs on busy hosts can be moved manually by the Streams administrator. A PE failing on one host can be moved automatically to another; communications are automatically rerouted. (Diagram: the same operators as on the previous slide, running on five x86 hosts.)
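The placement and failover behaviour described above can be sketched with a toy scheduler. The least-loaded placement policy, the host and PE names, and the function names are all assumptions for illustration; the actual Streams scheduler is more sophisticated and its policy is not described in the slides.

```python
def least_loaded(placement, hosts):
    """Pick the host running the fewest PEs (ties broken by host name)."""
    load = {host: 0 for host in hosts}
    for host in placement.values():
        if host in load:
            load[host] += 1
    return min(hosts, key=lambda host: (load[host], host))


def place(pes, hosts):
    """Assign each PE to the currently least-loaded host."""
    placement = {}
    for pe in pes:
        placement[pe] = least_loaded(placement, hosts)
    return placement


def fail_over(placement, failed_host, hosts):
    """Move every PE from the failed host to a surviving host."""
    survivors = [host for host in hosts if host != failed_host]
    new_placement = {pe: host for pe, host in placement.items()
                     if host != failed_host}
    for pe, host in placement.items():
        if host == failed_host:
            new_placement[pe] = least_loaded(new_placement, survivors)
    return new_placement


hosts = ["host1", "host2", "host3"]
pes = ["pe1", "pe2", "pe3", "pe4"]

placement = place(pes, hosts)
after_failure = fail_over(placement, "host1", hosts)
print(placement)
print(after_failure)
```

After host1 fails, both of its PEs land on the surviving hosts and the PEs that were already elsewhere do not move. Rerouting the streams between the relocated PEs, which the runtime handles automatically, is outside the scope of this sketch.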
Social Data Analytics Accelerator Architecture. Input: social media data. Online flow (data-in-motion analysis): Stream Computing and Analytics: data ingest and prep; extract buzz, intent, sentiment; entity analytics: profile resolution. Output: dashboard with real-time analytics, pre-defined views and charts. Offline flow (data-at-rest analysis): BigInsights System and Analytics: extract buzz, intent, sentiment, and consumer profiles; entity analytics and integration; comprehensive social media customer profiles. Output: pre-defined workbooks and dashboards. Optional: indexed search with Data Explorer (index using Push API; ad hoc access).
Social Data Analytics Accelerator
Machine Data Analytics Accelerator: Preventing outages. Roles: Data Administrator, Data Scientist, End User. Steps: Import Logs, Extract, Transform, Analyze, Visualize. Business requirement: improve the ability to understand, correct, and anticipate outages. Solution overview: provide faceted search across log records from multiple systems to find events; link and correlate events across systems; discover interesting patterns. Solution detail: BigInsights applications for Import, Extract, Transform, Analyze, Visualize.
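The import, extract, and analyze steps can be sketched on a tiny example. The log format, system names, log lines, and the 10-second correlation window below are all hypothetical, and this plain-Python sketch does not use the BigInsights applications themselves.

```python
import re
from datetime import datetime, timedelta

# Hypothetical log format: "<date> <time> <LEVEL> <message>".
LOG_LINE = re.compile(r"^(\S+ \S+) (\w+) (.*)$")

# Imported logs, keyed by the system they came from (made-up data).
raw_logs = {
    "web": ["2013-05-01 10:00:05 ERROR upstream timeout",
            "2013-05-01 10:07:00 INFO request served"],
    "db": ["2013-05-01 10:00:02 ERROR connection pool exhausted"],
}


def extract(logs):
    """Parse raw lines into structured records, sorted by time."""
    records = []
    for system, lines in logs.items():
        for line in lines:
            match = LOG_LINE.match(line)
            if match is None:
                continue  # skip lines that do not fit the assumed format
            stamp, level, message = match.groups()
            records.append({
                "system": system,
                "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
                "level": level,
                "message": message,
            })
    return sorted(records, key=lambda r: r["time"])


def correlate_errors(records, window=timedelta(seconds=10)):
    """Pair up errors from different systems that occur close together."""
    errors = [r for r in records if r["level"] == "ERROR"]
    pairs = []
    for i, first in enumerate(errors):
        for second in errors[i + 1:]:
            if second["time"] - first["time"] > window:
                break  # records are time-sorted, so later ones are too far
            if second["system"] != first["system"]:
                pairs.append((first["system"], second["system"]))
    return pairs


records = extract(raw_logs)
linked = correlate_errors(records)
print(linked)
```

Here the database error at 10:00:02 is linked to the web error three seconds later, which is the kind of cross-system correlation that helps with root cause analysis.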
The Future of Big Data and Cloud. SQL for Hadoop: support improvements towards full ANSI support: Hive, Impala (Cloudera), Big SQL (IBM), Stinger (Hortonworks), Drill (MapR), HAWQ (Pivotal), SQL-H (Teradata). Improvements in multimedia analytics. Growth in usage and adoption of the R programming language. Cloud: bare metal support helping with Hadoop workloads; private network; full support with APIs.
Big Data University (bigdatauniversity.com): BigInsights on the Cloud - Making Learning Hadoop Easy and Fun. Flexible on-line delivery allows learning @your place and @your pace. Free courses, free study materials. Cloud-based sandbox for exercises, zero setup, with robust Course Management System and Content Distribution infrastructure. 108,000 registered students. Free IBM Hadoop, BigInsights publications.
Big Data University (bigdatauniversity.com): BigInsights on the Cloud - Making Learning Hadoop Easy and Fun. Quick Start Editions available (free, non-production, no time bomb): IBM InfoSphere BigInsights (IBM's Hadoop Distribution): ibm.co/quickstart; IBM InfoSphere Streams: ibm.co/streamsqs
My contact information. Email: rfchong@ca.ibm.com; Twitter: @raulchong; Facebook: facebook.com/raul.f.chong; LinkedIn: linkedin.com/pub/raul-f-chong/8/aa2/b63
Thank You!