The Deal About
Analytics and the Context Multiplier Actuarial data Epidemic data Government statistics Patient records Weather history... Location risk Occupational risk Raw Data Feature extraction metadata Travel history Dietary risk Family history... Personal financial situation Domain linkages Chemical exposure Full contextual analytics Social relationships
New Era of Cognitive Computing Tabulating Systems Programmable Systems Cognitive Systems
IBM Watson IBM Watson is a breakthrough in analytic innovation, but it is only successful because of the quality of the information from which it is working.
Watson Engagement Advisor: Customer engagement through cognitive computing. What it does: Automates customer interaction to increase customer engagement in sales and service; transforms customer engagement by knowing, engaging and empowering clients; develops customer relationships through a transformative user experience. How it does it: Provides answers, not links and webpages; answers with evidence, not guesses; not restricted to a predefined question-answer set; learns from every interaction.
Watson Discovery Advisor: Answer previously unanswerable research problems. Gain Awareness: Harness all available scientific knowledge in the hunt for a breakthrough and identify better leads for any researcher to pursue. Understand Relationships: Enable every scientist to identify new relationships and explore never-before-considered options that lead to real differentiating scientific innovations. Clarify Ideas: Drive insights from scientists who've made recent advances to peers, who can accelerate findings and raise productivity of the entire R&D group. Watson can read these medical records in six seconds!
Big Data Definition: Volume (Data at Scale), Variety (Data in Many Forms), Velocity (Data in Motion), Veracity (Data Uncertainty). Big Data is data that can't be stored or analyzed using traditional tools.
Without analytics, Big Data is just a bunch of data. MYTH: Big Data is only about large datasets; we should just say larger than what you have. MYTH: Big Data means Hadoop... that's it. MYTH: Big Data means rip-and-replace, death to the RDBMS and no governance. MYTH: NoSQL means no SQL, never, utter hatred for SQL. MYTH: Big Data means unstructured data and only for sentiment.
In 2005 there were 1.3 billion RFID tags in circulation around the world. By the end of 2011, this was about 30 billion and growing even faster.
An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics 1 BILLION lines of code EACH engine generating 10 TB every 30 minutes!
Applications for Big Data Analytics Smarter Healthcare Multi-channel sales Finance Log Analysis Homeland Security Traffic Control Telecom Search Quality Manufacturing Trading Analytics Fraud and Risk Retail: Churn, NBO
Automatic Temporal and Spatially Enriched Data
Use Cases: Law Enforcement and Security. Video surveillance, wiretaps, communications, call records, etc. Millions of messages per second with low density of critical data. Identify patterns and relationships among vast information sources. The US Government has been working with IBM Research since 2003 on a radical new approach to data analysis that enables high-speed, scalable and complex analytics of heterogeneous data streams in motion. The project has been so successful that the US Government will deploy additional installations to enable other agencies to achieve greater success in various future projects.
Velocity Creating Actionable Intelligence in Real Time Example Analytics:
Volume - The Government Industry Data Challenge IBM Multimedia Analysis & Retrieval Automatic Semantic Classification of Images and Video Content based feature extraction & Search Gigapixel Panorama Photography http://www.gigapixel.com/image/gigapancanucks-g7.html Think Big!
Predictive Analytics in a Neonatal ICU Real-time analytics and correlations on physiological data streams Blood pressure, Temperature, EKG, Blood oxygen saturation etc., Early detection of the onset of potentially life-threatening conditions Up to 24 hours earlier than current medical practices Early intervention leads to lower patient morbidity and better long term outcomes Technology also enables physicians to verify new clinical hypotheses
Why Didn't We Use All of the Big Data Before?
Warehouse Modernization Has Two Themes Traditional Analytics Structured & Repeatable Structure built to store data Hypothesis Question? Big Data Analytics Iterative & Exploratory Data is the structure Data All Information Exploration Analyzed Information Answer Data Start with hypothesis Test against selected data Actionable Insight Correlation Data leads the way Explore all data, identify correlations Analyze after landing Analyze in motion
Analyze all TRADITIONAL APPROACH BIG DATA APPROACH All available information Analyzed information All available information analyzed Analyze small subsets of information Analyze all information
Analyze as is TRADITIONAL APPROACH BIG DATA APPROACH Small amount of carefully organized information Large amount of messy information Carefully cleanse information before any analysis Analyze information as is, cleanse as needed
Find correlation TRADITIONAL APPROACH BIG DATA APPROACH Hypothesis Question Data Exploration Answer Data Insight Correlation Start with hypothesis and test against selected data Explore all data and identify correlations
Analyze in motion TRADITIONAL APPROACH BIG DATA APPROACH Data Analysis Data Repository Analysis Insight Insight Analyze data after it's been processed and landed in a warehouse or mart Analyze data in motion as it's generated, in real-time
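The difference between analyzing after landing and analyzing in motion can be illustrated with a small plain-Java sketch. It is not Streams code; the class name `RunningMean` and the sample readings are hypothetical. The idea shown is that the mean is updated as each value arrives, so an insight is available without first storing the whole data set.

```java
/** Hypothetical sketch: a mean that is updated as each reading arrives. */
final class RunningMean {
    private long count = 0;
    private double mean = 0.0;

    /** Fold one new value into the mean without retaining earlier values. */
    void accept(double value) {
        count++;
        mean += (value - mean) / count;
    }

    double mean() { return mean; }

    long count() { return count; }

    public static void main(String[] args) {
        RunningMean rm = new RunningMean();
        double[] readings = {10.0, 20.0, 30.0}; // made-up sample data
        for (double r : readings) {
            rm.accept(r);
            System.out.println("after " + rm.count() + " readings, mean = " + rm.mean());
        }
    }
}
```

A warehouse-style approach would instead land all readings first and compute the mean with a query afterwards; both can be valid, depending on how quickly the answer is needed.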
Complementary Analytics Traditional Approach Structured, analytical, logical New Approach Creative, holistic thought, intuition Internal App Data Mainframe Data Transaction Data OLTP System Data Data Warehouse Structured Repeatable Linear Hadoop and Streams Unstructured Exploratory Dynamic Multimedia Web Logs Social Data Text Data: emails Sensor data: images ERP Data RFID Traditional Sources New Sources
The NoSQL Revolution: Different requirements require different tools. Document stores; Key/value stores; BigTable implementations (columnar); Graph databases. Values (there are exceptions): Huge data volumes, easy scale-out; Developers code integrity if it's needed; Relaxed (eventual) consistency; Semi-structured data; Schema on read.
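"Schema on read" can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the API of any NoSQL product: the class `SchemaOnReadSketch`, the `key=value;key=value` record format, and the sample records are all hypothetical. Records are stored exactly as they arrive, and structure is applied only when a query reads them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: raw records are stored as-is and parsed only when read. */
final class SchemaOnReadSketch {
    private final List<String> rawRecords = new ArrayList<>();

    /** Write path: no schema check, the record is stored unchanged. */
    void put(String raw) {
        rawRecords.add(raw);
    }

    /** Read path: interpret "k=v;k=v" text as fields at query time. */
    static Map<String, String> parse(String raw) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : raw.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return fields;
    }

    /** Project one field across all records; records without it are skipped. */
    List<String> select(String field) {
        List<String> out = new ArrayList<>();
        for (String raw : rawRecords) {
            String value = parse(raw).get(field);
            if (value != null) {
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SchemaOnReadSketch store = new SchemaOnReadSketch();
        store.put("name=Ann;city=Oslo");
        store.put("name=Bob;phone=555-0100"); // different shape, still accepted
        System.out.println(store.select("name"));
        System.out.println(store.select("city"));
    }
}
```

The trade-off is the one the slide names: integrity checks move from the database into developer code, in exchange for flexibility with semi-structured data.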
Why NoSQL? Pressures on Traditional Relational Stores Budgetary constraints Technical change/ Different forms of data Regulatory pressures (SLAs, Archive, Governance)
Database Landscape Overview

Relational SQL (RDBMS)
  Description: Operational and analytic. E.g. DB2, Oracle, Microsoft, Teradata, etc.
  Typical infrastructure: Proprietary database storage; Unix, Linux, Windows; SMP, MPP, SAN, integrated systems, appliances
  Primary driver: Traditional I/T; ACID

NoSQL database
  Description: Mainly operational. E.g. Cloudant, MongoDB, Redis, Riak, Aerospike, Amazon DynamoDB, etc.
  Typical infrastructure: Proprietary database storage; Linux; commodity clusters; local attach disks, NAS; cloud; mobile
  Primary driver: Developer agility, scalability, workload, cost

Hadoop
  Description: SQL on Hadoop (mainly analytic); HBase (evolving OLTP, ACID). E.g. BigInsights, Cloudera, Hortonworks, MapR, Pivotal; HP Labs Trafodion
  Typical infrastructure: HDFS files; Linux; commodity clusters; local attach disks
  Primary driver: Lower cost; all types of data
Different Categories of NoSQL Databases

Document (63% revenue share*)
  Use this when: Schema is not well defined; schema is very likely to change, need to maintain flexibility; commonly described with JSON
  Application examples: Frequently changing product catalogs
  Vendors: Cloudant**, MongoDB, Couchbase, MarkLogic

Key-Value (24% revenue share*)
  Use this when: Your data is not highly related; all you need is basic Create, Read, Update, Delete (CRUD); rapid scaling for simple data collections
  Application examples: User sessions; shopping cart
  Vendors: Redis, Aerospike, AWS (DynamoDB), Basho Technologies (Riak)

BigTable/Columnar (9% revenue share*)
  Use this when: High volume, low latency write; Big Data, sparse data; need compression or versioning
  Application examples: Telco, heavy ingest, petabyte scale; user activity logs; sensor data
  Vendors: HBase (Hadoop)**, BigTable, Cassandra

Graph DB (4% revenue share*)
  Use this when: Your data looks like a graph; have highly interconnected data, need to trace relationships
  Application examples: Website purchase recommendations; social network processing
  Vendors: Titan**, Neo Technology (Neo4J)

* Source: IBM study 2013, estimated by splitting total NoSQL revenue ($288m) by ratio of top 10 vendors' reported 2013 revenue. Total 2013 NoSQL database revenue estimated $343m.
** IBM Solutions of Choice.
Hadoop Open-source software framework from Apache Inspired by Google MapReduce GFS (Google File System) HDFS Map/Reduce
Hadoop Explained

Hadoop computation model: Data stored in a distributed file system spanning many inexpensive computers. Bring function to the data: distribute application to the compute resources where the data is stored. Scalable to thousands of nodes and petabytes of data.

MapReduce Application:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text val, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(val.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts collected for one word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> val, Context context) {
            int sum = 0;
            for (IntWritable v : val) {
                sum += v.get();
            }
            // ... (the rest of the reducer is cut off on the slide)
        }
    }

1. Map Phase (break job into small parts): distribute map tasks to the cluster (Hadoop data nodes)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set): return a single result set
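The three phases named on this slide can be traced end to end in a single-process sketch. The code below is plain Java with no Hadoop dependency; the class `MapShuffleReduceSketch` and its method names are hypothetical and only mimic what the framework does across a cluster. It counts words, the same task as the slide's mapper and reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical single-process sketch of map, shuffle and reduce for word count. */
final class MapShuffleReduceSketch {

    /** Map phase: emit a (word, 1) pair for every token in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(Map.entry(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle: group the interim values by key so each key is reduced once. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce phase: boil the values for one key down to a single sum. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Run all three phases in order and return one result set. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> interim = new ArrayList<>();
        for (String line : lines) {
            interim.addAll(map(line)); // on a cluster, each line could go to a different node
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(interim).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=2, in=1, motion=1}
        System.out.println(wordCount(List.of("big data big", "data in motion")));
    }
}
```

In Hadoop itself the shuffle is performed by the framework between the mapper and reducer classes; it is written out here only to make the data flow visible.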
Big Data Enterprise platform Visualization & Discovery Applications & Development Administration Integration Big SQL JDBC BigSheets Dashboard & Visualization Apps Workflow Text Analytics Pig & Jaql MapReduce Hive Admin Console Monitoring Netezza DB2 Advanced Analytic Engines Adaptive Algorithms Text Processing Engine & Extractor Library) R Streams DataStage Workload Optimization Integrated Installer ZooKeeper Enhanced Security Oozie Splittable Text Compression Jaql Adaptive MapReduce Flexible Scheduler High Availability Guardium Platform Computing Lucene Pig H Catalog Index Cognos Runtime MapReduce Management Security Flume Data Store HBase Hive Audit & History Sqoop File System HDFS GPFS Lineage Open Source IBM
Future: The SQL interface.... Rich SQL query capabilities SQL '92 and 2011 features Correlated subqueries Windowed aggregates SQL access to all data stored in InfoSphere BigInsights Robust JDBC/ODBC support Take advantage of key features of each data source Application SQL Language JDBC / ODBC Driver JDBC / ODBC Server SQL interface Engine Leverage MapReduce parallelism OR achieving low-latency Data Sources HiveTables HBase tables CSV Files InfoSphere BigInsights
Spreadsheet-style Analysis Web-based analysis and visualization Spreadsheet-like interface Define and manage long running data collection jobs Analyze content of the text on the pages that have been retrieved
BigInsights Text Analytics Statistical Analysis (R module) Machine learning (SystemML) Ad-Hoc analysis (BigSheets) (Integration) DB2, Netezza, Streams, JAQL: IBM's programming language in the Hadoop world. Jaql is a complete solutions environment supporting all other BigInsights components. Integration point for various analytics: text analytics, statistical analysis, machine learning, ad-hoc analysis. Integration point for various data sources: local and distributed file systems, NoSQL databases, content repositories, relational sources (warehouses, operational databases). Jaql I/O Jaql Jaql Core Operators DFS NoSQL RDBMS Jaql Modules File System
Data In Motion and At Rest: Complementary. [Chart: unit of analysis (bytes up to 1 PB) plotted against latency (milliseconds up to years), with capability rated Low/Med/High and a sweet spot marked for Streams and for Warehouse/Hadoop.] At Rest: Warehouse/Hadoop - scalable processing of huge data stores. In Motion: Streams - scalable low-latency processing of stream data.
Streams Analyzes All Kinds of Data: Text, e.g. (listen, verb), (radio, noun); Mining in Microseconds (included with Streams); ***New*** Simple & Advanced Text (included with Streams); Predictive (IBM Research); Geospatial (IBM Research); Acoustic (IBM Research) (Open Source); Image & Video (Open Source); Advanced Mathematical Models (IBM Research); Statistics (included with Streams)
How Streams Works Continuous ingestion Continuous analysis 2013 IBM Corporation
How Streams Works: Continuous ingestion, continuous analysis. Operators: Filter / Sample, Transform, Annotate, Correlate, Classify. Infrastructure provides services for scheduling analytics across hardware hosts and establishing streaming connectivity. Achieve scale by partitioning applications into software components and by distributing across stream-connected hardware hosts. Where appropriate, elements can be fused together for lower communication latency.
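A chain of operators such as filter, transform and classify can be sketched with the standard `java.util.stream` API. This is only an analogy for the operator pipeline on the slide, not InfoSphere Streams code: the class `StreamPipelineSketch`, the temperature readings and the 100-degree threshold are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch: filter, transform and classify readings in one pipeline. */
final class StreamPipelineSketch {

    static List<String> process(List<Double> celsiusReadings) {
        return celsiusReadings.stream()
                // Filter: drop physically impossible readings.
                .filter(c -> c > -273.15)
                // Transform: convert Celsius to Fahrenheit.
                .map(c -> c * 9.0 / 5.0 + 32.0)
                // Classify and annotate: attach a label to each value.
                .map(f -> (f >= 100.0 ? "HOT" : "OK") + ":" + f)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [OK:68.0, HOT:122.0]
        System.out.println(process(List.of(20.0, -300.0, 50.0)));
    }
}
```

Each element flows through every stage before the next element is considered, which is the property that lets a streaming runtime place stages on different hosts or fuse them together.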
Streams Runtime Supports Placement Criteria: Host pools can force operators to be on hosts with solidDB installed. Operator placement constraints allow for co-location, ex-location, and isolation of operators. solidDB could be wrapped as a custom operator for dynamic deployment and relocation. [Diagram labels: Meters Meters Company Filter Usage Model Temp Action Usage Contract Text Extract Season Adjust Daily Adjust Text Extract Degree History Compare History Store History, spread across five x86 hosts.]
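The meaning of co-location and ex-location constraints can be made concrete with a small checker. This is a sketch under stated assumptions, not the Streams runtime's placement API: the class `PlacementSketch`, the operator names and the host names are hypothetical. Co-located operators must share a host; ex-located operators must not.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: validate an operator-to-host assignment against constraints. */
final class PlacementSketch {

    /**
     * Returns true if every co-located pair shares a host and every
     * ex-located pair sits on different hosts. Each pair is a two-element list.
     */
    static boolean satisfies(Map<String, String> hostOf,
                             List<List<String>> colocate,
                             List<List<String>> exlocate) {
        for (List<String> pair : colocate) {
            if (!hostOf.get(pair.get(0)).equals(hostOf.get(pair.get(1)))) {
                return false;
            }
        }
        for (List<String> pair : exlocate) {
            if (hostOf.get(pair.get(0)).equals(hostOf.get(pair.get(1)))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> hostOf = Map.of(
                "ingest", "host1",   // made-up operator and host names
                "lookup", "host1",
                "archive", "host2");
        List<List<String>> colocate = List.of(List.of("ingest", "lookup"));
        List<List<String>> exlocate = List.of(List.of("lookup", "archive"));
        System.out.println(satisfies(hostOf, colocate, exlocate));
    }
}
```

Isolation, the third constraint on the slide, is not modeled here; it could be expressed as an ex-location constraint between one operator and every other operator.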
Data Warehouse Augmentation: Value & Diagram Pre-Processing Hub Query-able Archive Exploratory Analysis 1 2 3 Streams Real-time processing Data Explorer BigInsights Landing zone for all data BigInsights Information Integration Data Explorer Find and view the data Can combine with unstructured information BigInsights Streams Offload analytics for microsecond latency Data Warehouse Data Warehouse Data Warehouse
Individual Silos can Answer Typical Questions, One-by-One: Who is this customer? What products can I upsell this customer? What products has she purchased? What impact will inventory have on her? What issues has this customer had in the past? What marketing materials should I send? What is her view of our company? What should I know before calling her for renewal? Where else has she worked? What's going on with this customer TODAY? What is available inventory? How can we increase engagement with her? How is her company using our products? How can we get more customers like her? Who is best able to help this customer? Silos: CRM, DBMS, Support Ticketing, Social Media, External Sources, Supply Chain, Content Mgt., Experts, Email, Fulfillment, Wiki. BUT! An enhanced 360º view provides answers in one application. Fusion of data from multiple systems enables deeper insights, not just facts.
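The step from separate silos to a single 360º view amounts to joining what each system knows about the same customer. The plain-Java sketch below illustrates that idea only; it is not the Data Explorer API, and the class `Customer360Sketch`, the customer id `c42` and the sample facts are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: gather what each silo knows about one customer. */
final class Customer360Sketch {

    /**
     * silos maps a system name (e.g. "CRM") to that system's records,
     * keyed by customer id. The result maps system name to the fact found.
     */
    static Map<String, String> view(String customerId,
                                    Map<String, Map<String, String>> silos) {
        Map<String, String> fused = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> silo : silos.entrySet()) {
            String fact = silo.getValue().get(customerId);
            if (fact != null) {
                fused.put(silo.getKey(), fact); // keep the source so the answer is traceable
            }
        }
        return fused;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> silos = new LinkedHashMap<>();
        silos.put("CRM", Map.of("c42", "renewal due next quarter"));
        silos.put("Support Ticketing", Map.of("c42", "two open tickets"));
        silos.put("Fulfillment", Map.of("c99", "order shipped")); // other customer
        System.out.println(view("c42", silos));
    }
}
```

Real deployments also have to resolve the case where the same customer carries different identifiers in different systems, which this sketch assumes away.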
Customer search Janet Robertson Transaction history Customer's Products Salesforce Customer info SAP Systems Microsoft Dynamics SharePoint Unstructured internal information related to customer Indexed 3rd-party information related to customer
IBM Cloud Offering for Analysts: Watson Analytics Natural language dialogue Data access and refinement Integrated social business Intelligent automation Report and dashboard creation Visual storytelling Guided data discovery Unified analytics experience 100% cloud based Mobile ready
The IBM Big Data Platform Hadoop-based low latency analytics for variety and volume Hadoop Information Integration Stream Computing High volume data integration and transformation Low Latency Analytics for streaming data MPP Data Warehouse Large volume structured data analytics Queryable Archive Structured Data BI+Ad Hoc Analytics on Structured Data Operational Analytics on Structured Data Time-structured analytics
Data Refineries Some water can be consumed raw Water is treated at source Heated Ready for consumption Softened Charcoal Filter Pumped into Landing Zone Reverse Osmosis
Data Reservoir: Refinery Services Information Governance Catalog Metadata for Data Sets Stored in Reservoir Repositories Integration Load Trickle feed Operational Systems Transactional DB Document Storage Landed Raw Data Landing, Exploration, Archive Discovery Sandbox Transformation Staging Trusted Data, Warehousing Deep Analytics, Modeling Analytic Appliance Reporting, Interactive Analysis Security Masking Test data generation NoSQL Doc Store Hadoop Mixed Workload RDBMS Data Mart Data Quality Cleansing Standardization Matching Reference data generation Data Reservoir Repositories (Zones) IBM DataWorks Data Lifecycle Archiving
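Two of the refinery services named here, standardization and masking, are simple enough to sketch in plain Java. The rules below are illustrative assumptions, not those of IBM DataWorks or any other product: the class `RefinerySketch`, the upper-casing rule and the "keep the last four characters" masking rule are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch of two refinery steps: standardization and masking. */
final class RefinerySketch {

    /** Standardization: trim, collapse runs of whitespace, upper-case. */
    static String standardize(String value) {
        return value.trim().replaceAll("\\s+", " ").toUpperCase(Locale.ROOT);
    }

    /** Masking: hide everything except the last four characters. */
    static String mask(String value) {
        int keep = Math.min(4, value.length());
        int hidden = value.length() - keep;
        return "*".repeat(hidden) + value.substring(hidden);
    }

    public static void main(String[] args) {
        System.out.println(standardize("  new   york ")); // prints NEW YORK
        System.out.println(mask("1234567890"));           // prints ******7890
    }
}
```

Standardized values make the matching service easier to apply, and masked values allow raw data to be shared with a discovery sandbox without exposing the full identifier.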
Information Management Zones Data Types Real-Time Analytical Processing Actionable Insight Machine and Sensor Data Operational Systems Landing, Exploration, Archive Trusted Data, Warehousing Deep Analytics, Modeling Decision Management Image and Video Enterprise Content Transaction and Application Data Transactional DB Document Storage Landed Raw Data Discovery Sandbox Transformation Staging Analytic Appliance Reporting, Interactive Analysis Predictive Analytics, Modeling Reporting, Analysis Social Data Third-Party Data NoSQL Doc Store Hadoop Mixed Workload RDBMS Data Mart Governance and Lifecycle Management Fabric Integration Matching Masking Lineage Security Privacy Glossary Discovery, Exploration Mainframe, Power8, Intel, Cloud (Managed/Hosted), Bluemix Services
Emerging Big Data Implementation Pattern Ingestion and Real-time Analytic Zone Ingest Filter, Transform Analytics and Reporting Zone Correlate, Classify Warehousing Zone Query Engines Cubes Data Sinks Connectors Extract, Annotate Landing and Analytics Sandbox Zone Enterprise Warehouse Descriptive, Predictive Models Analytics MapReduce Hive/HBase Col Stores Indexes, facets Data Marts Widgets Discovery, Visualizer Search Ingest Documents In Variety of Formats Models Metadata and Governance Zone Repository, Workbench
IBM InfoSphere BigInsights Enterprise Edition Visualization & Discovery Applications & Development Administration Integration Big SQL JDBC BigSheets Dashboard & Visualization Apps Workflow Text Analytics Pig & Jaql MapReduce Hive Admin Console Monitoring Netezza DB2 Advanced Analytic Engines Adaptive Algorithms Text Processing Engine & Extractor Library) R Streams DataStage Workload Optimization Integrated Installer ZooKeeper Enhanced Security Oozie Splittable Text Compression Jaql Adaptive MapReduce Flexible Scheduler High Availability Guardium Platform Computing Lucene Pig H Catalog Index Cognos Runtime MapReduce Management Security Flume Data Store HBase Hive Audit & History Sqoop File System HDFS GPFS Lineage Open Source IBM
CaixaBank Big Data Reference Architecture. [Diagram labels: CaixaBank Electronic Journal (structured); CaixaBank at rest / in motion (unstructured); External Social Media (unstructured); CaixaBank operational system (structured); Integration; Text Analytics; Predictive Model; Streams (Data in Motion); Real Time Event Detection; Big Data (Data At Rest); Text Analytics; Deep Analytics; Pattern Detection; structured data; unstructured data; Datawarehouse; Customers Profiles; Offers Creation and Management System; Marketing; Matching System; Multichannel Notification System.]
CaixaBank Big Data Reference Architecture, with Deep Analytics detail. [Diagram labels: CaixaBank Electronic Journal (structured); CaixaBank at rest / in motion (unstructured); External Social Media (unstructured); CaixaBank operational system (structured); Integration; Text Analytics; Predictive Model; Streams (Data in Motion); Real Time Event Detection; Big Data (Data At Rest); Text Analytics; Deep Analytics; Pattern Detection; structured data; unstructured data; Datawarehouse; Customers Profiles; Offers Creation and Management System; Marketing; Matching System; Multichannel Notification System.] Deep Analytics (Research, Existing, Third-party): Behavior Analysis; Data linkage; Location Based Analysis; Sentiment Analysis; Concept Labeling & Classification; Intent Analysis; Influence Analysis; Topic Detection.
IBM Cloud Offering for Developers: Bluemix
Why are Developers Using Bluemix? To rapidly bring products and services to market at lower cost Go from zero to running code in a matter of minutes. To continuously deliver new functionality to their applications Automate the development and delivery of many applications. To extend existing investments in IT infrastructure Extend existing investments by connecting securely to on-premise infrastructure.
Cloudant: Database as a Service (Documents) Infrastructure Services Mobile Database as-a-service Systems of Engagement Social Internet of Things Embedded Systems Analytics Systems of Record SQLDB Relational DB
SQLDB: Database as a Service (Relational)
dashDB: Data Warehouse as a Service. Build More: Deploy in hours with rapid cloud provisioning; no infrastructure investment for cloud agility. Grow More: Load and Go with no tuning required; columnar, optimized for analytic workloads; memory optimized, takes analytics beyond in-memory. Know More: In-database analytics built in; R integration for predictive modeling; partner ecosystem for analytics; IBM Watson Analytics ready. [Diagram labels: Netezza Analytics; Cloud; dashDB; BLU Acceleration; 3rd Party DW.]
Enterprise Hadoop as a Service (EHaaS)
@BigData_paulz THINK