Transforming Government with Big Data and Analytics
Deepak Mohapatra, Sr. Consultant, IBM Software Group
dmohapatra@us.ibm.com
April 29th, 2014
Big Data Creates A Challenge And an Opportunity
Yet requires a shift in thinking and an evolution of approach
- All perspectives: Past (historical, aggregated), Present (real-time), Future (predictive)
- All people: All departments, Experts and non-experts, Executives and employees, Partners and customers
- All decisions: Major and minor, Strategic and tactical, Routine and exceptions, Manual and automated
- All information: Data, Media, Content, Machine, Social
From structured, linear, repeatable, IT-driven information delivery to a creative, dynamic, iterative, business-driven analytics environment
Analytics is moving closer to the data
Big Data Creates A Challenge And an Opportunity: What If You Could... (Four Key Paradigm Shifts)
1. Leverage more of the data being captured
 - Traditional approach: Analyze small subsets of information (only part of all available information is analyzed)
 - Big data and analytics approach: Analyze all information (all available information is analyzed)
2. Reduce effort required to leverage data
 - Traditional approach: Carefully cleanse information before any analysis (small amount of carefully organized information)
 - Big data and analytics approach: Analyze information as is, cleanse as needed (large amount of messy information)
3. Data leads the way and sometimes correlations are good enough
 - Traditional approach: Start with hypothesis and test against selected data (diagram labels: Hypothesis, Question, Data, Answer)
 - Big data and analytics approach: Explore all data and identify correlations (diagram labels: Data, Exploration, Correlation, Insight)
4. Leverage data as it is captured
 - Traditional approach: Analyze data after it's been processed and landed in a warehouse or mart (diagram labels: Data, Repository, Analysis, Insight)
 - Big data and analytics approach: Analyze data in motion as it's generated, in real-time (diagram labels: Data, Analysis, Insight)
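The fourth shift, analyzing data in motion rather than after it has landed in a warehouse, can be illustrated with a minimal sketch. The Python below is not taken from the presentation or from any IBM product; the function name, window size and threshold are assumptions chosen only to show the idea of evaluating each reading as it arrives.

```python
from collections import deque


def rolling_alerts(readings, window=3, threshold=50.0):
    """Yield (index, mean) each time the mean of the last `window`
    readings exceeds `threshold`.

    `readings` can be any iterable, including an unbounded generator,
    so each value is evaluated as it arrives instead of after a batch load.
    """
    recent = deque(maxlen=window)  # keeps only the newest `window` values
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield index, mean


if __name__ == "__main__":
    # Hypothetical sensor values, for illustration only.
    for index, mean in rolling_alerts([10, 20, 30, 100, 110, 10]):
        print(f"alert at reading {index}: rolling mean {mean:.1f}")
```

Because the function is a generator, it holds only the current window in memory, which is the property that separates in-motion analysis from the load-then-query pattern.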
Big Data Creates A Challenge And an Opportunity: Harnessing Big Data For Analytics
1. Predict and decide the best action
 - Traditional approach: What has happened and why
 - Big data and analytics approach: What will happen and what should you do
2. Intuitive analytics for everyone
 - Traditional approach: The realm of the specialist
 - Big data and analytics approach: Embedded in everything
3. Cognitive computing
 - Traditional approach: Pre-programmed analysis on structured data
 - Big data and analytics approach: Learn to sense and predict using all types of information
4. Analytics as and when you need it
 - Traditional approach: Scheduled
 - Big data and analytics approach: Real-time
Leading The Way: Government Agencies Harnessing The Value of Big Data
- Improved Public Safety And Mass Casualty Incident Management: 60X acceleration in active wildfire behavioral analysis, with greater accuracy
- Smarter Cities: Reduced traffic congestion, shorter travel times, enhanced emergency services
- Maritime Emergency Management: Early and accurate detection and notification of marine resource adverse events
- Insider Threat: Perpetual credentialing and vetting across branches and bases
- Smarter Water: $120M in savings, reduction in reportable incidents
Leading The Way: Government Agencies Harnessing The Value of Big Data
- National Border & Security: State-of-the-art covert surveillance system based on the Streams platform
- Smarter Social Services, Smarter Health Care Analytics: Optimized neonatal care through early detection, prevention, testing hypotheses
- Warfighter Care: Greatly reduced frequency and severity of Traumatic Brain Injury
- Smarter Energy, Wind Infrastructure: Optimized wind turbine energy production with longer life span and reduced maintenance
- Smarter HealthCare Analytics: High performance analytics across a vast volume of data to spot hidden trends and relationships with Netezza
Leading The Way: Government Agencies Harnessing The Value of Big Data
- Smarter Energy: Enhanced collaboration across the electrical grid for greater demand and maintenance efficiency
- Smarter Resource Management, Watershed Intelligence: Early and accurate detection and notification of events affecting the watershed
- Smarter Social Services: In four hours, identified $140M in improper payments
- Social Media Analytics: Efficiently read, process and analyze large volumes of political debate-related public feedback in real-time
- Optimized Healthcare Outcomes, Asian Health Bureau: Significantly improved healthcare outcomes through automated image analysis and information sharing
Leading the Way: City of Boston
Leading the Way: Naval & Maritime Threat Intelligence
What We Have Learned: A Deliberate Approach Is Required
"These experiences reveal a great irony -- that while the impact of Big Data will be transformational, the path to effectively harnessing it is not. The journey is evolutionary versus revolutionary, incremental and iterative." (Demystifying Big Data, TechAmerica Report, October 2012)
1. Start with a clear business requirement.
2. Explore the Art of the Possible and define a discrete set of high value use cases.
3. Discover & Assess: Take inventory and understand your data assets, and assess your current capabilities against what is required to support your initial use cases.
4. Select & Plan your initial project.
5. Deploy & Manage: Deploy capabilities to support the initial use case.
Major Capabilities
Next Generation Analytics Reference Architecture Open Architecture / Multiple Entry Points Internal Databases Real-time Analytics Zone Content Repositories External Federated Data Data Ingestion & Integration Zone Landing Zone (Hadoop) Data Warehouse & Marts Zone Analytics, Visualization and Consumption Social Media Analytics Appliances Information Governance, Security and Business Continuity
Next Generation Architecture Zones
Data in Motion, Data at Rest, Data in Many Forms
- Data Integration Zone (Streams, Federation): Stream Processing, Data Integration, Data Federation, Data Quality
- Real-time Analytics Zone: Video/Audio, Text Mining, Network/Sensor, Entity Analytics, Predictive
- Landing Zone (Hadoop): Raw Data, Structured Data, Unstructured Data, Text Analytics, Data Mining, Entity Analytics, Machine Learning
- Data Warehouse & Marts Zone: Structured Data, Discovery, Deep Reflection, Operational, Predictive
- Matching and Link Analysis: Identity Resolution, Matching and Linking, Network Analysis, Stewardship, Reference Data
- Analytics and consumption: Predictive Analytics, Computational Statistics, Business Intelligence, Data Exploration & Visualization, Collaboration, Social Networking
- Users: Inspectors, Investigators, Researchers, Administrators, Others
- Information Governance, Security & Business Continuity
Next Generation Architecture Zones with IBM Products Mapped
Data in Motion, Data at Rest, Data in Many Forms
- Data Integration Zone (Streams, Federation): Stream Processing, Data Integration, Data Federation, Data Quality. Product: Information Server
- Real-time Analytics Zone: Video/Audio, Text Mining, Network/Sensor, Entity Analytics, Predictive. Product: InfoSphere Streams
- Landing Zone (Hadoop): Raw Data, Structured Data, Unstructured Data, Text Analytics, Data Mining, Entity Analytics, Machine Learning. Products: InfoSphere BigInsights, SPSS
- Data Warehouse & Marts Zone: Structured Data, Discovery, Deep Reflection, Operational, Predictive. Product: PureData for Analytics (Netezza)
- Matching and Link Analysis: Identity Resolution, Matching and Linking, Network Analysis, Stewardship, Reference Data. Products: MDM, SPSS
- Analytics and consumption: Predictive Analytics, Computational Statistics, Business Intelligence, Data Exploration & Visualization. Products: SPSS Modeler, SPSS Statistics, Cognos
- Collaboration and Social Networking for Inspectors, Investigators, Researchers, Administrators, Others. Products: Data Explorer, i2 Analyst Notebook
- Information Governance, Security & Business Continuity. Products: Optim, Guardium
Analytics and Visualization
Law enforcement increasingly uses analytics to drive investigative and operational improvements to meet business challenges
Analytic technique and critical business question, from highest to lowest degree of complexity and competitive advantage:
Advanced Analytics (prescriptive and predictive): support new business models and opportunities
- Stochastic optimization: How can we achieve the best outcome, including the effects of variability?
- Optimization: How can we achieve the best outcome?
- Predictive modeling: What will happen next if...?
- Forecasting: What if these trends continue?
- Simulation: What could happen?
Operational Analytics: support ongoing business operations, meet compliance requirements
- Alerts: What actions are needed?
- Query/drill down: What exactly is the problem?
- Ad hoc reporting: How many, how often, where?
- Standard reporting: What happened?
Based on: Competing on Analytics, Davenport and Harris, 2007
Analytics disciplines are powerful force multipliers and critical for success
Structured data and unstructured content, made consumable and accessible appropriately
- Entity Analytics: Who is who? Who knows who? What is the nature of their relationship? How are persons, objects, locations and events connected?
- Descriptive Analytics: What is happening? How many, how often, where? What exactly is the problem? What actions are needed?
- Predictive Analytics: What could happen? (Simulation) What if these trends continue? (Forecasting) What will happen next if...? (Predictive Modeling)
- Prescriptive Analytics: How can we achieve the best outcome? (Optimization) How can we achieve the best outcome and address variability? (Stochastic Optimization)
- Content Analytics: Extracting insight, concepts and relationships from unstructured volumes
- Web/Social Analytics: Insights and intelligence from streaming data sources and the internet
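The "Who is who?" question behind entity analytics can be made concrete with a small sketch. The Python below is a deliberately crude illustration and not the matching logic of any IBM product: the record fields (`id`, `name`, `dob`) and the rule "same normalized name tokens plus same date of birth" are assumptions made for this example, whereas production identity resolution uses far richer, probabilistic matching.

```python
import re
from collections import defaultdict


def match_key(record):
    """Build a simple match key: order-insensitive, punctuation-free
    name tokens combined with the date of birth."""
    name = re.sub(r"[^a-z ]", "", record["name"].lower())
    tokens = sorted(name.split())
    return " ".join(tokens), record["dob"]


def resolve(records):
    """Return lists of record ids that share a match key,
    i.e. records that likely describe the same person."""
    clusters = defaultdict(list)
    for record in records:
        clusters[match_key(record)].append(record["id"])
    return [ids for ids in clusters.values() if len(ids) > 1]


if __name__ == "__main__":
    # Hypothetical records from two source systems.
    people = [
        {"id": 1, "name": "John A. Smith", "dob": "1980-01-02"},
        {"id": 2, "name": "Smith, John A", "dob": "1980-01-02"},
        {"id": 3, "name": "Jane Doe", "dob": "1975-05-05"},
    ]
    print(resolve(people))
```

Exact-key matching like this misses misspellings and nicknames; it is shown only to make the grouping step visible.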
Data Integration Zone
The IBM Solution: IBM Information Server
Delivering information you can trust
IBM Information Server: Unified Deployment
- Understand: Discover, model, and govern information structure and content
- Cleanse: Standardize, merge, and correct information
- Transform: Combine and restructure information for new uses
- Deliver: Synchronize, virtualize and move information for in-line delivery
Unified Metadata Management
Parallel Processing
Rich Connectivity to Applications, Data, and Content
Managing the Information Lifecycle
IBM Information Server: Design, Understand, Govern
- Plan: What is required?
- Discover: What assets exist?
- Analyze: What is my quality?
- Define: What is the solution?
- Develop: How can this solution be built?
- Deploy: What is the infrastructure?
- Monitor: Is my information still trusted?
- Optimize: How can efficiency be improved?
- Manage: Is my information well managed?
Report lineage: Source systems, Data integration, Warehouse, Master data, Data integration, Datamarts, OLAP reports, Business intelligence
Roles: Business Users, Subject Matter Experts, Architects, Data Analysts, Developers, Stewards
Warehousing and Marts Zone (Structured Data)
Recommended Warehousing Strategy
Flow: Transactional Sources and Other Sources, Information Integration Zone and Landing Zone, Data Warehouse and Marts Zone (Atomic Warehouse, then Mart-1, Mart-2, ... Mart-n), Power Users
- Traditional structured data analytics
- Data model to enable data integration processes
- Data value monitoring of thresholds and alerting
- Business intelligence reporting, slice-n-dice, dashboards, and scorecards
Atomic: Detailed, Historical, Validated, Clean, Standardized
Marts: Aggregated, Purpose-built, Re-creatable, Consumption focused, Targeted User Communities
E=Extract; T=Transform; L=Load
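The distinction between a detailed atomic layer and aggregated, re-creatable marts can be sketched in a few lines. The Python below uses the standard-library `sqlite3` module purely as a stand-in for a warehouse; the table names, columns and sample rows are hypothetical and not from the presentation.

```python
import sqlite3


def load_atomic(conn, payments):
    """Load detailed, validated rows into the atomic warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS atomic_payments "
        "(payment_id INTEGER PRIMARY KEY, program TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO atomic_payments VALUES (?, ?, ?)", payments)


def rebuild_mart(conn):
    """Drop and re-create the aggregated mart from the atomic table.

    Because the mart is derived entirely from atomic data, it can be
    rebuilt at any time without loss (the 're-creatable' property).
    """
    conn.execute("DROP TABLE IF EXISTS mart_payments_by_program")
    conn.execute(
        "CREATE TABLE mart_payments_by_program AS "
        "SELECT program, COUNT(*) AS payment_count, SUM(amount) AS total_amount "
        "FROM atomic_payments GROUP BY program"
    )


def read_mart(conn):
    """Return the mart rows ordered by program name."""
    return conn.execute(
        "SELECT program, payment_count, total_amount "
        "FROM mart_payments_by_program ORDER BY program"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_atomic(conn, [(1, "housing", 100.0), (2, "housing", 250.0), (3, "nutrition", 75.0)])
    rebuild_mart(conn)
    print(read_mart(conn))
```

Keeping history and detail only in the atomic table means each mart can stay small and purpose-built for one user community.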
PureData System for Analytics (PDA): the Data Warehousing Appliance
- Simplify: Integrate the server, storage and database into one optimized package
- Move analytics into the Data Warehouse: Move complex analytics into the database; integrated, high performance analytics within the data warehouse
Layers: Analytics, Database, Storage, Server
Integrated by Design: In-Database Analytics 2.0
Transformations, Mathematical, Geospatial, Predictive, Statistics, Time Series, Data Mining
- No data movement
- Analyze deep and wide data
- High performance, parallel computation
Spend Less Time Managing and More Time Innovating
- Easy Administration Portal
- No software installation
- No indexes and tuning
- No storage administration
Simplicity and Ease of Administration
- No dbspace/tablespace sizing and configuration
- No redo/physical/logical log sizing and configuration
- No page/block sizing and configuration for tables
- No extent sizing and configuration for tables
- No Temp space allocation and monitoring
- No RAID level decisions for dbspaces
- No logical volume creations of files
- No integration of OS kernel recommendations
- No maintenance of OS recommended patch levels
- No JAD sessions to configure host/network/storage
Data Experts, not Database Experts
Landing Zone for Structured and Unstructured
What is Hadoop?
- Apache Hadoop = free, open source framework for data-intensive applications
- Inspired by Google technologies (MapReduce, GFS)
- Well-suited to batch-oriented, read-intensive applications
- Originally built to address scalability problems of Nutch, an open source Web search technology
- Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
- CPU + disks of commodity box = Hadoop node
- Boxes can be combined into clusters
- New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written
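The MapReduce model that Hadoop implements can be sketched without a cluster. The Python below is a single-process illustration of the map, shuffle and reduce phases applied to a word count; it does not use the Hadoop API, and the function names and sample input are assumptions made for this example. On a real cluster, each input split would be mapped on a different node and the shuffle would move data across the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data big insight", "data in motion"]
    print(word_count(splits))
```

Because `map_phase` and `reduce_phase` depend only on their own inputs, adding nodes changes where they run but not how the job is written, which is the scaling property the slide describes.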
From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs
Editions, from Apache Hadoop up to enterprise class:
- Basic Edition: Free download. Apache Hadoop plus Jaql and integrated install
- Quick Start Edition: New for V2.1. Free. Non-production only
- Enterprise Edition: Sold by # of terabytes managed
- PureData for Hadoop: Appliance simplicity
Breadth of capabilities:
- Accelerators
- Performance Optimization
- Visualization Capabilities
- Pre-built applications
- Text analytics
- Spreadsheet-style tool
- RDBMS, warehouse connectivity
- Administrative tools, security
- Eclipse development tools
- Enterprise Integration
- Integrated web console...
PureData System for Hadoop brings BigInsights to the market in an appliance form factor
BigInsights Enterprise Edition
Legend: Open Source, IBM, Optional IBM and partner offerings
- Analytics and discovery: Text processing engine and library, BigSheets, Accelerator for social data analysis, Accelerator for machine data analysis
- Apps: Web Crawler, Boardreader, Distrib file copy, DB export, DB import, Ad hoc query, Machine learning, Data processing...
- Infrastructure: Integrated installer, Text compression, Enhanced security, Indexing, Big SQL, Oozie, Lucene, Jaql, HBase, ZooKeeper, Pig, Hive, MapReduce, Adaptive MapReduce, Flexible scheduler, GPFS FPO, HCatalog, HDFS
- Administrative and development tools, Web console: Monitor cluster health, jobs, etc.; Add / remove nodes; Start / stop services; Inspect job status; Inspect workflow status; Deploy applications; Launch apps / jobs; Work with distrib file system; Work with spreadsheet interface; Support REST-based API...
- Administrative and development tools, Eclipse tools: Text analytics; MapReduce programming; Jaql, Hive, Pig development; BigSheets plug-in development; Oozie workflow generation
- Connectivity and Integration: JDBC, Sqoop, DB2, Netezza, Streams, R, Flume, Data Explorer, Guardium, DataStage, Cognos BI
BigSheets - Spreadsheet-style Analysis on Hadoop
- Web-based analysis and visualization of big data
- Familiar paradigm designed for business users
- Spreadsheet-like interface
- Define and manage long running data collection jobs
- Analyze content of the text on the pages that have been retrieved
Big SQL: standard SQL access into Hadoop (BigInsights)
- Standard SQL syntax and data types: Joins, unions, aggregates, etc.; VARCHAR, decimal, TIMESTAMP, ...
- JDBC/ODBC drivers: Prepared statements, Cancel support, Database metadata API support, Secure socket connections (SSL)
- Optimization: MapReduce parallelism or local access for low-latency queries
- Varied storage mechanisms appropriate for Hadoop ecosystem
- Integration: Eclipse tools; DB2, Netezza, Teradata (via LOAD); Cognos Business Intelligence
2013, 2014 IBM Corporation
BigInsights and Text Analytics
- Distills structured info from unstructured text: Sentiment analysis, Consumer behavior, Illegal or suspicious activities
- Parses text and detects meaning with annotators
- Understands the context in which the text is analyzed
- Features pre-built extractors for names, addresses, phone numbers, etc.
- Built-in support for English, Spanish, French, German, Portuguese, Dutch, Japanese, Chinese
Unstructured text (document, email, etc.), leading to Classification and Insight:
"Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
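The idea of annotators that pull structured facts out of free text can be sketched on the sample passage from the slide. The Python below is an illustration only: it uses plain regular expressions and a small dictionary rather than the BigInsights text analytics engine, and the annotation types (`Score`, `Team`), the team dictionary and the output shape are assumptions made for this example.

```python
import re

# Sample passage from the slide.
TEXT = (
    "Football World Cup 2010, one team distinguished themselves well, "
    "losing to the eventual champions 1-0 in the Final. Early in the second "
    "half, Netherlands striker, Arjen Robben, had a breakaway, but the keeper "
    "for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for "
    "Spain for the win."
)

# Hypothetical dictionary annotator input: known team names.
TEAM_DICTIONARY = ("Netherlands", "Spain")

# Hypothetical pattern annotator: a match score such as "1-0".
SCORE_PATTERN = re.compile(r"\b(\d+)-(\d+)\b")


def annotate(text):
    """Return annotations (type, character span, matched text) in text order."""
    annotations = []
    for match in SCORE_PATTERN.finditer(text):
        annotations.append({"type": "Score", "span": match.span(), "text": match.group()})
    for team in TEAM_DICTIONARY:
        for match in re.finditer(r"\b" + re.escape(team) + r"\b", text):
            annotations.append({"type": "Team", "span": match.span(), "text": team})
    return sorted(annotations, key=lambda a: a["span"])


if __name__ == "__main__":
    for annotation in annotate(TEXT):
        print(annotation["type"], annotation["text"], annotation["span"])
```

Each annotation keeps its character span, so later rules could combine nearby annotations, for example a score mentioned close to two team names, into a higher-level fact.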
Hadoop Appliance for the Landing Zone
From custom and complex to organized simplicity
Components: Visualization, HDFS, HCatalog, MapReduce, Pig, Hive, Development Tools
Designed to:
- Simplify building, deploying and managing a Hadoop cluster
- Speed the time-to-value for Hadoop and unstructured data
- Maximize the overall analytic ecosystem
- Provide enterprise security and platform management
System for Hadoop
Search and Exploration
Data Explorer: Discover, Explore, Understand
Data Explorer allows in-place analysis and correlation of Big Data assets
Web RSS Feed Social Media Content Mgt Unstructured Data Systems Enterprise Unstructured Sources Unstructured Databases Data Warehouses SCM SOA, ESB, Web Service Enterprise Systems & Structured Data Stores Structured
Data Explorer Search Architecture with High Performance Index Publish Search Results User Profiles Display Templates, robust transformation, XML feed Clustering Engine Subscriptions Content Integration Query transformation & federation RSS/License Feeds Federated Sources Knowledge Base Thesauri Acronyms Ontology Support Semantic Processing Search Engine Web Results Content, Document, Record Mgt. Systems Databases RSS/License Feeds Collaboration Systems Email and Email Archives Internet (Web) CRM Systems File Systems
InfoSphere Data Explorer provides real-time access and fusion of big data for unlocking greater insight
- Scalability: Can analyze trillions of records, leveraging a resilient infrastructure with enterprise-class features. Extreme capacity to analyze all types of Big Data assets: Structured, Unstructured, Semi-Structured, Social Media, Web Content, Legacy applications, Enterprise Systems (Siebel, SAP, SharePoint) and more
- Security: Integrated technology to align with existing Big Data governance models. Security profiles of the underlying systems are respected, so users can see and analyze only the information for which they are authorized
- Accuracy / Relevancy: Provides the highest level of accuracy and relevancy for analyzing Big Data assets. Unique position-based index technology helps users quickly locate, reveal and explore Big Data content relationships
- Integration: Leave data in place while creating a virtual single repository for Big Data exploration and discovery. Ability to connect to CRMs, ERPs, ECMs, Web Content, Twitter, Facebook and thousands more assets
- Access across many sources
- Dynamic categorization
- Expertise location
- Leveraging structured and unstructured content
- Highly relevant, personalized results
- Refinements based on structured information
- Tagging and collaboration
- Virtual folders for organizing content
Governance and Organization
Optimizing Information Governance with Information Server
- Data Architect: Enterprise Data Models, Exchange Data Structures (Link, Populate)
- Business Glossary: Common Enterprise Vocabulary
- Information Analyzer: Search and Profile Source Data
- FastTrack: Map Sources to Target Model
- DataStage and QualityStage: Transform and Cleanse
- Information Services Director: Deploy to Services Oriented Architecture (SOA)
Each component shares through the Metadata Server and Metadata Workbench
Active Cross-Platform Administration, Management and Reporting
Information Governance Dashboard to Visualize and Control Governance
- Innovation: Indicators for policies and KPIs; rapid creation of tailored dashboards
- Value: Immediate insight into governance policy status; interception of issues when they start, right at the source
- Usage: Raises data confidence with visual governance status
Big Data Privacy and Security: Protect a Wider Variety of Sources
Agile Governance with InfoSphere Guardium and InfoSphere Optim
- Innovation: Data activity monitoring of more NoSQL, Hadoop, and relational systems; masking of sensitive data used in Hadoop
- Value: Protection is a prerequisite for the fundamental assumption of big data, sharing data for new insight; automation enables protection without inhibiting speed
- Usage: Ensures sensitive data is protected and secure
IBM's Data Governance Unified Process
1) Define Business Problem
2) Obtain Executive Sponsorship
3) Conduct Maturity Assessment
4) Build Roadmap
5) Establish Organizational Blueprint
6) Build Data Dictionary
7) Understand Data
8) Create Metadata Repository
9) Define Metrics
10.1) Appoint Data Stewards (Master Data Governance)
10.2) Manage Data Quality (Master Data Governance)
10.3) Implement Master Data Management (Master Data Governance)
11) Govern Analytics
12) Manage Security & Privacy
13) Manage Lifecycle of Information
14) Measure Results
You should consider creating an Analytics Center of Excellence (ACE)
ACE helps with:
- Capture and disseminate best practices
- Advise and consult on projects
- Promote user adoption
- Maintain the Analytics Architecture
- Maintain consistent toolset
Benefits of an ACE:
- Maximizes the quality, efficiency and application of analytics across all lines of business, resulting in greater confidence and consistency in decision-making
- Leads to a higher success rate for business analytics deployments, delivering more value at less cost and in less time
- Drives end user adoption, leading to a smoother path to improved outcomes
- Provides a formal organizational structure, enabling your organization to strike the right balance between agility and sound management in deploying analytics technologies
- Eliminates the gap between Business and IT, improving time-to-market and responsiveness to change
THINK
Deepak Mohapatra, Sr. Consultant, IBM Software Group
dmohapatra@us.ibm.com