GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION Syed Rasheed Solution Manager Red Hat Corp. Kenny Peeples Technical Manager Red Hat Corp. Kimberly Palko Product Manager Red Hat Corp.
AGENDA Demystifying Big Data Data Virtualization: Making Big Data Available to Everyone Red Hat Big Data Strategy and Platform Real World Customer Example using Red Hat Big Data Platform Demo Roadmap Q&A
DO WE AGREE ON WHAT BIG DATA IS?
Source: http://blogs.ifsworld.com/2013/02/how-will-big-data-influence-your-finance-team/
IT S ALL ABOUT GAINING BUSINESS INSIGHTS Improve product development Optimize business processes Improve customer care Improve customer lifetime value Personalize products Competitive intelligence
INFORMATION AND AGILITY GAP Only 28% Users have any meaningful data access 65% Constantly changing business needs Over 70% BI project efforts lies in Data Integration finding and identifying source data 57% IT s inability to satisfy new requests in a timely manner 54% The need to be a more analyticsdriven organization 47% Slow and untimely access to information 34% Business user dissatisfaction with IT-delivered BI capabilities
DATA CHALLENGES GETTING BIGGER FOR USERS NoSQL Pig Hive MapReduce HDFS Storm Flume HBase Jaql
RED HAT S BIG DATA STRATEGY Reduce Information Gap thru cost effectively making ALL data easily consumable for analytics Data Capture Process Integrate Data to Actionable Information Cycle Analytics
BIG DATA FOR EVERYONE
EASY ACCESS TO BIG DATA BI Reports & Analytics Analytical Reporting Tool Data Virtualization Server 1. Reporting tool accesses the data virtualization server via rich SQL dialect 2. The data virtualization server translates rich SQL dialect to HiveQL Hadoop Hive MapReduce HDFS 3. Hive translates SQL to MapReduce 4. MapReduce runs MR job on big data Big Data
Data Sources JBoss Data Virtualization Data Consumers TURN FRAGMENTED DATA INTO ACTIONABLE INFORMATION BI Reports & Analytics Mobile Applications ESB, ETL SOA Applications & Portals Easy, Real-time Information Access Consume Standard based Data Provisioning JDBC, ODBC, REST, SOAP, OData Design Tools Dashboard Compose Unified Virtual Database / Common Data Model Data Transformations Optimization Caching Virtualize Transform Federate Connect Native Data Connectivity Security Metadata Siloed & Complex Hadoop NoSQL Cloud Apps Data Warehouse & Databases Mainframe XML, CSV & Excel Files Enterprise Apps
BENEFITS OF DATA VIRTUALIZATION ON BIG DATA Enterprise democratization of big data Any reporting or analytical tool can be used Easy access to big data Seamless integration of big data and existing data assets Sharing of integration specifications Collaborative development on big data Fine-grained of security big data Increased time-to-market of reports on big data
CONVERGENCE OF FOUR DATA TRENDS Big Structured Data Transactional & Analytical Big Streaming Data Events & Messages Big Data Integration Big Data Processing Hadoop Big Unstructured Data Social & Interactions
Capture & Process Integrate & Analyze COMPREHENSIVE MIDDLEWARE PLATFORM CAPTURE, PROCESS AND INTEGRATE BIG DATA VOLUME, VELOCITY, VARIETY BI Analytics (historical, operational, predictive) SOA Composite Applications Data Integration JBoss Data Virtualization Messaging and Event Processing JBoss A-MQ and JBoss BRMS J In-memory Cache JBoss Data Grid Hadoop Red Hat Storage Red Hat Enterprise Linux & Virtualization Structured Data Streaming Data Semi-Structured Data
RED HAT BIG DATA PLATFORM JBoss Data Virtualization Integration Software JBoss BRMS JBoss A-MQ JBoss Data Grid Infrastructure Software Red Hat Storage Red Hat Enterprise Virtualization Red Hat Enterprise Linux
EXAMPLES: RED HAT BIG DATA PLATFORM IN THE REAL WORLD
BIG DATA IN THE UTILITIES Objective: Combine data from smart meters on homes with data from electricity generation and transmission and make it available to power providers Problem: The original smart grid project looked only at reading information from the meters on houses and now this data needs to be combined with generation and transmission data in a cost-effective way The data points are all over the place: sensors on the lines, in the field, homes, etc. The information must be accessible to multiple power providers through a common interface Solution: Use Messaging to collect data from a variety of sources and route it to a CEP for initial filtering. Process with Hadoop map/reduce and BRMS and distribute data to Data Virtualization to be combined with other sources and consumed with BI tools, and/or to JDG for in-memory data caching and/or send to archive.
SMART GRID PM Data Schedule PM Data Reports PM Admin Compose Regulatory Users Authentication Presentation REST Exposure API Exposure &Portal Tier Rules Creation / Updates Offline Storage Data Virtualization Cache Data Tier Normalization / MapReduce NoSQL-Cassandra PM Regional Translator / Scheduler Normalized Data Tier Adaptor Rules Sensor Adaptor Routing Function Data Adaptation & Routing Tier Collector Sensors Local Data Store Collector Scada Local Data Store Collector Meter Local Data Store Element Connection Tier Transmission Generation Consumer
RETAIL CUSTOMER USE CASE GAIN BETTER INSIGHT FOR INTELLIGENT INVENTORY MANAGEMENT Analytical Apps Objective: Right merchandise, at right time and price Problem: JBoss BRMS Data Driven Decision Management Cannot utilize social data and sentiment analysis with their inventory and purchase management system Solution: Consume Compose Connect Leverage JBoss Data Virtualization to mashup Sentiment analysis data with inventory and purchasing system data. Leveraged BRMS to optimize pricing and stocking decisions. Inventory Databases JBoss Data Virtualization Hive Sentiment Analysis Purchase Mgmt Application
DEMOS LUCIDWORKS, JBOSS DATA VIRTUALIZATION AND RED HAT STORAGE
ABOUT LUCIDWORKS Employs 40% of the committers for Lucene/Solr Makes 50% - 70% of the enhancements to each release of Lucene/Solr Only company to offer Open Source and Open Core Search Solutions
LUCENE/SOLR: ENABLING BETTER, DATA-DRIVEN DECISIONS
LUCIDWORKS DEMONSTRATION LucidWorks/Solr to provide full text search and statistics Data Virtualization provides the data through Teiid JDBC driver and pulls the data from Hive/Hadoop, CSV File, XML File Red Hat Storage provides the Enterprise Data Repository
DEMONSTRATION ARCHITECTURE
DEMOS HORTONWORKS AND JBOSS DATA VIRTUALIZATION
ABOUT HORTONWORKS Founded in 2011 by 24 engineers from the original Yahoo! Hadoop development and operations team Hortonworks drive innovation in the open exclusively via the Apache Software Foundation process Hortonworks is responsible for around 50% of core code base advances to Apache Hadoop
HORTONWORKS DATA PLATFORM 2 SANDBOX Enterprise Ready YARN, the Hadoop Operating System Stinger Phase 2; Interactive SQL Queries at Petabyte Scale Reliable NoSQL IN Hadoop with Hbase Technical Specs Component Version Apache Hadoop 2.2.0 Apache Hive 0.12.0 Apache HCatalog 0.12.0 Apache HBase 0.96.0 Apache ZooKeeper 3.4.5 Apache Pig 0.12.0 Apache Sqoop 1.4.4 Apache Flume 1.4.0 Apache Oozie 4.0.0 Apache Ambari 1.4.1 Apache Mahout 0.8.0 Hue 2.3.0
HORTONWORKS DV Dashboard to analyze the aggregated data by User Role DEMONSTRATION Objective: Secure data according to Role for row level security and Column Masking Problem: Cannot hide region data such as patient data from region specific users Consume Compose Connect JBoss Data Virtualization Solution: Leverage JBoss Data Virtualization to provide Row Level Security and Masking of columns Hive SOURCE 1: Hive/Hadoop in the HDP contains US Region Data Hive SOURCE 2: Hive/Hadoop in the HDP contains EU Region Data
HORTONWORKS Excel Powerview and DV Dashboard to analyze the aggregated data DEMONSTRATION Objective: Determine if sentiment data from the first week of the Iron Man 3 movie is a predictor of sales Problem: Cannot utilize social data and sentiment analysis with sales management system Consume Compose Connect JBoss Data Virtualization Solution: Leverage JBoss Data Virtualization to mashup Sentiment analysis data with ticket and merchandise sales data on MySQL into a single view of the data. Hive SOURCE 1: Hive/Hadoop contains twitter data including sentiment SOURCE 2: MySQL data that includes ticket and merchandise sales
DEMONSTRATION SYSTEM REQUIREMENTS JDK Oracle JDK 1.6, 1.7 or OpenJDK 1.6 or 1.7 JBoss Data Virtualization v6 Beta http://jboss.org/products/datavirt.html JBoss Developer Studio http://jboss.org/products JBoss Integration Stack Tools (Teiid) https://devstudio.jboss.com/updates/7.0-development/integration-stack/ Slides, Code and References for demo https://github.com/datavirtualizationbyexample/mashup-with-hive-and- MySQL Hortonworks Data Platform (A VM for testing Hive/Hadoop) http://hortonworks.com/products/hdp-2/#install Red Hat Storage http://www.redhat.com/products/storage-server/
JBOSS DATA VIRTUALIZATION PRODUCT ROADMAP AND BIG DATA
WHAT COMING: JBOSS DATA VIRTUALIZATION 6.1 Big Data Cloud Deployment Productivity Full connectivity support for: MongoDB Cloudera Impala Apache Solr Tech Preview Cassandra Accumulo Alpha availability on OpenShift Support for: Amazon RedShift Amazon SimpleDB Security audit log in Dashboard builder Improved usability for custom translator EAP 6.3 support RHEL 7 support MariaDB Azul JVM support
BENEFITS OF DATA VIRTUALIZATION ON BIG DATA Enterprise democratization of big data Any reporting or analytical tool can be used Easy access to big data Seamless integration of big data and existing data assets Sharing of integration specifications Collaborative development on big data Fine-grained of security big data Increased time-to-market of reports on big data
WHY RED HAT FOR BIG DATA? Transform ALL data into actionable information Cost Effective, Comprehensive Platform Community based Innovation Enterprise Class Software and Support Data Capture Process Integrate Data to Actionable Information Cycle Analytics
THANK YOU Q & A