EMC Federation Big Data Solutions
Introduction to the Federation's data analytics offering
Traditional Analytics
- Traditional data analysis, sometimes called Business Intelligence
- Analytics done for a predefined purpose, e.g. reporting
- Uses traditional, internal data sources
- Tends to be backward-looking
- Technology examples: Pivotal Greenplum, Oracle DB, IBM DB2, MySQL
Greenplum DB
- PostgreSQL-based relational database engine
- Capable of massively parallel processing
- Available as a software-based solution from Pivotal or as an appliance-based solution from EMC
Greenplum utilizes an MPP architecture
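The MPP idea above can be sketched in a few lines: rows are distributed across segment servers by hashing a distribution key, each segment works on its local shard in parallel, and the master combines the partial results. This is a toy illustration of the concept, not Greenplum's actual hash function or planner.

```python
# Minimal sketch of MPP-style distribution and parallel aggregation.
# NUM_SEGMENTS and the column names are illustrative assumptions.
from collections import defaultdict

NUM_SEGMENTS = 4

def segment_for(key):
    """Map a distribution-key value to one of the segments."""
    return hash(key) % NUM_SEGMENTS

def distribute(rows, key_col):
    """Hash-distribute rows across segments, as an MPP loader would."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_col])].append(row)
    return segments

def parallel_count(segments):
    # Each segment counts its local rows; the master sums the partials.
    partials = [len(local_rows) for local_rows in segments.values()]
    return sum(partials)

rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
segments = distribute(rows, "customer_id")
total = parallel_count(segments)  # same answer as a serial count: 1000
```

The point of the sketch is that the aggregate decomposes into independent per-segment work, which is what lets the real engine scale out.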
Big Data
- Data that is generated with great Velocity, comes in a Variety of types, and is so large in Volume that it is hard or impossible to analyze using traditional methods and technologies
- Best used for exploratory analytics and transformations of large volumes of data ("store first, ask questions later")
- Uses a multitude of data sources, including internal, external, social media, and streams
- Data can be structured, unstructured, or semi-structured, and is transformed during analysis if necessary
- Analytics tends to be predictive
- Technology examples: Hadoop (Pivotal HD, Hortonworks, Cloudera, etc.)

Data types:
- Structured (~20%): column-oriented data with a machine-readable structure
- Semi-structured (XML, email, etc.): data that seemingly has a structure but needs to be transformed for analysis
- Unstructured (photo, video, etc.): data with no clear structure that needs to be transformed for analysis
Why We Love Hadoop
- Flexible
- Scalable
- Inexpensive
- Fault-tolerant
- Rapidly adopted
What We Wish For, In Addition
- Ease of provisioning and management
- Plug-in support for an ecosystem of tools
- Elasticity of storage and compute
- Improved data management
- Interactive query response
- True SQL query interface
- Security controls
Our Hadoop Architecture
Core Hadoop Components
- HDFS: the Hadoop Distributed File System, the storage layer for Hadoop
- MapReduce: parallel processing framework used for data computation in Hadoop
- Hive: structured data warehouse implementation for data in HDFS that provides a SQL-like interface to Hadoop
- Sqoop: batch database-to-Hadoop data transfer framework
- Pig: high-level procedural language for data pipeline/data flow processing in Hadoop; Pig Latin syntax
- HBase: NoSQL, key-value data store on top of HDFS
- Mahout: library of scalable machine-learning algorithms
- Spring Hadoop: integrates the Spring framework with Hadoop
- Flume: data collection and loading utility (logs, etc.)
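The MapReduce component listed above follows a simple model: map each input record to (key, value) pairs, shuffle the pairs by key, then reduce each group. The classic word count can be simulated in a single process; real Hadoop runs the same phases as parallel tasks across the cluster against data in HDFS.

```python
# Toy, single-process simulation of the MapReduce phases.
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in the input line.
    for word in record.split():
        yield (word, 1)

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine each key's values into a final count.
    return (key, sum(values))

lines = ["big data", "big hadoop data", "data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 2, "data": 3, "hadoop": 1}
```

Because map and reduce operate on independent records and keys, each phase parallelizes naturally, which is what the framework exploits at scale.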
Pivotal HD Enterprise with ADS (HAWQ)
In addition to the core Hadoop components, Pivotal HD focuses on delivering the enterprise-class features required by our target customers and prospects. These features drive data-worker productivity, enable massively parallel data loading, support enterprise-grade storage options, and can be deployed in virtualized environments.

Pivotal HD Enterprise includes the core Hadoop components plus:
- Command Center: visual interface for cluster health, system metrics, and job monitoring
- Hadoop Virtualization Extension (HVE): enhances Hadoop to support virtual node awareness and enables greater cluster elasticity
- Data Loader: parallel loading infrastructure that supports line-speed data loading into HDFS
- Isilon Integration: extensively tested at scale, with guidelines for compute-heavy, storage-heavy, and balanced configurations
- Spark: in-memory datastore used for ingesting streams into HDFS

ADS adds the following to Pivotal HD Enterprise:
- Advanced Database Services (HAWQ): high-performance, true SQL query interface running within the Hadoop cluster
- Xtensions Framework: support for ADS interfaces on external data providers (HBase, Avro, etc.)
- Advanced Analytics Functions (MADlib): access to parallelized machine-learning and data-mining functions at scale
- Unified Storage Services (USS) and Unified Catalog Services (UCS): support for tiered storage (hot, warm, cold) and integration of multiple data provider catalogs into a single interface
HAWQ: The Crown Jewel of Greenplum
- MPP DBMS on Hadoop/HDFS
- Out-of-the-box true ANSI SQL for Hadoop
- ACID compliant
- High-performance query processing
- Multi-petabyte scalability
- Interactive and true ANSI SQL support
- Enterprise-class database services: column storage and indexes, workload management
- Allows consolidating your BI and Big Data analytics environments
This Changes Everything
- Leverage existing SQL skill sets for Hadoop
- True SQL interfaces for data workers and data tools
- Broad range of data format support: operate on data in place, or optimize for query response time
- Single Hadoop infrastructure for Big Data exploration AND analysis
- ODBC/JDBC APIs enable effortless use of analytic tools such as SAS, Tableau, and R with Hadoop
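The "true SQL" point above means standard ANSI SQL issued through an ordinary database connection, with no hand-written MapReduce jobs. A minimal illustration, using Python's stdlib sqlite3 as a stand-in for the SQL endpoint; against HAWQ the same style of query would travel over an ODBC/JDBC driver instead (an assumption about deployment, not shown here), and the table and column names are made up.

```python
# Plain ANSI SQL through a generic DB-API connection; sqlite3 stands
# in for the real cluster endpoint so the example is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("home", 1), ("home", 2), ("cart", 1)],
)

# An ordinary SQL aggregate, the kind a BI tool would generate.
rows = conn.execute(
    "SELECT page, COUNT(*) AS hits FROM clicks "
    "GROUP BY page ORDER BY hits DESC"
).fetchall()
# rows == [("home", 2), ("cart", 1)]
```

This is exactly why existing SQL tools and skill sets carry over: the interface is a connection string and standard SQL, regardless of the engine behind it.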
HAWQ Benchmarks
Query times (HAWQ vs. baseline), with the resulting speedup:
- User intelligence: 4.2 vs. 198 (47X)
- Sales analysis: 8.7 vs. 161 (19X)
- Click analysis: 2.0 vs. 415 (208X)
- Data exploration: 2.7 vs. 1,285 (476X)
- BI drill down: 2.8 vs. 1,815 (648X)
Xtensions Framework
- Gives you the ability to read different data/file types from HDFS using the HAWQ SQL interface and statistical functions
- Supported sources: HBase data, Hive data, and native data on HDFS in a variety of formats
- Bridges the great divide between the Hadoop side (MapReduce, Hive, HBase) and SQL
- Join in-database dimensions with other fact tables (HDFS files, RDBMS files)
- Fast ingest of data into SQL-native format (insert into ... select * from ...)
- Extensible API
- Reduced need for ETL processing for SQL analytics
Hadoop also utilizes an MPP architecture
Fast Data
- Data that requires a fast reaction at the time of creation
- Used with stream ingestion into an HDFS platform and event handling: catch it on the fly
- Great for applications that are heavily transactional
- Many technologies allow long-term persistence to HDFS to mine additional long-term value (trends, patterns, etc.)
- Technology examples: Pivotal GemFire, Apache Spark
Pivotal GemFire
- Data management grid architecture (a "data fabric")
- Very fast in-memory data cache
- Highly optimized read and write throughput: 100-1000X greater performance than traditional disk-based databases
- Very scalable:
  - Vertical scaling: multiple instances per server
  - Horizontal scaling: multiple instances across multiple servers/LAN/WAN
GemFire Architecture
Disk Stores
Disk stores support saving in-memory data to storage, in two modes:
- Persistence: used to store a redundant copy of data on disk. Applies to cached regions, gateway sender queues, and PDX serialization metadata.
- Overflow: used as an extension of the in-memory cache, expanding the memory capacity of a region using disk storage. Applies to cached regions, gateway sender queues, and server subscription queues.
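The overflow mode described above can be sketched as a bounded in-memory region that evicts its least-recently-used entries to disk when it fills up, and faults them back in on read. A toy model of the concept, with a dict standing in for the on-disk store; GemFire's actual eviction and disk-store machinery are far more involved.

```python
# Minimal LRU-overflow region: hot entries stay in memory, cold
# entries spill to a simulated disk store when capacity is exceeded.
from collections import OrderedDict

class OverflowRegion:
    def __init__(self, memory_capacity):
        self.memory = OrderedDict()   # hot entries, in LRU order
        self.disk = {}                # stand-in for the disk store
        self.capacity = memory_capacity

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)  # mark as most recently used
        while len(self.memory) > self.capacity:
            cold_key, cold_val = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_val   # overflow to disk

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        return self.disk.get(key)            # fault in from disk

region = OverflowRegion(memory_capacity=2)
for k in ("a", "b", "c"):
    region.put(k, k.upper())
# "a" was least recently used, so it overflowed to disk;
# region.get("a") still returns "A".
```

The design point: the region's logical capacity is memory plus disk, while the memory budget stays fixed.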
GemFire Functions
- Functions are GemFire's equivalent to database stored procedures
- Execute business logic that is co-located with data in memory: the fastest data access pattern
- Functions can be made asynchronous by setting the hasResult flag to false and not returning a value
- Functions can be executed programmatically or manually through gfsh
- Execution types:
  - onRegion: execute on a region/partition; code runs on the exact node where a specific key resides in a partitioned region
  - onServers: execute on all servers in a pool; code runs simultaneously on all nodes
  - onServer: execute on a single server in a pool
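The onRegion execution type above can be sketched as routing: the system hashes the key to find which member hosts it, then runs the function on that member, next to the data. The member names and the function body below are illustrative assumptions; this models the routing idea, not the GemFire FunctionService API.

```python
# Toy partitioned region: each node owns a hash slice of the keys,
# and onRegion-style execution runs the function on the owning node.
NODES = ["server-1", "server-2", "server-3"]
partition = {node: {} for node in NODES}

def owner_of(key):
    """Hash the key to the member that hosts it."""
    return NODES[hash(key) % len(NODES)]

def put(key, value):
    partition[owner_of(key)][key] = value

def execute_on_region(fn, key):
    """Run fn against the local data of the member owning `key`
    (simulated here in a single process)."""
    node = owner_of(key)
    local_data = partition[node]
    return node, fn(local_data, key)

put("order:42", {"total": 99})
node, result = execute_on_region(lambda data, k: data[k]["total"], "order:42")
# result is 99, computed "on" the node that owns order:42
```

Shipping the function to the data instead of the data to the function is what makes this the fastest access pattern on a partitioned region.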
GemFire Event Listeners
GemFire supports synchronous or asynchronous event management with one or more configured listeners.
- Event handlers are synchronous: they register interest in one or more events and are notified when the events occur. If you need to change the cache or perform any other distributed operation from event handler callbacks, be careful to avoid activities that might block and affect your overall system performance.
- AsyncEventListener instances are each serviced by a dedicated thread, asynchronously, on the cache server, using one of two deployment options:
  - A serial queue is deployed to one GemFire member and delivers all of a region's events, in order, to a configured AsyncEventListener implementation.
  - A parallel queue is deployed to multiple GemFire members, and each instance of the queue simultaneously delivers region events to a local AsyncEventListener implementation.
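The asynchronous pattern above amounts to: region writes append events to a queue, and a dedicated thread drains the queue and invokes the listener, so the writer never blocks on the callback. A minimal single-queue (serial) model using the stdlib; the event tuples and listener shape are illustrative, not the GemFire API.

```python
# Serial async-event-queue sketch: one dedicated delivery thread,
# in-order delivery to the listener, writers return immediately.
import queue
import threading

events = queue.Queue()
delivered = []

def listener_thread():
    while True:
        event = events.get()
        if event is None:            # shutdown sentinel
            break
        delivered.append(event)      # the "AsyncEventListener" callback

worker = threading.Thread(target=listener_thread)
worker.start()

# Region writes just enqueue; delivery happens on the worker thread.
for i in range(3):
    events.put(("create", f"key-{i}"))

events.put(None)
worker.join()
# delivered holds the three events, in write order
```

A single queue and thread is what gives the serial variant its ordering guarantee; the parallel variant trades that global order for one queue per member.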
The full analytics stack can be deployed to a private or public cloud, or to private infrastructure
Big Data Suite
- A subscription-based model from Pivotal
- Allows the use of all Pivotal data analytics tools and Cloud Foundry under a single license
External HDFS Storage
HDFS Storage on Isilon
- Highly scalable: multiple petabytes
- Scale-out infrastructure allows scaling capacity separately from compute
- No single point of failure
- Less data protection overhead than traditional DAS Hadoop
- Allows consolidation of production data, analytical data, and data archival onto a single platform

Copyright 2014 EMC Corporation. All rights reserved.
Isilon vs. Traditional Hadoop Datanodes
Data Lake
Store everything, analyze anything, build what you need
What is a Data Lake?
A centralized analysis architecture and repository designed to allow business units to:
- Store everything (in Isilon HDFS)
- Analyze anything: any file type, any source, any time (with Pivotal HD, GemFire, Greenplum, and other tools)
- Build what you need, in terms of applications that utilize Big Data (Pivotal software tools, e.g. Pivotal CF, Spring HD, etc.)
...and thereby discover additional business value from the data generated by day-to-day business activities.
Open Data Platform Initiative
Taking Big Data Forward
Key Goals
1. Accelerate the delivery of Big Data solutions by providing a well-defined core platform.
2. Define, integrate, test, and certify a standard "ODP Core" of compatible versions of select Big Data open source projects.
3. Provide a stable base against which Big Data solution providers can qualify solutions.
4. Produce a set of tools and methods that enable members to create and test differentiated offerings based on the ODP Core.
5. Reinforce the role of the Apache Software Foundation (ASF) in the development and governance of upstream projects.
6. Contribute to ASF projects in accordance with ASF processes and intellectual property guidelines.
7. Support community development and outreach activities that accelerate the rollout of modern data architectures that leverage Apache Hadoop.
8. Help minimize the fragmentation and duplication of effort within the industry.
Open Data Platform
- A standard Hadoop core (the "ODP Core") shared between participants
- Make additional components interoperable between different Hadoop distributions
- Prevent duplicated development effort and fragmentation of the platform
- Accelerate development
- Contribute to the ASF Hadoop project
Pivotal code contributions to ODP so far
ODP Participants and Contributors
- Platinum members: GE, Hortonworks, IBM, Pivotal, Infosys, SAS, International Telco
- Gold members: Capgemini, EMC, VMware, WANdisco, Altiscale, Teradata, Verizon, CenturyLink, PLDT, Splunk
Ambari: the new UI for Pivotal HD, from the ODP collaboration