EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.




EMC Federation Big Data Solutions

- Introduction to data analytics
- Federation offering

Traditional Analytics
- The traditional type of data analysis, sometimes called Business Intelligence
- Analytics done for a predefined purpose, e.g. reporting
- Uses traditional, internal data sources
- Tends to be backwards-looking
- Technology examples: Pivotal Greenplum, Oracle DB, IBM DB2, MySQL

Greenplum DB
- PostgreSQL-based relational database engine
- Capable of massively parallel processing (MPP)
- Available as a software-based solution from Pivotal or as an appliance-based solution from EMC

Greenplum utilizes an MPP architecture

Big Data
- Data that is generated with great Velocity, comes in a Variety of types, and is too large in Volume to analyze using traditional methods and technologies
- Best used for exploratory analytics and transformations of large volumes of data (store first, ask questions later)
- Uses a multitude of data sources, including internal, external, social media and streams
- Data can be structured, unstructured or semi-structured and, if necessary, is transformed during analysis
- Analytics tends to be predictive
- Technology examples: Hadoop (Pivotal HD, Hortonworks, Cloudera etc.)

Data types:
- Structured (20%): column-oriented data with a machine-readable structure
- Semi-structured: XML, email etc.; data that seemingly has a structure but needs to be transformed for analysis
- Unstructured: photo, video etc.; data with no clear structure that needs to be transformed for analysis

Why We Love Hadoop
- Flexible
- Scalable
- Inexpensive
- Fault-tolerant
- Rapidly adopted

What We Wish For, In Addition
- Ease of provisioning and management
- Plug-in support for the ecosystem of tools
- Elasticity of storage and compute
- Improved data management
- Interactive query response
- True SQL query interface
- Security controls

Our Hadoop Architecture

Core Hadoop Components
- HDFS: the Hadoop Distributed File System acts as the storage layer for Hadoop
- MapReduce: parallel processing framework used for data computation in Hadoop (see the sketch below)
- Hive: structured, data-warehouse implementation for data in HDFS that provides a SQL-like interface to Hadoop
- Sqoop: batch database-to-Hadoop data transfer framework
- Pig: high-level procedural language for data pipeline/data flow processing in Hadoop; Pig Latin syntax
- HBase: NoSQL, key-value data store on top of HDFS
- Mahout: library of scalable machine-learning algorithms
- Spring Hadoop: integrates the Spring framework into Hadoop
- Flume: data collection and loading utility (logs etc.)
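To make the MapReduce component concrete, here is a minimal word-count job written against the standard Hadoop Java API; it is a generic sketch, not code from the deck, and the input/output HDFS paths are supplied as illustrative command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input split read from HDFS.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word; instances run in parallel across the cluster.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```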

Pivotal HD Enterprise with ADS (HAWQ)
In addition to the core Hadoop components, Pivotal HD focuses on delivering the enterprise-class features required by our target customers and prospects. These features drive data-worker productivity, enable massively parallel data loading, support enterprise-grade storage options, and can be deployed in virtualized environments.

Pivotal HD Enterprise includes the core Hadoop components plus:
- Command Center: visual interface for cluster health, system metrics, and job monitoring
- Hadoop Virtualization Extension (HVE): enhances Hadoop to support virtual node awareness and enables greater cluster elasticity
- Data Loader: parallel loading infrastructure that supports line-speed data loading into HDFS
- Isilon Integration: extensively tested at scale, with guidelines for compute-heavy, storage-heavy, and balanced configurations
- Spark: in-memory data processing engine, used here for ingesting streams into HDFS

ADS adds the following to Pivotal HD Enterprise:
- Advanced Database Services (HAWQ): high-performance, true SQL query interface running within the Hadoop cluster
- Xtensions Framework: support for ADS interfaces on external data providers (HBase, Avro, etc.)
- Advanced Analytics Functions (MADlib): ability to access parallelized machine-learning and data-mining functions at scale
- Unified Storage Services (USS) and Unified Catalog Services (UCS): support for tiered storage (hot, warm, cold) and integration of multiple data provider catalogs into a single interface

HAWQ: The Crown Jewels of Greenplum
- MPP DBMS on Hadoop/HDFS
- Out-of-the-box true ANSI SQL for Hadoop
- ACID compliant
- High-performance query processing
- Multi-petabyte scalability
- Interactive and true ANSI SQL support
- Enterprise-class database services: column storage and indexes, workload management
- Allows consolidating your BI and Big Data analytics environments

This Changes Everything
- Leverage existing SQL skill sets for Hadoop
- True SQL interfaces for data workers and data tools
- Broad range of data format support: operate on data in place, or optimize for query response time
- Single Hadoop infrastructure for Big Data exploration AND analysis
- ODBC/JDBC APIs enable effortless use of analytic tools such as SAS, Tableau and R with Hadoop (see the JDBC sketch below)
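Because HAWQ is derived from Greenplum and speaks a PostgreSQL-compatible protocol, it is typically reached through PostgreSQL-compatible ODBC/JDBC drivers. The sketch below assumes the open-source PostgreSQL JDBC driver; the host, port, database, credentials and the "clickstream" table are illustrative assumptions, not values from the deck.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HawqJdbcExample {
    public static void main(String[] args) throws Exception {
        // Illustrative connection string: HAWQ master host, port and database are assumptions.
        String url = "jdbc:postgresql://hawq-master.example.com:5432/analytics";

        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret")) {
            // Ordinary ANSI SQL executed inside the Hadoop cluster; "clickstream" is a hypothetical table.
            String sql = "SELECT page, COUNT(*) AS hits "
                       + "FROM clickstream "
                       + "WHERE event_date >= ? "
                       + "GROUP BY page ORDER BY hits DESC LIMIT 10";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setDate(1, java.sql.Date.valueOf("2015-01-01"));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s -> %d%n", rs.getString("page"), rs.getLong("hits"));
                    }
                }
            }
        }
    }
}
```

The same connection works from BI tools such as SAS, Tableau or R, which is the point the slide makes: the Hadoop cluster looks like an ordinary SQL database to the client.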

HAWQ Benchmarks (query times and speedup, as shown on the slide)
- User intelligence: 4.2 vs. 198 (47X)
- Sales analysis: 8.7 vs. 161 (19X)
- Click analysis: 2.0 vs. 415 (208X)
- Data exploration: 2.7 vs. 1,285 (476X)
- BI drill-down: 2.8 vs. 1,815 (648X)

Xtension Framework
Gives you the ability to read different data/file types from HDFS using the HAWQ SQL interface and statistical functions:
- HBase data
- Hive data
- Native data on HDFS in a variety of formats
- HDFS files and RDBMS files

Benefits:
- Join in-database dimensions with other fact tables
- Fast ingest of data into SQL-native format (insert into ... select * from ...)
- Extensible API
- Reduced need for ETL processing for SQL analytics

(Diagram: MapReduce, Hive and HBase on one side, SQL on the other, separated by "The Great Divide" that the framework bridges. See the external-table sketch below.)
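For illustration only: the Xtension Framework exposes external data to HAWQ as external tables. The sketch below assumes a PXF-style external-table DDL issued over the same JDBC connection as before; the host, port, table and column names, and the target table "sales_native", are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class XtensionExternalTableExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://hawq-master.example.com:5432/analytics";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret");
             Statement stmt = conn.createStatement()) {

            // Hypothetical PXF-style external table over an HBase table named "sales_events";
            // column names are illustrative, not the real HBase column mapping.
            stmt.execute(
                "CREATE EXTERNAL TABLE ext_sales_events (recordkey TEXT, amount FLOAT8) " +
                "LOCATION ('pxf://namenode.example.com:51200/sales_events?PROFILE=HBase') " +
                "FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')");

            // Fast ingest into a SQL-native table (assumed to already exist), as the slide describes.
            stmt.execute("INSERT INTO sales_native SELECT * FROM ext_sales_events");
        }
    }
}
```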

Hadoop also utilizes an MPP architecture

Fast Data
- Data that requires a fast reaction at the time of creation
- Used with stream ingestion to the HDFS platform and event handling: catch it on the fly (see the streaming sketch below)
- Great for applications that are heavily transactional
- Many technologies allow long-term persistence to HDFS to mine additional long-term value (trends, patterns etc.)
- Technology examples: Pivotal GemFire, Apache Spark
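As a minimal sketch of the "stream ingestion to HDFS" pattern using Apache Spark Streaming in Java: events arrive on a socket, and each micro-batch is persisted to HDFS for later, long-term analysis. The source host/port and the HDFS path are illustrative assumptions.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FastDataIngest {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("FastDataIngest");
        // Micro-batches every 5 seconds.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Illustrative source: a TCP socket feeding one event per line.
        JavaReceiverInputDStream<String> events =
                jssc.socketTextStream("event-source.example.com", 9999);

        // Persist each non-empty micro-batch to HDFS; real-time handling would also go here.
        events.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                rdd.saveAsTextFile("hdfs://namenode.example.com:8020/fastdata/events-"
                        + System.currentTimeMillis());
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```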

Pivotal GemFire
- Data management grid architecture (data fabric)
- Very fast: in-memory data cache with highly optimized read and write throughput, 100-1000X greater performance than traditional disk-based databases
- Very scalable: vertical scaling (multiple instances per server) and horizontal scaling (multiple instances across multiple servers/LAN/WAN)
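A minimal GemFire client sketch in Java, using the classic com.gemstone packages (the same API now lives on as Apache Geode under org.apache.geode). The locator host/port and the server-side region name "customers" are assumptions for illustration.

```java
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.client.ClientCache;
import com.gemstone.gemfire.cache.client.ClientCacheFactory;
import com.gemstone.gemfire.cache.client.ClientRegionShortcut;

public class GemFireClientExample {
    public static void main(String[] args) {
        // Connect through a locator; host and port are illustrative.
        ClientCache cache = new ClientCacheFactory()
                .addPoolLocator("locator.example.com", 10334)
                .create();

        // PROXY regions hold no local state; every operation goes to the servers.
        Region<String, String> customers = cache
                .<String, String>createClientRegionFactory(ClientRegionShortcut.PROXY)
                .create("customers");

        // Reads and writes are served from server memory.
        customers.put("42", "Ada Lovelace");
        System.out.println(customers.get("42"));

        cache.close();
    }
}
```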

GemFire Architecture

Disk Stores
Disk stores support saving in-memory data to disk storage, for two purposes:
- Persistence: used to store a redundant copy of data (cached regions, gateway sender queues, PDX serialization metadata)
- Overflow: used as an extension of the in-memory cache, expanding the memory capacity of a region using disk storage (cached regions, gateway sender queues, server subscription queues)
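A sketch of configuring a disk store programmatically on a GemFire server-side (peer) cache with the classic com.gemstone API; in practice this is usually done via cache.xml or gfsh, and the directory, store name and region name here are assumptions. The REPLICATE_PERSISTENT_OVERFLOW shortcut combines both uses from the slide: persistence plus overflow to disk.

```java
import java.io.File;

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

public class DiskStoreExample {
    public static void main(String[] args) {
        // Server-side (peer) cache.
        Cache cache = new CacheFactory().create();

        // Create a named disk store backed by an illustrative local directory.
        cache.createDiskStoreFactory()
             .setDiskDirs(new File[] { new File("/var/gemfire/diskstore1") })
             .create("ordersDiskStore");

        // Region that persists its entries and overflows least-recently-used
        // values to the disk store when heap memory runs low.
        Region<String, String> orders = cache
                .<String, String>createRegionFactory(RegionShortcut.REPLICATE_PERSISTENT_OVERFLOW)
                .setDiskStoreName("ordersDiskStore")
                .create("orders");

        orders.put("order-1", "pending");
        cache.close();
    }
}
```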

GemFire Functions
- Functions are GemFire's equivalent of database stored procedures
- They execute business logic that is co-located with the data in memory: the fastest data access pattern
- Functions can be made asynchronous (fire-and-forget) by setting the hasResult flag to false and not returning a value
- Functions can be executed programmatically or manually through gfsh

Execution types:
- onRegion: execute on a region/partition, running code on the exact node where a specific key resides in a partitioned region
- onServers: execute on all servers in a pool, running code simultaneously on all nodes
- onServer: execute on a single server in a pool

(Diagram: client JVM invoking functions on server JVMs via locators. See the function sketch below.)
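A sketch of a data-aware function with the classic GemFire Java API; the function name, region usage and result are illustrative, and exact interface defaults vary slightly between GemFire and Apache Geode versions.

```java
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.execute.Function;
import com.gemstone.gemfire.cache.execute.FunctionContext;
import com.gemstone.gemfire.cache.execute.RegionFunctionContext;
import com.gemstone.gemfire.cache.partition.PartitionRegionHelper;

// Counts the entries stored locally on each member that hosts part of a partitioned region.
public class LocalEntryCountFunction implements Function {

    @Override
    public void execute(FunctionContext context) {
        RegionFunctionContext rfc = (RegionFunctionContext) context;
        // Only the data co-located on this member is visited: no network hops.
        Region<?, ?> localData = PartitionRegionHelper.getLocalDataForContext(rfc);
        context.getResultSender().lastResult(localData.size());
    }

    @Override
    public String getId() { return "LocalEntryCountFunction"; }

    @Override
    public boolean hasResult() { return true; }   // false would make the call fire-and-forget

    @Override
    public boolean optimizeForWrite() { return false; }

    @Override
    public boolean isHA() { return true; }
}
```

Invocation then looks roughly like `FunctionService.onRegion(region).execute(new LocalEntryCountFunction()).getResult()`; the `onServers(pool)` and `onServer(pool)` variants target a pool instead of a region, matching the execution types listed above.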

GemFire EventListeners
GemFire supports synchronous or asynchronous event management with one or more configured listeners.
- Event handlers are synchronous: they register their interest in one or more events and are notified when the events occur. If you need to change the cache or perform any other distributed operation from event handler callbacks, be careful to avoid activities that might block and affect overall system performance.
- AsyncEventListener instances are each serviced by their own dedicated thread, asynchronously, on the cache server, with two deployment options:
  - A serial queue is deployed to one GemFire member and delivers all of a region's events, in order, to a configured AsyncEventListener implementation.
  - A parallel queue is deployed to multiple GemFire members, and each instance of the queue simultaneously delivers region events to a local AsyncEventListener implementation.
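A sketch of an AsyncEventListener in Java (classic com.gemstone API); what it does with each batch, here just printing, is illustrative. The listener is attached to an async event queue, typically via gfsh or cache.xml, and that queue is then referenced from the region.

```java
import java.util.List;

import com.gemstone.gemfire.cache.asyncqueue.AsyncEvent;
import com.gemstone.gemfire.cache.asyncqueue.AsyncEventListener;

// Receives batches of region events on a dedicated thread, decoupled from the
// thread that performed the original cache operation.
public class AuditTrailListener implements AsyncEventListener {

    @Override
    public boolean processEvents(List<AsyncEvent> events) {
        for (AsyncEvent event : events) {
            // Illustrative side effect: forward the change to an external system or log it.
            System.out.printf("region=%s op=%s key=%s%n",
                    event.getRegion().getName(), event.getOperation(), event.getKey());
        }
        // Returning true tells GemFire the batch was handled and can be removed from the queue.
        return true;
    }

    @Override
    public void close() {
        // Release any external resources here.
    }
}
```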

The full analytics stack can be deployed to a private or public cloud, or to private infrastructure

Big Data Suite
- A subscription-based model from Pivotal
- Allows the use of all Pivotal data analytics tools and Cloud Foundry with a single license

External HDFS Storage

HDFS Storage on Isilon
- Highly scalable: multiple petabytes
- Scale-out infrastructure allows scaling capacity separately from compute
- No single point of failure
- Less data protection overhead than traditional DAS Hadoop
- Allows consolidation of production data, analytical data and data archival onto a single platform

Isilon vs. Traditional Hadoop Datanodes

Data Lake: Store Everything, Analyze Anything, Build What You Need

What is a data lake?
A centralized analysis architecture and repository designed to allow business units to:
- Store everything (in Isilon HDFS)
- Analyze anything, meaning any file type, from any source, at any time (with Pivotal HD, GemFire, Greenplum and other tools)
- Build what you need, in terms of applications that utilize big data (Pivotal software tools, e.g. Pivotal CF, Spring HD etc.)
and thereby discover additional business value from the data generated by their day-to-day business activities.

Open Data Platform Initiative: Taking Big Data Forward

Key Goals
1. Accelerate the delivery of Big Data solutions by providing a well-defined core platform.
2. Define, integrate, test, and certify a standard "ODP Core" of compatible versions of select Big Data open source projects.
3. Provide a stable base against which Big Data solutions providers can qualify solutions.
4. Produce a set of tools and methods that enable members to create and test differentiated offerings based on the ODP Core.
5. Reinforce the role of the Apache Software Foundation (ASF) in the development and governance of upstream projects.
6. Contribute to ASF projects in accordance with ASF processes and Intellectual Property guidelines.
7. Support community development and outreach activities that accelerate the rollout of modern data architectures that leverage Apache Hadoop.
8. Help minimize the fragmentation and duplication of effort within the industry.

Open Data Platform
- A standard Hadoop core (ODP Core) shared between participants
- Make additional components interoperable between different Hadoop distributions
- Prevent duplicate development effort and fragmentation of the platform
- Accelerate development
- Contribute to the ASF Hadoop project

Pivotal code contributions to ODP so far

ODP participants and contributors
- Platinum members: GE, Hortonworks, IBM, Pivotal, Infosys, SAS, an international telco
- Gold members: Capgemini, EMC, VMware, WANdisco, Altiscale, Teradata, Verizon, CenturyLink, PLDT, Splunk

Ambari: the new UI for Pivotal HD, resulting from the ODP collaboration