Addressing Open Source Big Data, Hadoop, and MapReduce limitations



Similar documents
Exploiting Data at Rest and Data in Motion with a Big Data Platform

Industry Impact of Big Data in the Cloud: An IBM Perspective

Deploying Big Data to the Cloud: Roadmap for Success

Are You Ready for Big Data?

Are You Ready for Big Data?

How to Leverage Big Data in the Cloud to Gain Competitive Advantage

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Virtualizing Apache Hadoop. June, 2012

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

The Future of Data Management

Big Data and Trusted Information

Big Data Management and Security

Getting Started Practical Input For Your Roadmap

Real Time Big Data Processing

Modernizing Your Data Warehouse for Hadoop

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Information Builders Mission & Value Proposition

Customized Report- Big Data

Big Data and Advanced Analytics Applications and Capabilities Steven Hagan, Vice President, Server Technologies

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

How To Handle Big Data With A Data Scientist

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Luncheon Webinar Series May 13, 2013

Oracle Database 12c Plug In. Switch On. Get SMART.

How Transactional Analytics is Changing the Future of Business A look at the options, use cases, and anti-patterns

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Testing Big data is one of the biggest

Using Big Data for Smarter Decision Making. Colin White, BI Research July 2011 Sponsored by IBM

How the oil and gas industry can gain value from Big Data?

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

BAO & Big Data Overview Applied to Real-time Campaign GSE. Joel Viale Telecom Solutions Lab Solution Architect. Telecom Solutions Lab

Hadoop & Spark Using Amazon EMR

DeIC Watson Agreement - hvad betyder den for DeIC medlemmerne

Big Data Realities Hadoop in the Enterprise Architecture

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Klarna Tech Talk: Mind the Data! Jeff Pollock InfoSphere Information Integration & Governance

Transforming the Telecoms Business using Big Data and Analytics

Turbo-Charging Open Source Hadoop for Faster, more Meaningful Insights

Tap into Hadoop and Other No SQL Sources

Talend Big Data. Delivering instant value from all your data. Talend

The Retail Analytics Challenge: Smarter Retail through Advanced Analytics & Optimization

Azure Data Lake Analytics

Business Intelligence for Big Data

HDP Enabling the Modern Data Architecture

INDUSTRY BRIEF DATA CONSOLIDATION AND MULTI-TENANCY IN FINANCIAL SERVICES

Big Data and the new trends for BI and Analytics Juha Teljo Business Intelligence and Predictive Solutions Executive IBM Europe

Native Connectivity to Big Data Sources in MSTR 10

PEPPERDATA OVERVIEW AND DIFFERENTIATORS

Big Data Explained. An introduction to Big Data Science.

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Evolution from Big Data to Smart Data

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

BIG DATA What it is and how to use?

ENABLING GLOBAL HADOOP WITH EMC ELASTIC CLOUD STORAGE

Big Data on Microsoft Platform

How To Get An Advantage From Analytics

The New Normal: Get Ready for the Era of Extreme Information Management. John Mancini President, DigitalLandfill.

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

The Future of Data Management with Hadoop and the Enterprise Data Hub

Comprehensive Analytics on the Hortonworks Data Platform

So What s the Big Deal?

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Getting Started & Successful with Big Data

Hadoop IST 734 SS CHUNG

Tapping Into Hadoop and NoSQL Data Sources with MicroStrategy. Presented by: Jeffrey Zhang and Trishla Maru

Interactive data analytics drive insights

How To Make Data Streaming A Real Time Intelligence

Proact whitepaper on Big Data

Smarter Analytics. Barbara Cain. Driving Value from Big Data

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Hadoop implementation of MapReduce computational model. Ján Vaňo

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

BIG DATA CHALLENGES AND PERSPECTIVES

Introduction to Cloud Computing

Cisco Unified Data Center Solutions for MapR: Deliver Automated, High-Performance Hadoop Workloads

Ironside Group Rational Solutions

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

Informix The Intelligent Database for IoT

Hadoop: Embracing future hardware

Big Data Analytics Nokia

Architecting for the Internet of Things & Big Data

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

The Next Wave of Data Management. Is Big Data The New Normal?

Maximizing Hadoop Performance with Hardware Compression

Ubuntu and Hadoop: the perfect match

CDH AND BUSINESS CONTINUITY:

HDP Hadoop From concept to deployment.

Apache Hadoop: Past, Present, and Future

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Raul F. Chong Senior program manager Big data, DB2, and Cloud IM Cloud Computing Center of Competence - IBM Toronto Lab, Canada

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

YARN Apache Hadoop Next Generation Compute Platform

A New Era Of Analytic

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Transcription:

Addressing Open Source Big Data, Hadoop, and MapReduce limitations 1

Agenda What is Big Data / Hadoop? Limitations of the existing hadoop distributions Going enterprise with Hadoop 2

How Big are Data? 90 % of data in the world has been created last 2 years Data Media Content Machine Social IDC says the digital universe will represent 40 zettabytesin 2020 1 000 000 000 000 000 000 000 424 1 byte 4 3

yesterday Today And so what? The unseen information Hadoop NOSQL un-structured data Social Open data HDFS PIG, HIVE HBASE Hortonworks, Cloudera, Splunk 4

What can you do with big data? Financial Services Fraud detection Risk management 360 View of the Customer Utilities Weather impact analysis on power generation Transmission monitoring Smart grid management Transportation Weather and traffic impact on logistics and fuel consumption Health & Life Sciences Epidemic early warning system ICU monitoring Remote healthcare monitoring IT Transition log analysis for multiple transactional systems Cybersecurity Retail 360 View of the Customer Click-stream analysis Real-time promotions Telecommunications CDR processing Churn prediction Geomapping / marketing Network monitoring Law Enforcement Real-time multimodal surveillance Situational awareness Cyber security detection 5 5

6

Market definition Volume Velocity Variety 50x Information growing up to ZB Process information of 30 billions of RFID chips 80% of worldwide data are unstructured 2010 2020 Veracity 1/3 makers do not trust their information system to make decisions. 83% of CIO says analytics is the first path to competitiveness* *IBM CIO Study 7

Hadoop scope Hadoop focus on processing huge volume of different type of data 8 8

Do not confuse Big Data and Hadoop Bigdataisagenericterm: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications (http://en.wikipedia.org/wiki/big_data) Hadoop: Apache Hadoop is an open source framework that allows for distributed processing of large data sets across computing clusters and is the most widely used technology for Big Data processing. Hadoop framework has evolved into a set of tools and technologies to efficiently process, store and analyze huge amounts of varied data in a linear, scalable and reliable fashion. It is important to understand that Hadoop is not a complete replacement for the traditional enterprise Data Warehousing and Business Intelligence tools, but is a complementary approach to solve some of its challenges 9

Hadoop Distributed data processing Distributed data storage 10

Hadoop HDFS HDFS Client Name Node Data Node Data Node Data Node Data Node INPUT FILE Elastic Storage can replace HDFS 11

Hadoop Mapreduce INPUT SPLIT MAP SHUFFLE REDUCE RESULT Bear, 1 Bear, 1 Bear, 2 Deer Bear River Car Car River Deer Car Bear Deer Bear River Car Car River Deer, 1 Bear, 1 River, 1 Car, 1 Car, 1 River, 1 Car, 1 Car, 1 Car, 1 Deer, 1 Deer 1, Car, 3 Deer, 2 Bear, 2 Car, 3 Deer, 2 River, 2 Deer Car Bear Deer, 1 Car, 1 Bear, 1 River, 1 River, 1 River, 2 12

Is it HPC 2.0? It is HPDA Traditionally GOVT, EDU & RESEARCH Exotic and expensive BATCH Command line driven workloads Increasingly WIDESPREAD Accessible to all Entertainment, Finance, Gaming, xsps etc BATCH + REAL-TIME SOA API driven throughput and latency are key DEDICATED Homogeneous Infrastructure dedicated to specific applications STRUCTURED DATA problems involve mostly structured data DEDICATED + SHARED Heterogeneous infrastructure shared fluidly among groups and applications STRUCTURED + UNSTRUCTURED Data types unstructured or semi-structured email, video, documents, logs 13

Common pain points High-Availability Limited HA features in the workload engine HDFS NameNode lacks automatic failover logic Not scalable Large performance overhead during job initiation Scalability concerns No resources sharing Resource silos associated with MapReduce applications No way to manage a shared services model tied to an SLA Single purpose clusters - under utilized resources 14

Common pain points Lack of sophisticated scheduling engine Large jobs can still hog cluster resources Lack of real time resource monitoring Lack of granularity in priority management Lack of application life cycle / rolling upgrades No ability to run multiple Hadoop middleware in parallel No ability to manage/deploy multiple Hadoop versions 15

IBM Platform Symphony unique capabilities Multi-tenant 100% Hadoop Performances Scalable Robustness Production ready 16

Symphony brings unique capabilities to Big Data MapReduce Application 1 MapReduce Application 2 MapReduce Application 3 Other Workload Application 4 Job 1 Job 2 Job 1 Job 2 Job 1 Job 2 Job 1 Job 2 Job 3 Job N Job 3 Job N Job 3 Job N Job 3 Job N Job Tracker Job Tracker Job Tracker SSM Task Tracker Task Tracker Task Tracker SIM Platform Resource Orchestrator / Resource Monitoring Resource 1 Resource 2 Resource 1 Resource 2 Resource 1 Resource 2 Resource 1 Resource 2 Resource 3 Resource 4 Resource 3 Resource 4 Resource 3 Resource 4 Resource 3 Resource 4 Resource 5 Resource 6 Resource 5 Resource 6 Resource 5 Resource 6 Resource 5 Resource 6 Resource 7 Resource 8 Resource 7 Resource 8 Resource 7 Resource 8 Resource 7 Resource 8 Resource 9 Resource 10 Resource 9 Resource 10 Resource 9 Resource 10 Resource 9 Resource 10 Automated Resource Sharing 17

Multitenancy in IBM Platform Symphony Customers need to support diverse applications both Hadoop and non-hadoop Serial Batch MPI Parallel Distributed Workflow On-line SOA Hadoop MapReduce OpenStack Cloud Yarn Applications Platform Computing Resource Manager HDFS, Elastic Storage (reliable distributed storage) Cluster Management (Provisioning, management of private, hybrid and public cloud infrastructure) With sophisticated multitenancy, customers can share a broader set of application types and scheduling patterns on a common resource foundation 18

IBM InfoSphere BigInsights For Hadoop Platform Computing performance gain on average over open source Hadoop 1 x Audited STAC Report Securities Technology Analysis Center IBM InfoSphere BigInsights for Hadoop Open Source 1. 4x is approximate value. See the STAC Report at http://www.stacresearch.com/node/15370. Testing involved the SWIM benchmark (https://github.com/swimprojectucb/swim) and jobs derived from production workload traces. Testing was conducted in controlled laboratory conditions. 19

A Single Dashboard Manage multiple tenants including BigInsights and third party workloads 20 IBM Internal Use Only

Transparent access Applications think they have their own private cluster, but with configurable access to shared data 21 IBM Internal Use Only

Flexible configuration of tenant applications Easily solve problems that normally would prevent workloads from sharing infrastructure 22 IBM Internal Use Only

Configurable, Dynamic Resource Sharing Establish configurable resource sharing policies that ensure service levels while maximizing utilization 23 IBM Internal Use Only

USAA deploys a multitenant, shared infrastructure 24 A shared environment for multiple Hadoop analytic applications Business problem Multiple lines of business deploying new applications driving infrastructure cost Rapid growth in storage requirements, traditional DW too expensive Need to facilitate rapid development and deployment of new Hadoop applications obtaining an information advantage Investing in commercial and in-house Hadoop based solutions to enhance existing data warehouse Concerned about uncontrolled growth of Hadoop environments as multiple lines deploy the technology Big Data Solution InfoSphere BigInsights for Hadoop Services and Analytics 18 IBM Platform Symphony for multi-tenancy IBM Elastic Storage (GPFS FPO) InfoSphere BigInsights Business Result Significant infrastructure cost avoidance Approx 30 applications on a shared infrastructure Estimated 4x performance gain on average

IBM Platform Symphony unique capabilities Multi-tenant 100% Hadoop Performances Scalable Robustness Production ready 25

Contact: Emmanuel Lecerf, Big Data expert EMEA Emmanuel.lecerf@fr.ibm.com 26