The Evolving Apache Hadoop Eco-System: What It Means for Big Data Analytics and Storage
Sanjay Radia, Architect/Founder, Hortonworks Inc. All Rights Reserved. Page 1
Outline
- Hadoop and Big Data Analytics: data platform drivers and how Hadoop changes the game; the data refinery; Hadoop in enterprise data architecture
- Hadoop (HDFS) and storage: RAID disks or not?; HDFS as a generic storage layer; archival
- Upcoming
Next-Generation Data Platform Drivers
- Business drivers: need for new business models & faster growth (20%+); find insights for competitive advantage & optimal returns
- Technical drivers: data continues to grow exponentially; data is increasingly everywhere and in many formats; legacy solutions are unfit for the new requirements and growth
- Financial drivers: cost of data systems, as a % of IT spend, continues to grow; cost advantages of commodity hardware & open source
Big Data Changes the Game
Transactions + Interactions + Observations = BIG DATA, with increasing data variety and complexity at each scale:
- Megabytes (ERP): purchase detail, purchase record, payment record
- Gigabytes (CRM): segmentation, offer details, offer history, customer touches, support contacts
- Terabytes (WEB): web logs, user click stream, A/B testing, dynamic pricing, behavioral targeting, search marketing, affiliate networks, dynamic funnels
- Petabytes (BIG DATA): mobile web, sentiment, SMS/MMS, speech to text, social interactions & feeds, spatial & GPS coordinates, sensors/RFID/devices, business data feeds, external demographics, user-generated content, HD video, audio, images, product/service logs
How Hadoop Changes the Game
- Cost: commodity servers and JBOD disks; horizontal scaling, adding new servers as needed; ride the technology curve; scale from small to very large, in size of data or size of computation
- The data refinery: traditional datamarts optimize for time and space; with Hadoop you don't need to pre-select a subset of the schema; you can experiment to gain insights into your whole data; your insights can change over time; you don't need to throw your old data away
- Open and growing eco-system: you aren't locked into a vendor, and there is an ever-growing choice of tools
Vertical-Specific Uses of Hadoop
- Retail & Web: analytical (social network analysis, log analysis/site optimization); operational (session & content optimization, dynamic pricing)
- Retail: analytical (brand and sentiment analysis, loyalty program optimization); operational (dynamic pricing/targeted offers)
- Intelligence: analytical (person-of-interest discovery); operational (threat identification, cross-jurisdiction queries)
- Finance: analytical (risk modeling & fraud identification, trade performance analytics); operational (surveillance and fraud detection, customer risk analysis)
- Energy: analytical (smart grid production optimization, grid failure prevention); operational (smart meters, individual power grid)
- Manufacturing: analytical (customer churn analysis, supply chain optimization); operational (dynamic delivery, replacement parts)
- Healthcare & Payer: analytical (electronic medical records (EMPI), clinical trials analysis); operational (insurance premium determination)
Transactions + Interactions + Observations
Sources (Web, Mobile, CRM, ERP, SCM; business transactions & interactions; docs, text, XML; web logs, clicks; audio, video, images; social, graph, feeds; sensors, devices, RFID; spatial, GPS; events and other data) feed the Big Data Refinery:
1. Classic ETL processing
2. Store, aggregate, and transform multi-structured data to unlock value
3. Share refined data and runtime models with Business Intelligence & Analytics (dashboards, reports, visualization)
4. Retain runtime models and historical data for ongoing refinement & analysis
5. Retain historical data to unlock additional value
Next-Generation Big Data Architecture
Sources (Web, Mobile, CRM, ERP, SCM; audio, video, images; docs, text, XML; web logs, clicks; social, graph, feeds; sensors, devices, RFID; spatial, GPS; events; business transactions & interactions) flow into the Big Data Refinery, which exchanges data with SQL/NoSQL/NewSQL stores and EDW/MPP/NewSQL systems, which in turn serve Business Intelligence & Analytics (dashboards, reports, visualization).
Hadoop in Enterprise Data Architecture
- Data sources (transactions, observations, interactions): CRM, ERP, financials; web applications; social media; exhaust data, logs, files
- Hadoop platform: HDFS, MapReduce, HBase, Hive, Pig, HCatalog; data access via Templeton, WebHDFS, Sqoop, FlumeNG; operations via Ambari, Oozie, HA, ZooKeeper
- Existing business infrastructure: IDE & dev tools, ODS & datamarts, applications & spreadsheets, visualization & intelligence tools, EDW, low-latency/NoSQL stores, custom applications
- New tech: Datameer, Tableau, Karmasphere, Splunk (discovery tools, operations)
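The WebHDFS gateway mentioned above exposes HDFS over plain HTTP, which is what lets non-Java tools in the surrounding architecture reach the file system. A minimal sketch of how a thin client builds WebHDFS v1 REST URLs; the host name is hypothetical, and the default NameNode HTTP port 50070 is assumed:

```python
# Sketch: constructing WebHDFS v1 REST URLs like those served by the
# WebHDFS gateway. Host name "nn.example.com" is a made-up example.
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS v1 REST URL for an HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# A thin client can list a directory or read a file over HTTP:
print(webhdfs_url("nn.example.com", "/data/logs", "LISTSTATUS"))
print(webhdfs_url("nn.example.com", "/data/logs/part-0", "OPEN", offset=0))
```

Because the protocol is just HTTP, any language with an HTTP client can act as an HDFS client, which is the "thin-client access" point made on the metadata-services slide as well.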
Metadata Services
Apache HCatalog provides flexible metadata services across tools and for external access:
- Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive)
- Accessibility: share data as tables into and out of HDFS
- Availability: enables flexible, thin-client access via a REST API
Without HCatalog, raw Hadoop data is inconsistent or unknown and access is tool-specific; with HCatalog there is table access, aligned metadata, and a REST API. Shared table and schema management opens the platform.
Outline
- Hadoop and Big Data Analytics: data platform drivers and how Hadoop changes the game; the data refinery; Hadoop in enterprise data architecture
- Hadoop (HDFS) and storage: RAID disks or not?; HDFS as a generic storage layer; archival
- Upcoming
HDFS: Scalable, Reliable, Manageable
- Scale IO, storage, CPU: add commodity servers & JBODs; 6K nodes in a cluster, 120 PB
- Fault tolerant & easy management: built-in redundancy; tolerates disk and node failures; automatically manages addition/removal of nodes; one operator per 3K nodes
- Storage servers are used for computation: move computation to the data; not a SAN, but high-bandwidth network access to data via Ethernet
- Scalable file system: read, write, rename, append; no random writes
- Simplicity of design is why a small team could build such a large system in the first place
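The built-in redundancy above comes from block replication with rack awareness: by default HDFS puts the first replica on the writer's node and the other two on two different nodes of one remote rack, so a whole-rack failure never loses all copies. A simplified sketch of that default placement policy (the real chooser also weighs load and free space, which this model omits):

```python
# Toy model of the HDFS default block placement policy:
# replica 1 on the writer's node, replicas 2 and 3 on two
# different nodes of a single remote rack.
import random

def place_replicas(writer_node, nodes_by_rack):
    """Return 3 replica locations for a block written on writer_node.
    nodes_by_rack maps rack name -> list of node names; every rack
    is assumed to have at least 2 nodes."""
    local_rack = next(r for r, ns in nodes_by_rack.items()
                      if writer_node in ns)
    remote_rack = random.choice([r for r in nodes_by_rack
                                 if r != local_rack])
    # Two distinct nodes on the chosen remote rack.
    remote_nodes = random.sample(nodes_by_rack[remote_rack], 2)
    return [writer_node] + remote_nodes
```

Placing two replicas off-rack is the trade-off behind "tolerates disk and node failures": one rack's worth of write bandwidth is spent on cross-rack traffic in exchange for surviving rack-level faults.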
HDFS Architecture: Two Key Layers
Namespace layer:
- Consists of dirs, files and blocks
- Operations: create, delete, modify and list dirs/files
- Multiple independent namespaces, which can be mounted using client-side mount tables
Block storage layer:
- Generic block service: DataNode cluster membership; block pool = the set of blocks for a namespace volume; storage shared across all volumes, with no partitioning (Namespace Volume = Namespace + Block Pool)
- Block operations: create/delete/modify/getBlockLocations; read and write access to blocks
- Detect and tolerate faults: monitor node/disk health, periodic block verification; recover from failures via replication and placement
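The layering above, where each namespace volume owns its own block pool while the DataNodes' physical storage is shared across all volumes, can be sketched as a toy model (class and method names are illustrative, not HDFS APIs):

```python
# Toy model of federated HDFS layering: namespace volumes are
# independent, but their blocks land side by side on shared DataNodes.
class NamespaceVolume:
    """Namespace + its own block pool (Namespace Volume)."""
    def __init__(self, name):
        self.name = name
        self.files = {}           # path -> list of block ids
        self.block_pool = set()   # blocks owned by this volume only
        self._next_id = 0

    def create_file(self, path, nblocks):
        """Allocate blocks in this volume's pool for a new file."""
        ids = [f"{self.name}-blk{self._next_id + i}" for i in range(nblocks)]
        self._next_id += nblocks
        self.files[path] = ids
        self.block_pool.update(ids)
        return ids

class DataNode:
    """Shared storage: holds blocks from every volume's pool."""
    def __init__(self):
        self.blocks = set()

    def store(self, block_ids):
        self.blocks.update(block_ids)
```

Because block pools are disjoint, a namespace master can be added, removed, or even reimplemented without touching the others, which is the independence that makes federation "simple and robust" on the next slide.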
File Systems Background: Leading to Google FS and HDFS
- Separation of metadata from data (1978, 1980): "Separating Data from Function in a Distributed File System" (1978) by J. E. Israel, J. G. Mitchell, H. E. Sturgis; "A Universal File Server" (1980) by A. D. Birrell, R. M. Needham
- Horizontal scaling of storage nodes and I/O bandwidth (1999-2003): several startups building scalable NFS in the late 1990s; Lustre (1999-2002); Google FS (2003); later Hadoop's HDFS and pNFS
- Commodity hardware with JBODs, replication, non-POSIX semantics: Google FS (2003); not using RAID was a very important decision
- Computation close to the data: parallel DBs; Google FS/MapReduce
Significance of Not Using Disk RAID
- The key new idea was to not use RAID on the local disks and instead replicate the data
- Uniformly deals with device failures, media failures, node failures, etc.; nodes and clusters continue to run, with slightly lower capacity, and nodes and disks can be fixed when convenient
- Recovery from failures runs in parallel: a RAID-5 rebuild takes over half a day for a 1 TB disk failure, while HDFS recovers 12 TB in minutes because recovery is parallel, and it gets faster on larger clusters
- Operational advantage: the system recovers automatically and rapidly from node and disk failures; 1 operator per 4K servers at mature Hadoop installations
- Yes, there is storage overhead; file-RAID is available (1.4-1.6x efficiency) and is practical for a portion of the data, otherwise node recovery takes too long
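The recovery claims above can be checked with back-of-envelope arithmetic: a RAID rebuild is serial because a single spare disk must absorb every write, while HDFS re-replication is parallel because every surviving node copies a small share of the lost blocks. The throughput figures below are illustrative assumptions, not measurements from the talk:

```python
# Back-of-envelope model of serial RAID rebuild vs parallel HDFS
# re-replication. 50 MB/s per disk/node is an assumed, illustrative rate.
def raid_rebuild_hours(disk_tb, rebuild_mb_s=50):
    """Serial rebuild: one spare disk absorbs the whole write stream."""
    return disk_tb * 1e6 / rebuild_mb_s / 3600

def hdfs_rereplication_hours(lost_tb, nodes, per_node_mb_s=50):
    """Parallel recovery: surviving nodes re-replicate missing blocks
    concurrently, so throughput scales with cluster size."""
    return lost_tb * 1e6 / (nodes * per_node_mb_s) / 3600

# 1 TB disk behind RAID vs 12 TB node loss on a 1000-node cluster:
print(f"RAID rebuild:        {raid_rebuild_hours(1):.1f} h")
print(f"HDFS re-replication: {hdfs_rereplication_hours(12, 1000) * 60:.1f} min")
```

Under these assumptions the RAID rebuild takes hours for 1 TB while the cluster re-replicates 12 TB in a few minutes, matching the slide's "recovery is parallel, faster for larger clusters" argument.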
HDFS as a Generic Storage Service: Opportunities for Innovation
- Federation: a distributed (partitioned) namespace; simple and robust due to independent masters; scalability, isolation, availability
- Alternate NameNode implementations: new namespace services on the shared block storage, e.g. a new file system keeping only a partial namespace in memory
- New services directly on block storage: MapReduce tmp storage and HBase on block pools; shadow file systems that cache HDFS, NFS, S3
- Future: move block management into the DataNodes; this simplifies namespace/application implementation, and a distributed namenode becomes significantly simpler
Archival Data: Where Should It Sit?
- Hadoop encourages keeping old data for future analysis
- Should it sit in a separate cluster with lower computing power? Spindles are one of the bottlenecks in a Big Data system; we can't waste precious spindles on another cluster where they are underutilized
- Better to keep archival data in the main cluster where the spindles can be actively used; as data grows over time, a Big Data cluster may have a smaller percentage of hot data
- Should we arrange data on the disk platter based on access patterns? The challenge is ever-growing data
- Do tapes play a role?
HDFS in Hadoop 1 and Hadoop 2
Hadoop 1 (GA):
- Security
- Append/fsync (HBase)
- WebHDFS
- Performance improvements
- Disk-fail-in-place
- HA NameNode and Full Stack HA in progress
Hadoop 2 (alpha):
- New append
- Federation
- WebHDFS + SPNEGO
- Wire compatibility
- Write pipeline improvements; edit logs rewrite; local write optimization; faster startup
- HA NameNode (hot)
- Full Stack HA
Hadoop Full Stack HA Architecture
- Slave nodes of the Hadoop cluster run jobs; applications can also run outside the cluster
- The master daemons (NameNode, JobTracker) run on an N+K failover HA cluster
- On failover, the JobTracker enters safemode until the NameNode is back
Upcoming
- Continued improvements: performance, monitoring and management
- Rolling upgrades
- Disaster recovery; snapshots (a prototype has already been published)
- Support for heterogeneous storage on DataNodes: will allow flash disks to be used for specific use cases
- Block grouping: allow a large number of smaller blocks
- Working set of the namespace in memory: support a very large number of files
- Other protocols: NFS
Which Apache Hadoop Distro?
- Stable, reliable, well supported
- But without locking into a vendor
Hortonworks Data Platform
Delivers enterprise-grade functionality on a proven Apache Hadoop distribution to ease management, simplify use and ease integration into the enterprise:
- Simplified deployment to get started quickly and easily
- Monitor and manage any size cluster with familiar consoles and tools
- The only platform to include data integration services to interact with any data source
- Metadata services open the platform for integration with existing applications
- Dependable high-availability architecture
- The only 100% open source data platform for Apache Hadoop
Hortonworks Data Platform
- Hadoop 1.0.x is the stable kernel; HDP integrates and tests all necessary Apache component projects, choosing the most stable and compatible version of each (e.g. Hadoop 1.0.3, Pig 0.9.2, Hive 0.9.0+, HCatalog 0.4.0, HBase 0.92.1+, ZooKeeper 3.3.4, Oozie 3.1.3, Sqoop, Ambari (beta), Talend 5.1.1)
- Built with the QE/beta process that produced every single stable Apache Hadoop release; Apache Hadoop 2 is being QE'ed and alpha/beta tested through the same process by Hortonworks
- Closely aligned with each Apache component project's code line, with a closedown process that provides the necessary buffer
- A tested, hardened & proven distribution reduces risk
Summary
- Hadoop changes the Big Data game in fundamental ways: cost, storage and compute capacity; horizontal scaling from small to very large
- Data refinery: keep all your data, gain new business insights
- Hooks to integrate with existing enterprise data architecture
- Open, growing eco-system: no lock-in
- Hortonworks and the HDP distribution: the team that originally created Apache Hadoop at Yahoo and that is driving key developments in Apache Hadoop; QE'ed and alpha/beta tested every stable Hadoop release, including Hadoop 1 and Hadoop 2 (alpha); HDP is fully open source, with nothing held back, and close or identical to the Apache releases
Thank You! Questions & Answers