Application Development for the Cloud: A Paradigm Shift




Application Development for the Cloud: A Paradigm Shift
Ramesh Rangachar, Intelsat
© 2012 by Intelsat. Published by The Aerospace Corporation with permission.
New 2007 Template - 1

Motivation for the Paradigm Shift
- The performance of traditional applications is becoming inadequate
- Data is growing faster than Moore's Law [IDC study]; the new buzzword is Big Data
- Application performance and scalability are limited by the performance and scalability of storage (disk I/O) and the network
- Applications must capitalize on the benefits provided by the cloud environment
- There are no significant improvements to be gained by scaling up (scaling vertically, i.e., upgrading a single node)
- Scaling out (scaling horizontally, i.e., adding more nodes) is a better approach

How Are Others Solving the Big Data Problem?
- Google invented a software framework called MapReduce to support distributed computing on large data sets on clusters of computers
- Google also invented Bigtable, a distributed storage system for managing petabytes of data
- Hadoop: an open-source Apache project inspired by the work at Google; a software framework for reliable, scalable, distributed computing
- There are a number of other Hadoop-related projects at Apache; major contributors include Yahoo!, Facebook, and Twitter
- S4 Distributed Stream Computing Platform: an Apache incubator project that supports development of applications for processing continuous, unbounded streams of data
- Many more
- We'll review some technologies in the Hadoop ecosystem

What is Apache Hadoop?
- The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing
- A framework for the distributed processing of large data sets across clusters of computers using a simple programming model
- Designed to scale from a single server to thousands of servers, with each server offering local computation and storage
- The Hadoop core consists of:
  - Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
  - Hadoop MapReduce: a software framework for distributed processing of large data sets on clusters

MapReduce Data Flow
[Diagram: data flowing from input splits through map workers to reduce workers]
- In Hadoop MapReduce, the master is the JobTracker and each worker is a TaskTracker

MapReduce Example: Word Count
Input files (one line per file):
  cloud computing overview
  heterogeneous high performance cloud computing
  cloud computing applications in space system acquisitions
  applications development in the cloud
  space flight dynamics as a service
Map phase: workers emit one (word, 1) pair per word, e.g. cloud,1; computing,1; overview,1; ... into intermediate files
Reduce phase: workers sum the counts for each word and write the output files:
  a,1; acquisitions,1; applications,2; as,1; cloud,4; computing,3; development,1; dynamics,1; flight,1; heterogeneous,1; high,1; in,2; overview,1; performance,1; service,1; space,2; system,1; the,1
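The word-count flow above can be sketched in a few lines of Python. This is an illustration of the programming model only — in real Hadoop jobs the mapper and reducer are Java classes and the framework performs the shuffle across the cluster; the function names here are invented for the sketch.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

documents = [
    "cloud computing overview",
    "cloud computing applications in space system acquisitions",
]
intermediate = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# e.g. counts["cloud"] == 2 and counts["space"] == 1
```

Because map emits independent pairs and reduce only sees values grouped by key, both phases parallelize trivially across workers — which is the point of the model.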

HDFS
- Very large distributed file system on commodity hardware
- Designed to scale to petabytes of storage
- Single namespace for the entire cluster
- Optimized for batch processing
- Write-once-read-many access model
- Supported operations: read, write, delete; writes are append-only
- The NameNode handles replication, deletion, and creation
- DataNodes handle data retrieval

HDFS Architecture
- Files are broken up into blocks
- The block size is 64 MB (can be modified)
- Blocks are replicated on at least 3 nodes
- If a node fails, its blocks are automatically re-replicated to other nodes
[Diagram: a NameNode and a BackupNode coordinating a set of DataNodes]
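The block and replication figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below is not an HDFS API — `block_layout` is a hypothetical helper using the constants from the slide (64 MB blocks, 3-way replication).

```python
import math

# Constants from the slide: 64 MB default block size, 3-way replication.
BLOCK_SIZE = 64 * 1024**2
REPLICATION = 3

def block_layout(file_size_bytes):
    """Return (number of HDFS blocks, raw disk consumed across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    # HDFS does not pad the final block, so raw usage is size x replication.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

blocks, raw = block_layout(1 * 1024**3)  # a 1 GB file
# 1 GB / 64 MB -> 16 blocks; 3 GB of raw disk across the DataNodes
```

The large block size keeps per-block metadata on the NameNode manageable, and the 3 replicas are what allow blocks to be re-replicated automatically when a DataNode fails.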

Hadoop Cluster at Yahoo!
- Largest cluster: 4,000 nodes, 16 PB of raw disk, 64 TB of RAM

Who Uses Hadoop?
Adobe, AOL, Disney, eBay, Facebook, Google, Hulu, IBM, LinkedIn, New York Times, Powerset/Microsoft, Rackspace, StumbleUpon, Twitter, Yahoo!
Full list: http://wiki.apache.org/hadoop/poweredby

Examples
New York Times
- Converted public-domain articles from 1851-1922 to PDF
- Stored 4 TB of scanned images as input on Amazon S3
- Ran 100 Amazon EC2 instances for around 24 hours
- Produced 11 million articles (1.5 TB of PDF files) as output
- http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/
Visa
- Processed 73 billion transactions, amounting to 36 TB of data
- Traditional methods: 1 month; Hadoop: 13 minutes
- "How Hadoop Tames Enterprises' Big Data." Sreedhar Kajeepeta, InformationWeek Reports, Feb 2012

Sorting Benchmarks
"Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds"
http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/

The Hadoop Ecosystem
- HBase: initiated by Powerset. A scalable, distributed database that supports structured data storage for large tables
- Hive: initiated by Facebook. A data warehouse infrastructure that provides data summarization and ad hoc querying
- Pig: initiated by Yahoo!. A high-level data-flow language and execution framework for parallel computation
- ZooKeeper: initiated by Yahoo!. A high-performance coordination service for distributed applications
- Avro: a data serialization system

HBase
- Open-source project modeled after Google's Bigtable
- Distributed, large-scale data store
- Efficient at random reads/writes
- Use HBase when you:
  - need to store large amounts of data
  - need high write throughput
  - need efficient random access within large data sets
  - need to scale gracefully with data
  - need to store time-series (ordered) data
  - don't need full RDBMS capabilities (cross-row/cross-table transactions, joins, etc.)
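To make the data model behind those bullets concrete, here is a toy in-memory sketch of the Bigtable/HBase idea: rows kept sorted by key (which is what makes range scans cheap), and each cell addressed by (row key, column family:qualifier, timestamp), with the newest version returned on read. `TinyTable` and its methods are invented for illustration; the real HBase client API is Java (`Put`/`Get`/`Scan`).

```python
import bisect

class TinyTable:
    """Toy, in-memory sketch of the HBase/Bigtable data model."""

    def __init__(self):
        self.rows = {}   # row key -> {column -> [(timestamp, value), ...]}
        self.keys = []   # row keys kept sorted, which is what enables scans

    def put(self, row, column, value, ts):
        """Write one cell version at (row, column, ts)."""
        if row not in self.rows:
            bisect.insort(self.keys, row)
            self.rows[row] = {}
        self.rows[row].setdefault(column, []).append((ts, value))

    def get(self, row, column):
        """Random read: return the newest version of one cell."""
        versions = self.rows.get(row, {}).get(column, [])
        return max(versions)[1] if versions else None

    def scan(self, start, stop):
        """Range scan over the sorted row keys, [start, stop)."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return self.keys[lo:hi]

t = TinyTable()
t.put("user#41", "info:name", "Grace", ts=1)
t.put("user#42", "info:name", "Ada", ts=1)
t.put("user#42", "info:name", "Ada L.", ts=2)  # newer version shadows the old
```

Note how a "write" is only an append of a new timestamped version — there is no in-place update — which is why HBase sustains high write throughput while keeping reads to a single cell or key range efficient.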

HBase Use Cases
- Mozilla's Socorro crash-reporting system: 2.5 million crash reports per day
- Facebook
  - Real-time messaging system: HBase stores 135+ billion messages a month
  - Realtime analytics system: HBase processes 20 billion events per day
- Twitter, Yahoo!, TrendMicro, StumbleUpon, and more
- See http://wiki.apache.org/hadoop/hbase/poweredby

OpenTSDB
- A distributed, scalable time-series database developed by StumbleUpon
- An open-source monitoring system built on HBase; it is also a data-plotting system
- Used to monitor the computers in StumbleUpon's datacenter
- Collects 600 million data points per day, over 70 billion data points per year
- Data points are stored forever, retaining granular data
- Generating custom graphs and correlating events is easy, e.g., 15,464 points retrieved and 932 points plotted in 100 ms
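A key part of how OpenTSDB stores billions of granular points cheaply is its row-key design: roughly, one HBase row per metric per time bucket, with the offset inside the bucket as the column qualifier, so reading an hour of one metric touches a single row. The sketch below is a simplified illustration of that idea (the real schema packs metric and tag IDs into compact binary keys); `ts_cell` is a hypothetical helper.

```python
def ts_cell(metric, unix_ts, bucket=3600):
    """Map one data point to (row key, column offset) for a time-series store."""
    base = unix_ts - (unix_ts % bucket)          # start of the hour bucket
    return f"{metric}:{base}", unix_ts - base    # row key, seconds into bucket

row, col = ts_cell("sys.cpu.user", 7265)
# all points for the same metric in the same hour land in the same row
```

Because rows are sorted by key, all buckets of one metric are adjacent on disk, making "plot this metric over the last N hours" a short sequential scan rather than thousands of random reads.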

Typical Use Cases in the Ground Systems Domain
- Hadoop can be used to scale out applications that fit the write-once-read-many model
- Typical use cases:
  - Telemetry analysis and visualization
  - Real-time displays
  - Storage and retrieval of spacecraft and ground system events
  - Payload data processing

Summary
- To capitalize on the benefits of the cloud environment, applications must be designed to scale out rather than scale up
- Robust open-source tools that support computing on clusters of computers are available and in use today
- The Hadoop ecosystem is used in many enterprises
- The performance and scalability of ground systems can be improved by using the growing ecosystem of Hadoop tools and technologies
- Applications that fit the write-once-read-many model and process large amounts of data benefit from this approach

References
- Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM, Volume 51, Issue 1, pp. 107-113, 2008
- Apache Hadoop: http://hadoop.apache.org/
- "Apache Hadoop: An Introduction." Todd Lipcon, May 27, 2010
- "Hadoop Update." Eric Baldeschwieler, VP Hadoop Development, Yahoo!, Open Cirrus Summit 2009
- Cloudera Training: http://www.cloudera.com/hadoop-training
- "Realtime Apache Hadoop at Facebook." Dhruba Borthakur, Oct 5, 2011, UC Berkeley graduate course
- Apache HBase: http://hbase.apache.org
- http://www.slideshare.net/kevinweil/nosql-attwitter-nosql-eu-2010
- http://www.slideshare.net/ydn/3-hadooppigattwitterhadoopsummit2010
- "OpenTSDB, the Distributed, Scalable, Time Series Database." Benoît "tsuna" Sigoure, StumbleUpon