MapReduce and Intro to Cloud Computing. ID2210: Lecture 11 Jim Dowling

1 MapReduce and Intro to Cloud Computing ID2210: Lecture 11 Jim Dowling

2 Large-Scale Distributed Computing In #Nodes - BitTorrent (millions) - Peer-to-Peer In #Instructions/sec - Teraflops, Petaflops, Exascale - Super-Computing In #Bytes stored - Facebook: 30+ Petabytes (July 11)* - NoSQL storages In #Bytes processed/time - Google processes 24 petabytes of data per day -? * moves_30_petabyte_hadoop_cluster_to_new_data_center

3 Big Data The total amount of digital storage worldwide is approaching 1 zettabyte, or 1 million times the contents of the Earth's largest library. Currently that information is archived on equipment with a mass equivalent to 20 percent of Manhattan. Global data storage is expected to reach 35 zettabytes by 2020. (The Boston Globe, Editorial & Opinion, September 7, 2010)

4 Programming Large-Scale With thousands of servers available within a data centre, how do we: - write applications for them? - allocate and manage resources? Applications should also be scalable, reliable, and highly available. - Failures are expected with thousands of machines. - Need for load-balancing, handling heterogeneity.

5 Commodity Computing Challenges Cheap nodes fail, especially if you have many - Mean time between failures for 1 node = 3 years - Mean time between failures for 1000 nodes = 1 day - Solution: Build fault-tolerance into system Commodity networks have low(ish) bandwidth - Scan 100TB Datasets on 1000 node cluster with remote 10MB/s = 165 mins - Solution: Push computation to the data Programming distributed systems is hard - Solution: Provide a simple to use data-parallel programming model that distributes work and handles faults.
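The two back-of-the-envelope figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming independent node failures, 1 TB = 10^6 MB, and data spread evenly across the nodes; the function names are illustrative and not from the lecture.

```python
def cluster_mtbf_days(node_mtbf_years: float, nodes: int) -> float:
    """Expected time between failures anywhere in the cluster, in days.

    Assumes nodes fail independently, so failure rates add up.
    """
    return node_mtbf_years * 365 / nodes


def scan_minutes(dataset_tb: float, nodes: int, mb_per_sec: float) -> float:
    """Minutes to scan a dataset when every node reads an equal share in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # assumes 1 TB = 10^6 MB
    return mb_per_node / mb_per_sec / 60


mtbf = cluster_mtbf_days(3, 1000)      # roughly one failure per day
scan = scan_minutes(100, 1000, 10)     # close to the slide's 165 minutes
print(mtbf, scan)
```

With these assumptions each node scans 100 GB at 10 MB/s, which is 10,000 seconds, in line with the roughly 165 minutes quoted on the slide.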

6 Typical Large-Scale Programming Problem - Iterate over a large number of records - Extract something of interest from each (Map) - Shuffle and sort intermediate results - Aggregate intermediate results (Reduce) - Generate final output Key idea: provide a functional abstraction for these two operations

7 Programming in the Large Imagine we have hundreds of thousands of documents structured as follows: { "type": "blog", "id": "564", "tags": ["hdfs", "mysql", "cats"], "content": "<div>...</div>", "mentions": [ { "google": 6, "apple": 11, "microsoft": 1 } ] }

8 Extract all the mentions of Google

9 Count Mentions in each Blog, then Sum Up

10 Pseudocode for Map Phase Extract the blog_id and the number of mentions of google from each document: def mapper(docs): foreach (blog in docs.blogs): output(blog.blog_id, blog.mentions_google);

11 Pseudocode for Reduce Phases Sum up all the google mentions from the same blog_id: def reducer(blog_id, mentions_google): output(agg_id, sum(mentions_google)); Sum up all the google mentions from all blogs: def reducer(agg_id, mentions_google): output(count, sum(mentions_google));
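The map and reduce pseudocode can be tried out in a single process. The following is a minimal sketch, assuming documents shaped like the one on slide 7; the sample data and the "all" aggregation key are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample documents in the shape shown on slide 7.
docs = [
    {"type": "blog", "id": "564", "mentions": [{"google": 6, "apple": 11}]},
    {"type": "blog", "id": "564", "mentions": [{"google": 2}]},
    {"type": "blog", "id": "777", "mentions": [{"google": 5, "microsoft": 1}]},
]


def mapper(doc):
    """Emit (blog_id, google_mentions) pairs for one document."""
    for mention in doc["mentions"]:
        yield doc["id"], mention.get("google", 0)


def reducer(key, values):
    """Sum all the counts that share a key."""
    return key, sum(values)


# Map phase, then group the intermediate pairs by blog_id (the shuffle).
grouped = defaultdict(list)
for doc in docs:
    for blog_id, count in mapper(doc):
        grouped[blog_id].append(count)

# First reduce: total per blog. Second reduce: total over all blogs.
per_blog = dict(reducer(blog_id, counts) for blog_id, counts in grouped.items())
_, total = reducer("all", per_blog.values())

print(per_blog)  # {'564': 8, '777': 5}
print(total)     # 13
```

The same reducer function serves both phases because summation is associative; only the grouping key changes.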

12 Google's MapReduce Programming Model * Slide taken from tutorial by Jerry Zhao and Jelena Pjesivac-Grovic (Google Inc.): MapReduce The Programming Model and Practice. Tutorial held at SIGMETRICS 2009.

13 MapReduce Programming Model Input data type: key-value records. Map function: (K_in, V_in) → list(K_inter, V_inter). Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out).

14 MapReduce Programming Model map Takes input records - one by one - key, value Processes records - Independently Outputs intermediate - 1..n per input record - key, value reduce Takes intermediate results - Groups with same key - key, value [] Processes records - Group-wise Outputs result - Per group - Any format
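The map and reduce contracts described on the last two slides can be written down as a tiny single-process engine. This is a sketch for illustration only, not how Google MapReduce or Hadoop is implemented; `map_reduce`, `word_map` and `word_reduce` are hypothetical names, and the types are narrowed to string keys and integer values to keep the example short.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(
    records: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], Iterable[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Run map, shuffle/sort, and reduce in a single process."""
    # Map: each input record is processed independently.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for k_inter, v_inter in map_fn(key, value):
            # Shuffle: group intermediate values by key.
            intermediate[k_inter].append(v_inter)

    # Reduce: each key group is processed as a unit, in sorted key order.
    output: List[Tuple[str, int]] = []
    for k_inter in sorted(intermediate):
        output.extend(reduce_fn(k_inter, intermediate[k_inter]))
    return output


def word_map(filename: str, text: str):
    """Emit (word, 1) for every word in one input record."""
    for word in text.split():
        yield word.lower(), 1


def word_reduce(word: str, counts: List[int]):
    """Emit one (word, total) pair per key group."""
    yield word, sum(counts)


result = map_reduce(
    [("a.txt", "the cat"), ("b.txt", "The dog the end")],
    word_map,
    word_reduce,
)
print(result)  # [('cat', 1), ('dog', 1), ('end', 1), ('the', 3)]
```

Because `map_fn` sees one record at a time and `reduce_fn` sees one key group at a time, a real framework is free to run either on many machines in parallel.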

15 MapReduce Workflow [Figure: input files (partition1, partition2, partition3) are read by map-phase workers, which write intermediate output; reduce-phase workers read the intermediate output and write the output file(s).]

16 MapReduce Basics MapReduce is a programming model (and framework) that hides the complexity of work distribution and fault tolerance. Principal design philosophies: - Near-linear scalability for data sets - Low cost: reduce hardware, programming and admin costs MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time.

17 MapReduce-Like Implementations
             | Google MR   | Hadoop                                   | Dryad
Availability | Proprietary | Open Source                              | Proprietary
Used by      | Google      | Yahoo!, Facebook, Amazon (EC2!), Twitter | Microsoft
Native API   | C++         | Java                                     | C++

18 Load Balancing, Failure, and Stragglers Load Balancing - Break a MapReduce job into small tasks - Schedule tasks on workers as they report idle status MapReduce functions are side-effect free - Enables failed (and partially completed) tasks to be re-executed without any problems (on a different machine) - When a worker fails, its tasks can be reallocated to other workers Identify and handle stragglers (slow workers) - Restart slow tasks on new workers - Stragglers appear with increasing probability as the number of workers increases
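The scheduling and re-execution ideas on this slide can be sketched as a small sequential simulation. It is a toy model under stated assumptions: `run_job` and `fails_on` are hypothetical names, failures are scripted rather than random, a crashed worker never comes back, and straggler handling is left out.

```python
from collections import deque


def run_job(tasks, workers, fails_on):
    """Hand tasks to idle workers and re-queue any task whose worker crashes.

    fails_on maps a worker name to the set of tasks on which it
    (hypothetically) crashes. Returns {task: worker that completed it}.
    """
    pending = deque(tasks)
    idle = deque(workers)
    done = {}
    while pending:
        if not idle:
            raise RuntimeError("no workers left to run the remaining tasks")
        task = pending.popleft()
        worker = idle.popleft()
        if task in fails_on.get(worker, set()):
            # Tasks are side-effect free, so the task can simply run again
            # elsewhere. The crashed worker is not returned to the idle queue.
            pending.append(task)
        else:
            done[task] = worker
            idle.append(worker)  # the worker reports idle again
    return done


done = run_job(
    tasks=["m0", "m1", "m2", "m3"],
    workers=["w1", "w2", "w3"],
    fails_on={"w2": {"m1"}},
)
print(done)
```

In this run worker w2 crashes on task m1, the task goes back on the queue, and another worker completes it later.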

19 Components in a Hadoop MR Workflow Next few slides are from:

20 Job Submission

21 Initialization

22 Scheduling

23 Execution

24 Pig Latin: a relational data-flow language MapReduce programs are quite low-level. A higher-level programming model was needed for processing semi-structured data sets using the MapReduce platform Pig Latin is a procedural, relational data-flow language that is implemented using MapReduce Pig Engine: Parser, Optimizer and distributed query execution

25 Pig Example* Input: User profiles, Page visits Problem: Find the top 5 most visited pages by users aged 18 to 25 Steps: Load Users, Load Pages, Filter by Age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5 *Example taken from Yahoo Hadoop tutorial from Middleware

26 In MapReduce

27 In PigLatin Users = load 'users' as (name, age); Filtered = filter Users by age >= 18 and age <= 25; Pages = load 'pages' as (user, url); Joined = join Filtered by name, Pages by user; Grouped = group Joined by url; Summed = foreach Grouped generate group, COUNT(Joined) as clicks; Sorted = order Summed by clicks desc; Top5 = limit Sorted 5; store Top5 into 'top5sites';
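For comparison, the same data flow can be written as an in-memory Python sketch. The sample rows are invented, and the join is done with a set lookup, which assumes user names are unique; Pig itself would compile the script into MapReduce jobs instead.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' inputs.
users = [("amy", 22), ("bob", 31), ("cat", 18)]
pages = [("amy", "a.com"), ("cat", "a.com"), ("bob", "b.com"), ("amy", "c.com")]

# Filter by age.
filtered = {name for name, age in users if 18 <= age <= 25}

# Join on name, group on url, count clicks.
clicks = Counter(url for user, url in pages if user in filtered)

# Order by clicks (descending) and take the top 5.
top5 = clicks.most_common(5)
print(top5)  # [('a.com', 2), ('c.com', 1)]
```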

28 Pig Latin Data Types Tuple: Ordered set of fields - Field can be simple or complex type - Nested relational model Bag: Collection of tuples. - Can contain duplicates Map: Set of (key, value) pairs Primitive types - int, long, float, double, chararray, bytearray

29 Pig Architecture

30 Hive and HBase Hive and Pig were parallel projects developed at Facebook and Yahoo, respectively. HiveQL is closer to SQL from traditional RDBMSs than Pig Latin (which is procedural). - Due to the limitations of MapReduce, and the fact that HiveQL is compiled into a MapReduce query plan, HiveQL is a cut-down version of SQL. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. A lot of Hive programs run over HBase.

31 MapReduce Filesystem Requirements MapReduce jobs run on data stored in files Support large files - Streaming reads - Mostly append to end (easier concurrency) Scalability - Add machines to scale Worker tasks use data on their local machine - Bandwidth is the bottleneck. Move code to data. Expect failures - Transparently handle failures as much as possible.

32 Hadoop Filesystem (HDFS) Supports huge data sets and large files - Gigabyte files, petabyte data sets - Supports tens of millions of files in a file system Files have write-once-read-many semantics - Clients can only append to existing files Designed to run on COTS hardware - Implemented in Java Timely detection and recovery from data node faults Batch processing rather than interactive user access

33 HDFS Clusters can be very large! HDFS at Facebook (May 10) 21 PB of storage in a single HDFS cluster containing 2000 machines machines with 8 cores machines with 16 cores 12 TB per machine (some have 24 TB) 32 GB of RAM per machine

34 HDFS Architecture [Figure: a client sends metadata ops to the Namenode, which holds the metadata (name, replicas, ...), and reads and writes blocks at the Datanodes; the Namenode sends block ops to the Datanodes, which replicate blocks across Rack1 and Rack2.] Image from

35 Typical Hadoop Cluster 40 nodes/rack, nodes in cluster 1 Gbps within a rack; 8 Gbps between racks Aggregation switch Rack switch Image from

36 The HDFS NameNode A single Namenode manages the file system metadata and regulates access to files by clients. FileName->[BlockIds] BlockIds->[replica locations] Controls replication of blocks to DataNodes - Listens for Heartbeats from DataNodes - Signals creating, opening, closing of blocks - Load balancing, rack-aware distribution It is a single-point of failure (as of May 12). Ongoing work on a master-slave failover model.

37 HDFS DataNodes DataNodes store a set of blocks. Files are split into one or more blocks (64MB default). NameNode sends instructions to DataNodes for block creation, deletion, and replication. Clients read/write data in blocks at DataNodes
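Block splitting and the NameNode's BlockId-to-replica mapping can be illustrated with a short sketch. The 64 MB block size comes from the slide; the replication factor of 3 and the round-robin placement are assumptions made for the example (real HDFS placement is rack-aware), and both function names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default from the slide


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks: int, datanodes, replication: int = 3):
    """Map each block id to the DataNodes holding a replica (round-robin).

    Assumes replication <= len(datanodes) so that replicas land on
    distinct nodes.
    """
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
locations = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), blocks[-1] // (1024 * 1024))  # 4 blocks, the last one 8 MB
print(locations[2])  # ['dn3', 'dn4', 'dn1']
```

Only the last block is smaller than the block size, which is why large files suit this layout better than many small ones.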

38 HDFS Client Read and write HDFS data Create checksum files for files in HDFS. - Recompute the checksum when a file has been read. Caches file data to reduce load on NameNode
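The checksum step can be sketched with CRC32 from Python's standard `zlib` module. Checksumming a whole buffer at once and the two helper names are assumptions for illustration; the slide does not say which checksum or granularity the HDFS client uses.

```python
import zlib


def write_with_checksum(data: bytes):
    """Return the data together with a CRC32 checksum computed at write time."""
    return data, zlib.crc32(data)


def read_verified(data: bytes, stored_checksum: int) -> bytes:
    """Recompute the checksum on read and reject corrupted data."""
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch: block is corrupt")
    return data


block, crc = write_with_checksum(b"hello hdfs")
assert read_verified(block, crc) == b"hello hdfs"

try:
    read_verified(b"hellO hdfs", crc)  # one corrupted byte
    corrupted_detected = False
except IOError:
    corrupted_detected = True
print(corrupted_detected)  # True
```

When a mismatch is detected, a client can fetch the same block from another replica instead of returning bad data.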

39 Related Filesystems Google File System (GFS) - Kosmos File System (KFS) - Open source, Scales better than HDFS C++ implementation Master polls data nodes Supports writing to multiple arbitrary positions in files and file appending

40 Introduction to Cloud Computing

41 Democratization of Large-Scale Computing Cloud computing is the delivery of hosting services that are provided to a client over the Internet. - Enable large-scale services without up-front investment. New programming tools, databases and systems have enabled the low-cost construction of large-scale services.

42 NIST Definition of Cloud Computing "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."

43 Supporting Technologies Enormous computer data-centres containing commodity hardware. Virtualization of computation, storage, and communication. - Turn hardware and networking into software! Achieve economies of scale. - Reduce costs of electricity, bandwidth, hardware, software and use low-cost locations. - Lower-cost than provisioning own hardware. NoSQL datastores have enabled storage scalability to much higher levels than relational databases.

44 Cloud Computing Essentials Cloud computing is Utility Computing - Cloud services are controlled and monitored by the cloud provider typically through a pay-per-use business model. An ideal cloud computing platform is: - efficient in its use of resources - scalable elastic - self-managing - highly available and accessible - inter-operable and portable

45 Cloud Properties Resource efficiency: computing and network resources are pooled to provide services to multiple users. Resource allocation is dynamically adapted according to user demand. Elasticity: computing resources can be rapidly and elastically provisioned to scale up, and released to scale down based on consumer's demand.

46 Cloud Properties Self-managing services: a consumer can provision cloud services, such as web applications, server time, processing, storage and network as needed and automatically without requiring human interaction with each service's provider. Accessible and highly available: cloud resources are available over the network anytime and anywhere and are accessed through standard mechanisms that promote use by different types of platform (e.g., mobile phones, laptops, and PDAs).

47 IaaS, PaaS and SaaS Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS) [Figure: IaaS offers Infrastructure (servers, storage, network); PaaS adds a Platform (OS & application stack) on top of the infrastructure; SaaS adds Applications (packaged software) on top of the platform and infrastructure.]

48 Spectrum of Cloud Users Image credit:

49 Infrastructure as a Service (IaaS) Virtualization - Virtualization is the abstraction of logical resources away from underlying physical resources. A hypervisor virtualizes a platform's operating system. The hypervisor manages operating systems as virtual machines (VMs), enabling multiple operating systems to share the same physical hardware. [Figure: VM1, VM2, VM3]

50 Virtualizing the Network and Storage

51 KVM (Kernel-based Virtual Machine) VMware and Xen are the best-known virtualization platforms. KVM (Kernel-based Virtual Machine) is an open-source virtualization platform - Linux host OS - Run multiple virtual machines (Windows, Mac, etc.) on your Linux box - IO is virtualized using a device model in KVM KVM requires a modified QEMU (open-source processor emulator) for its IO virtualization framework.

52 Virtualization using KVM in Linux KVM is a loadable kernel module - kvm.ko provides the core virtualization infrastructure - kvm-intel.ko / kvm-amd.ko processor specific modules

53 IO Device Model in KVM Original approach with full-virtualization - Guest hardware accesses are intercepted by KVM - QEMU emulates hardware behavior of common devices Video cards PCI Input devices (mouse, keyboard) NICs

54 IaaS is Not Enough IaaS provides virtual machines, but it cannot provide elastic computing by itself, where services scale up and down to meet user demand. - Dynamic provisioning Existing IaaS offerings do not provide support for sharing middleware platforms among different VMs - Multi-tenancy

55 From IaaS to PaaS [Figure: three stacks compared. Traditional Stack: Applications, Data, Runtime, Middleware, OS, Servers, Storage, Networking. IaaS and PaaS: Applications, Data, Runtime, Middleware, OS, Virtualization, Servers, Storage, Networking. Labels mark which layers you manage and which layers the provider manages.]

56 PaaS Platform as a Service (PaaS) is a computing platform that abstracts the infrastructure, OS, and middleware to drive developer productivity. PaaS leverages dynamic provisioning PaaS also leverages multi-tenancy

57 Under-Provisioning In traditional computing, underestimating system utilization results first in lost revenue, then in lost customers. [Figure: three plots of resources against time (days), each showing capacity and demand; panels labelled Lost Revenue and Lost Users.]

58 Over-Provisioning Overestimating system utilization results in higher than necessary infrastructure costs. [Figure: resources against time, with demand below a fixed capacity line; the gap is labelled unused resources.] Dynamically provisioning resources solves the under-/over-provisioning problem.
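The cost of a fixed capacity line can be made concrete with a small calculation. This is a sketch with invented demand figures; `provisioning_gap` is a hypothetical helper, and "dynamic" here simply means capacity that tracks demand with a little headroom.

```python
def provisioning_gap(demand, capacity):
    """Return (unused, unmet) resource totals for per-period demand and capacity."""
    unused = sum(max(c - d, 0) for d, c in zip(demand, capacity))
    unmet = sum(max(d - c, 0) for d, c in zip(demand, capacity))
    return unused, unmet


demand = [20, 50, 90, 40]          # hypothetical load per day
fixed_low = [40] * len(demand)     # under-provisioned
fixed_high = [100] * len(demand)   # over-provisioned
dynamic = [d + 5 for d in demand]  # capacity tracks demand with a small headroom

print(provisioning_gap(demand, fixed_low))   # (20, 60)
print(provisioning_gap(demand, fixed_high))  # (200, 0)
print(provisioning_gap(demand, dynamic))     # (20, 0)
```

Unmet demand corresponds to the lost revenue and lost users of under-provisioning, and unused capacity to the wasted spend of over-provisioning.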

59 Dynamic Provisioning Cloud computing enables server computing instances to be provisioned or deployed from an administrative console or client application by the server administrator, network administrator, or any other enabled user. Self-managing systems perform dynamic provisioning on behalf of a user or administrator in order to ensure quality of service (QoS) contracts are not broken and/or to meet some policy objectives.

60 How do we reuse middleware services? Imagine a single physical machine that is currently running 10 virtual machines (VMs), where each VM is running 5 active Java programs. Assuming no virtualized application server, how many JVM processes are running on the physical machine?
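The question on this slide is a one-line multiplication, assuming each Java program runs in its own JVM process. The variable names are illustrative.

```python
vms_per_host = 10
java_programs_per_vm = 5

# Without a shared (multi-tenant) runtime, each Java program is its own JVM process.
jvm_processes = vms_per_host * java_programs_per_vm
print(jvm_processes)  # 50
```

Each of those 50 processes carries its own copy of the runtime, which is the overhead that sharing middleware among tenants aims to remove.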

61 Multi-Tenancy Multi-tenancy is where a single instance of the software runs on a server, serving multiple clients. - Think multiple users in a MySQL database - Java 8 will support multi-tenancy (many java programs running in the same JVM) The software should be able to provide a single service to all customers by setting configurations - More efficient use of server resources

62 PaaS

63 Software as a Service (SaaS)

64 Deployment Model There are four primary cloud deployment models : - Public Cloud - Private Cloud - Community Cloud - Hybrid Cloud

65 Public Clouds Public clouds are owned by cloud service providers who charge for the use of cloud resources. Basic characteristics: - Homogeneous infrastructure, Common policies - Shared resources and multi-tenancy - Leased or rented infrastructure - Economies of scale EC2 (Amazon) Elastic Compute Cloud. General purpose computing. Azure (Microsoft) General purpose computing on a Microsoft platform. AppEngine (Google) Build scalable web applications fast. Not general purpose.

66 Amazon, Microsoft, Google Cloud Offerings
                       | Amazon Web Services                            | Microsoft Azure                                                            | Google AppEngine
Computation model (VM) | x86 ISA via Xen VM                             | Microsoft CLR VM                                                           | Predefined application structure and framework
Storage model          | SimpleDB, S3                                   | SQL Data Services                                                          | MegaStore/BigTable
Networking model       | Declarative specification of IP level topology | Automatic based on programmer's declarative descriptions of app components | Fixed topology to accommodate 3-tier Web app structure

67 Private Clouds The cloud infrastructure belongs to and is operated by only one organization. Basic characteristics : - Heterogeneous infrastructure; Customized policies - Dedicated resources - In-house infrastructure; End-to-end control Examples include:

68 Public vs. Private
               | Public Cloud           | Private Cloud
Infrastructure | Homogeneous            | Heterogeneous
Policy Model   | Common defined         | Customized & Tailored
Resource Model | Shared & Multi-tenant  | Dedicated
Cost Model     | Operational expenditure | Capital expenditure
Economy Model  | Large economy of scale | End-to-end control

69 Other types of Clouds Community cloud - The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). Hybrid cloud - The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability.

70 Obstacles To Cloud Computing Data Lock-in No standardized APIs. - OpenStack API Data Confidentiality/Auditability Data transfer bottlenecks/costs Performance unpredictability for systems apps

71 Cloud Computing Summary

72 Challenges for Cloud computing Will all data migrate to the cloud? - The post-pc era. By 2015 some 6.3 exabytes of mobile data will be flowing each month*. How much will end up in the cloud? What will we do with all the new data generated by the Internet of Things and DNA sequencing machines? How will we manage security, ownership and migration of data stored in the cloud? *

73 References NIST (National Institute of Standards and Technology). Dean et al., MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51(1). Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing.

74 References Cloud Computing: Principles and Paradigms, R. Buyya et al. (eds.), Wiley. Cloud Computing: Principles, Systems and Applications, L. Gillam et al. (eds.), Springer. Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004. Sanjay Ghemawat et al.: The Google File System, SIGOPS Operating Systems Review 37(5), 2003. M. Isard et al.: Dryad: Distributed Data-parallel Programs from Sequential Building Blocks, in EuroSys 2007.

Intro to Cloud Computing. ID2210 Jim Dowling

Intro to Cloud Computing. ID2210 Jim Dowling Intro to Cloud Computing ID2210 Jim Dowling What Cloud Computing is Not CLOUDS OF CONFUSION AS GALWAY COUNCILLOR TELLS ANOTHER TO GO F..K HIMSELF A GALWAY councillor has refused to apologise for swearing

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Sriram Krishnan, Ph.D. sriram@sdsc.edu

Sriram Krishnan, Ph.D. sriram@sdsc.edu Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Discovery 2015: Cloud Computing Workshop June 20-24, 2011 Berkeley, CA Introduction to Cloud Computing Keith R. Jackson Lawrence Berkeley National Lab What is it? NIST Definition Cloud computing is a model

More information

Apache Hadoop FileSystem and its Usage in Facebook

Apache Hadoop FileSystem and its Usage in Facebook Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture

More information

Design and Evolution of the Apache Hadoop File System(HDFS)

Design and Evolution of the Apache Hadoop File System(HDFS) Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

Cloud Computing using MapReduce, Hadoop, Spark

Cloud Computing using MapReduce, Hadoop, Spark Cloud Computing using MapReduce, Hadoop, Spark Benjamin Hindman benh@cs.berkeley.edu Why this talk? At some point, you ll have enough data to run your parallel algorithms on multiple computers SPMD (e.g.,

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL

More information

Hadoop & its Usage at Facebook

Hadoop & its Usage at Facebook Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

NoSQL Data Base Basics

NoSQL Data Base Basics NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming Distributed and Parallel Technology Datacenter and Warehouse Computing Hans-Wolfgang Loidl School of Mathematical and Computer Sciences Heriot-Watt University, Edinburgh 0 Based on earlier versions by

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

BBM467 Data Intensive Applications

BBM467 Data Intensive Applications Hacettepe Üniversitesi Bilgisayar Mühendisliği Bölümü BBM467 Data Intensive Applications Dr. Fuat Akal akal@hacettepe.edu.tr Problem How do you scale up applications? Run jobs processing 100's of terabytes

24/11/14. During this course. Internet is everywhere. Frequency barrier hit. Management costs increase. Advanced Distributed Systems Cloud Computing

24/11/14. During this course. Internet is everywhere. Frequency barrier hit. Management costs increase. Advanced Distributed Systems Cloud Computing Advanced Distributed Systems Cristian Klein Department of Computing Science Umeå University During this course Trends in IT Towards a new data center What is Cloud computing? Types of Clouds Making applications

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi Cloud Platforms, Challenges & Hadoop Aditee Rele Karpagam Venkataraman Janani Ravi Cloud Platform Models Aditee Rele Microsoft Corporation Dec 8, 2010 IT CAPACITY Provisioning IT Capacity Under-supply

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google's distributed

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technology, Solapur, Maharashtra,

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1. Then we can add more nodes,

Cloud Computing Training

Cloud Computing Training Cloud Computing Training TechAge Labs Pvt. Ltd. Address : C-46, GF, Sector 2, Noida Phone 1 : 0120-4540894 Phone 2 : 0120-6495333 TechAge Labs 2014 version 1.0 Cloud Computing Training Cloud Computing

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc.

HDFS Under the Hood. Sanjay Radia. Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. HDFS Under the Hood Sanjay Radia Sradia@yahoo-inc.com Grid Computing, Hadoop Yahoo Inc. 1 Outline Overview of Hadoop, an open source project Design of HDFS On going work 2 Hadoop Hadoop provides a framework

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction

Processing LARGE data sets

Processing LARGE data sets Framework for reliable, scalable, distributed computation of large data sets

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O'Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2 1 PDA College of Engineering, Gulbarga, Karnataka, India rlrooparl@gmail.com 2 PDA College of Engineering, Gulbarga, Karnataka,

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat © 2012 by Intelsat. Published by The Aerospace Corporation with permission. Motivation for the

Fault Tolerance in Hadoop for Work Migration

Fault Tolerance in Hadoop for Work Migration 1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts Part V Applications Cloud Computing: General concepts Copyright K.Goseva 2010 CS 736 Software Performance Engineering Slide 1 What is cloud computing? SaaS: Software as a Service Cloud: Datacenters hardware

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2 CPU Memory Machine Learning, Statistics Classical Data Mining Disk 3 20+ billion web pages x 20KB = 400+ TB

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and MinuteSort at Yahoo on Hadoop 0.23 Thomas Graves Yahoo! May, 2013 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October 2013 10:00 Sesión B - DB2 LUW Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software 22 nd October 2013 10:00 Sesión B - DB2 LUW 1 Agenda Big Data The Technical Challenges Architecture of Hadoop

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Sector vs. Hadoop. A Brief Comparison Between the Two Systems Sector vs. Hadoop A Brief Comparison Between the Two Systems Background Sector is a relatively new system that is broadly comparable to Hadoop, and people want to know what are the differences. Is Sector

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper November 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

MapReduce with Apache Hadoop Analysing Big Data

MapReduce with Apache Hadoop Analysing Big Data MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com About Journey Dynamics Founded in 2006 to develop software technology to address the issues

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! aanand@yahoo-inc.com Usenix 2008

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! aanand@yahoo-inc.com Usenix 2008 Using Hadoop for Webscale Computing Ajay Anand Yahoo! aanand@yahoo-inc.com Agenda The Problem Solution Approach / Introduction to Hadoop HDFS File System Map Reduce Programming Pig Hadoop implementation

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics

EXPERIMENTATION. HARRISON CARRANZA School of Computer Science and Mathematics BIG DATA WITH HADOOP EXPERIMENTATION HARRISON CARRANZA Marist College APARICIO CARRANZA NYC College of Technology CUNY ECC Conference 2016 Poughkeepsie, NY, June 12-14, 2016 Marist College AGENDA Contents

Scalable Cloud Computing

Scalable Cloud Computing Keijo Heljanko Department of Information and Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 1/44 Business Drivers of Cloud Computing Large data centers allow for economics

Data Centers and Cloud Computing

Data Centers and Cloud Computing Data Centers and Cloud Computing CS377 Guest Lecture Tian Guo 1 Data Centers and Cloud Computing Intro. to Data centers Virtualization Basics Intro. to Cloud Computing Case Study: Amazon EC2 2 Data Centers

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Lecture 4 Introduction to Hadoop & GAE Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline Introduction to Hadoop The Hadoop ecosystem Related projects

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

Cloud Courses Description

Cloud Courses Description Courses Description 101: Fundamental Computing and Architecture Computing Concepts and Models. Data center architecture. Fundamental Architecture. Virtualization Basics. platforms: IaaS, PaaS, SaaS. deployment

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks Hadoop Introduction Olivier Renault Solution Engineer - Hortonworks Hortonworks A Brief History of Apache Hadoop Apache Project Established Yahoo! begins to Operate at scale Hortonworks Data Platform 2013

Hadoop Distributed File System. Jordan Prosch, Matt Kipps

Hadoop Distributed File System. Jordan Prosch, Matt Kipps Hadoop Distributed File System Jordan Prosch, Matt Kipps Outline - Background - Architecture - Comments & Suggestions Background What is HDFS? Part of Apache Hadoop - distributed storage What is Hadoop?

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop's infancy Focussed

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Lecture 1 Introduction to Cloud Computing Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu Outline What is cloud computing? How it evolves? What are the

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA TRENDS AND TECHNOLOGIES BIG DATA TRENDS AND TECHNOLOGIES THE WORLD OF DATA IS CHANGING Cloud WHAT IS BIG DATA? Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools.

CLOUD COMPUTING USING HADOOP TECHNOLOGY

CLOUD COMPUTING USING HADOOP TECHNOLOGY CLOUD COMPUTING USING HADOOP TECHNOLOGY DHIRAJLAL GANDHI COLLEGE OF TECHNOLOGY SALEM B.NARENDRA PRASATH S.PRAVEEN KUMAR 3 rd year CSE Department, 3 rd year CSE Department, Email:narendren.jbk@gmail.com

How To Understand Cloud Computing

How To Understand Cloud Computing Overview of Cloud Computing (ENCS 691K Chapter 1) Roch Glitho, PhD Associate Professor and Canada Research Chair My URL - http://users.encs.concordia.ca/~glitho/ Overview of Cloud Computing Towards a definition

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware. Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l'utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software

MapReduce. Introduction and Hadoop Overview. 13 June 2012. Lab Course: Databases & Cloud Computing SS 2012

MapReduce. Introduction and Hadoop Overview. 13 June 2012. Lab Course: Databases & Cloud Computing SS 2012 13 June 2012 MapReduce Introduction and Hadoop Overview Lab Course: Databases & Cloud Computing SS 2012 Martin Przyjaciel-Zablocki Alexander Schätzle Georg Lausen University of Freiburg Databases & Information

Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases

Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases Big Data Analytics: Hadoop-Map Reduce & NoSQL Databases Abinav Pothuganti Computer Science and Engineering, CBIT,Hyderabad, Telangana, India Abstract Today, we are surrounded by data like oxygen. The exponential

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

Big Data Technology Core Hadoop: HDFS-YARN Internals

Big Data Technology Core Hadoop: HDFS-YARN Internals Big Data Technology Core Hadoop: HDFS-YARN Internals Eshcar Hillel Yahoo! Ronny Lempel Outbrain *Based on slides by Edward Bortnikov & Ronny Lempel Roadmap Previous class Map-Reduce Motivation This class

Contents. 1. Introduction

Contents. 1. Introduction Summary Cloud computing has become one of the key words in the IT industry. The cloud represents the internet or an infrastructure for the communication between all components, providing and receiving

Viswanath Nandigam Sriram Krishnan Chaitan Baru

Viswanath Nandigam Sriram Krishnan Chaitan Baru Viswanath Nandigam Sriram Krishnan Chaitan Baru Traditional Database Implementations for large-scale spatial data Data Partitioning Spatial Extensions Pros and Cons Cloud Computing Introduction Relevance

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big
