MapReduce and Intro to Cloud Computing. ID2210: Lecture 11 Jim Dowling
1 MapReduce and Intro to Cloud Computing ID2210: Lecture 11 Jim Dowling
2 Large-Scale Distributed Computing In #Nodes - BitTorrent (millions) - Peer-to-Peer In #Instructions/sec - Teraflops, Petaflops, Exascale - Super-Computing In #Bytes stored - Facebook: 30+ Petabytes (July 11)* - NoSQL storages In #Bytes processed/time - Google processes 24 petabytes of data per day -? * moves_30_petabyte_hadoop_cluster_to_new_data_center
3 Big Data The total amount of digital storage worldwide is approaching 1 zettabyte, or 1 million times the contents of the Earth's largest library. Currently that information is archived on equipment with a mass equivalent to 20 percent of Manhattan. Global data storage is expected to reach 35 zettabytes by 2020. (The Boston Globe, Editorial & Opinion, September 7, 2010)
4 Programming Large-Scale With thousands of servers available within a data centre, how do we: - write applications for them? - allocate and manage resources? Applications should also be scalable, reliable, and highly available. - Failures are expected with thousands of machines. - Need for load-balancing, handling heterogeneity.
5 Commodity Computing Challenges Cheap nodes fail, especially if you have many - Mean time between failures for 1 node = 3 years - Mean time between failures for 1000 nodes = 1 day - Solution: Build fault-tolerance into system Commodity networks have low(ish) bandwidth - Scan 100TB Datasets on 1000 node cluster with remote 10MB/s = 165 mins - Solution: Push computation to the data Programming distributed systems is hard - Solution: Provide a simple to use data-parallel programming model that distributes work and handles faults.
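The two figures on this slide can be checked with a little arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes node failures are independent, that the data set is spread evenly over the nodes, and that units are decimal (1 TB = 1,000,000 MB). The function names are made up for illustration.

```python
def cluster_mtbf_days(node_mtbf_years: float, nodes: int) -> float:
    """Mean time between failures for the whole cluster, in days.

    Assumes independent failures, so the cluster-wide failure rate is
    the per-node failure rate multiplied by the number of nodes.
    """
    return node_mtbf_years * 365 / nodes


def scan_minutes(dataset_tb: float, nodes: int, mb_per_sec: float) -> float:
    """Minutes for every node to read its share of the data set in parallel."""
    mb_per_node = dataset_tb * 1_000_000 / nodes  # decimal units assumed
    return mb_per_node / mb_per_sec / 60


# Roughly one failure per day for 1000 nodes with a 3-year node MTBF.
print(cluster_mtbf_days(3, 1000))
# Close to the 165 minutes quoted on the slide.
print(scan_minutes(100, 1000, 10))
```

Both results land near the slide's numbers, which is the point: at this scale failures are a daily event and remote reads at 10 MB/s take hours.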
6 Typical Large-Scale Programming Problem Iterate over a large number of records. Extract something of interest from each (Map). Shuffle and sort intermediate results. Aggregate intermediate results (Reduce). Generate final output. Key idea: provide a functional abstraction for these two operations.
7 Programming in the Large Imagine we have hundreds of thousands of documents structured as follows: { "type": "blog", "id": "564", "tags": ["hdfs", "mysql", "cats"], "content": "<div>...</div>", "mentions": [ { "google": 6, "apple": 11, "microsoft": 1 } ] }
8 Extract all the mentions of Google
9 Count Mentions in each Blog, then Sum Up
10 Pseudocode for Map Phase Extract the blog_id and the number of mentions of google from each document: def mapper(docs): foreach (blog in docs.blogs): output(blog_id, mentions_google);
11 Pseudocode for Reduce Phases Sum up all the google mentions from the same blog_id: def reducer(blog_id, mentions_google): output(agg_id, sum(blog_id, mentions_google)); Sum up all the google mentions from all blogs: def reducer(agg_id, mentions_google): output(count, sum(agg_id, mentions_google));
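The map and reduce pseudocode can be made runnable on a single machine. The following is a minimal sketch, not how a MapReduce framework is implemented: the sample documents are invented, the `shuffle` helper stands in for the grouping the framework would do between the phases, and the aggregate key `"all"` is a placeholder for `agg_id`.

```python
from collections import defaultdict

# Hypothetical sample documents following the blog structure shown earlier.
docs = [
    {"type": "blog", "id": "564", "mentions": [{"google": 6, "apple": 11}]},
    {"type": "blog", "id": "564", "mentions": [{"google": 2}]},
    {"type": "blog", "id": "777", "mentions": [{"google": 5, "microsoft": 1}]},
]


def mapper(doc):
    """Emit (blog_id, google_mentions) pairs for one document."""
    for mention in doc["mentions"]:
        yield doc["id"], mention.get("google", 0)


def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Sum all values that share a key."""
    return key, sum(values)


# First reduce phase: total mentions per blog_id.
intermediate = [pair for doc in docs for pair in mapper(doc)]
per_blog = dict(reducer(k, v) for k, v in shuffle(intermediate).items())

# Second reduce phase: put every per-blog total under one key and sum again.
_, total = reducer("all", list(per_blog.values()))

print(per_blog)
print(total)
```

The same `reducer` serves both phases because summation is all either phase needs; only the key changes.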
12 Google's MapReduce Programming Model * Slide taken from tutorial by Jerry Zhao and Jelena Pjesivac-Grovic (Google Inc.): MapReduce: The Programming Model and Practice. Tutorial held at SIGMETRICS 2009.
13 MapReduce Programming Model Input data type: key-value records. Map function: (K_in, V_in) → list(K_inter, V_inter). Reduce function: (K_inter, list(V_inter)) → list(K_out, V_out).
14 MapReduce Programming Model map: takes input records one by one (key, value); processes records independently; outputs 1..n intermediate (key, value) pairs per input record. reduce: takes intermediate results, grouped by key (key, value[]); processes records group-wise; outputs one result per group, in any format.
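The map and reduce signatures can be written down as typed Python. This is a single-process sketch of the programming model only; it does no distribution or fault handling, and `map_reduce`, `word_map` and `word_reduce` are illustrative names rather than any framework's API. It also assumes intermediate keys are sortable, which is used to give a deterministic output order.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

KIn = TypeVar("KIn")
VIn = TypeVar("VIn")
KInter = TypeVar("KInter")
VInter = TypeVar("VInter")
KOut = TypeVar("KOut")
VOut = TypeVar("VOut")


def map_reduce(
    records: Iterable[Tuple[KIn, VIn]],
    map_fn: Callable[[KIn, VIn], Iterable[Tuple[KInter, VInter]]],
    reduce_fn: Callable[[KInter, List[VInter]], Iterable[Tuple[KOut, VOut]]],
) -> List[Tuple[KOut, VOut]]:
    # Map: every input record is processed independently of the others.
    groups: Dict[KInter, List[VInter]] = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)  # shuffle: group by key
    # Reduce: all values that share an intermediate key are processed together.
    output: List[Tuple[KOut, VOut]] = []
    for inter_key in sorted(groups):
        output.extend(reduce_fn(inter_key, groups[inter_key]))
    return output


def word_map(_filename, text):
    return [(word, 1) for word in text.split()]


def word_reduce(word, counts):
    return [(word, sum(counts))]


result = map_reduce([("a.txt", "to be or not to be")], word_map, word_reduce)
print(result)
```

Word count is used here only because it is the smallest job that exercises both functions.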
15 MapReduce Workflow [Figure: input files (partition1, partition2, partition3) are read by workers in the map phase, which write intermediate output; workers in the reduce phase read the intermediate output and write the output file(s).]
16 MapReduce Basics MapReduce is a programming model (and framework) that hides the complexity of work distribution and fault tolerance. Principal design philosophies: - Near-linear scalability for data sets - Low cost: reduce hardware, programming and admin costs. MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time.
17 MapReduce-Like Implementations
              Google MR     Hadoop                                      Dryad
Availability  Proprietary   Open Source                                 Proprietary
Used by       Google        Yahoo!, Facebook, Amazon (EC2!), Twitter    Microsoft
Native API    C++           Java                                        C++
18 Load Balancing, Failure, and Stragglers Load Balancing - Break a MapReduce job into small tasks - Schedule tasks on workers as they report idle status MapReduce functions are side-effect free - Enables failed (and partially completed) tasks to be re-executed without any problems (on a different machine) - When a worker fails, its tasks can be reallocated to other workers Identify and handle stragglers (slow workers) - Restart slow tasks on new workers - Stragglers appear with increasing probability as the number of workers increases
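The claim that stragglers become more likely with more workers follows from simple probability. The sketch below assumes, purely for illustration, that each worker is slow independently with the same probability `p_slow`; real clusters need not behave this way, and the 1% figure is made up.

```python
def p_any_straggler(workers: int, p_slow: float) -> float:
    """Probability that at least one of `workers` independent workers is slow."""
    return 1 - (1 - p_slow) ** workers


# With a 1% chance per worker, a straggler is unlikely among 10 workers
# but almost certain among 1000.
for n in (10, 100, 1000):
    print(n, p_any_straggler(n, 0.01))
```

Because a job finishes only when its slowest task does, this is why restarting slow tasks on other workers matters more as the cluster grows.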
19 Components in a Hadoop MR Workflow Next few slides are from:
20 Job Submission
21 Initialization
22 Scheduling
23 Execution
24 Pig Latin: a relational data-flow language MapReduce programs are quite low-level. A higher-level programming model was needed for processing semi-structured data sets using the MapReduce platform. Pig Latin is a procedural, relational data-flow language that is implemented using MapReduce. Pig Engine: parser, optimizer and distributed query execution.
25 Pig Example* Input: user profiles, page visits. Problem: find the top 5 most visited pages by users aged 18 to 25. Steps: Load Users; Load Pages; Filter by Age; Join on name; Group on url; Count clicks; Order by clicks; Take top 5. *Example taken from Yahoo Hadoop tutorial from Middleware
26 In MapReduce
27 In PigLatin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
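To see what each step of the Pig Latin script computes, the same data flow can be traced in plain Python on a toy input. This is only an in-memory illustration of the relational steps, not how Pig executes them; the user and page records are invented, and user names are assumed to be unique so that a set lookup can stand in for the join.

```python
from collections import Counter

# Hypothetical in-memory stand-ins for the 'users' and 'pages' inputs.
users = [("amy", 22), ("bob", 31), ("cat", 19)]
pages = [
    ("amy", "a.com"), ("cat", "a.com"), ("bob", "b.com"),
    ("amy", "c.com"), ("cat", "c.com"), ("cat", "c.com"),
]

# Filter: keep users aged 18 to 25.
filtered = {name for name, age in users if 18 <= age <= 25}

# Join on name, then group by url and count clicks.
clicks = Counter(url for user, url in pages if user in filtered)

# Order by clicks (descending, ties broken by url) and take the top 5.
top5 = sorted(clicks.items(), key=lambda item: (-item[1], item[0]))[:5]
print(top5)
```

Here `bob` is filtered out by age, so `b.com` never reaches the grouping step.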
28 Pig Latin Data Types Tuple: Ordered set of fields - Field can be simple or complex type - Nested relational model Bag: Collection of tuples. - Can contain duplicates Map: Set of (key, value) pairs Primitive types - int, long, float, double, chararray, bytearray
29 Pig Architecture
30 Hive and HBase Hive and Pig were parallel projects developed at Facebook and Yahoo, respectively. HiveQL is closer to SQL from traditional RDBMSs than Pig Latin (which is procedural). - Due to the limitations of MapReduce, and the fact that HiveQL is compiled into a MapReduce query plan, HiveQL is a cut-down version of SQL. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. A lot of Hive programs run over HBase.
31 MapReduce Filesystem Requirements MapReduce jobs run on data stored in files. Support large files - Streaming reads - Mostly append to end (easier concurrency) Scalability - Add machines to scale Worker tasks use data on their local machine - Bandwidth is the bottleneck. Move code to data. Expect failures - Transparently handle failures as much as possible.
32 Hadoop Filesystem (HDFS) Supports huge data sets and large files - Gigabyte files, petabyte data sets - Supports tens of millions of files in a file system Files have write-once-read-many semantics - Clients can only append to existing files Designed to run on COTS hardware - Implemented in Java Timely detection and recovery from data node faults Batch processing rather than interactive user access
33 HDFS Clusters can be very large! HDFS at Facebook (May 10) 21 PB of storage in a single HDFS cluster containing 2000 machines machines with 8 cores machines with 16 cores 12 TB per machine (some have 24 TB) 32 GB of RAM per machine
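34 HDFS Architecture [Figure: a client sends metadata operations to the Namenode, which holds metadata (name, replicas, ...) for paths such as /home/foo/data; the client reads and writes blocks at Datanodes; the Namenode issues block operations to Datanodes, and blocks are replicated between Datanodes in Rack1 and Rack2.] Image from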
35 Typical Hadoop Cluster 40 nodes/rack, nodes in cluster 1 Gbps within a rack; 8 Gbps between racks Aggregation switch Rack switch Image from
36 The HDFS NameNode A single NameNode manages the file system metadata and regulates access to files by clients. FileName->[BlockIds] BlockIds->[replica locations] Controls replication of blocks to DataNodes - Listens for Heartbeats from DataNodes - Signals creating, opening, closing of blocks - Load balancing, rack-aware distribution It is a single point of failure (as of May 12). Ongoing work on a master-slave failover model.
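The two mappings kept by the NameNode can be pictured as a pair of dictionaries. The sketch below is a toy model for intuition only; the block and DataNode names are invented, and `locate` and `under_replicated` are hypothetical helpers, not HDFS functions.

```python
from typing import Dict, List

# FileName -> [BlockIds]
file_to_blocks: Dict[str, List[str]] = {"/home/foo/data": ["blk_1", "blk_2"]}

# BlockId -> [replica locations]
block_to_replicas: Dict[str, List[str]] = {
    "blk_1": ["datanode-a", "datanode-b", "datanode-c"],
    "blk_2": ["datanode-b", "datanode-d"],
}


def locate(path: str) -> List[List[str]]:
    """For each block of the file, list the DataNodes holding a replica."""
    return [block_to_replicas[block] for block in file_to_blocks[path]]


def under_replicated(target: int) -> List[str]:
    """Blocks with fewer replicas than the target; candidates for re-replication."""
    return sorted(b for b, nodes in block_to_replicas.items() if len(nodes) < target)


print(locate("/home/foo/data"))
print(under_replicated(3))
```

A client would ask for the locations once and then talk to the DataNodes directly, which keeps bulk data traffic away from the single metadata server.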
37 HDFS DataNodes DataNodes store a set of blocks. Files are split into one or more blocks (64MB default). NameNode sends instructions to DataNodes for block creation, deletion, and replication. Clients read/write data in blocks at DataNodes.
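How a file maps onto blocks is a matter of integer division. The sketch below uses the 64 MB default quoted on the slide and assumes that the final block holds only the remaining bytes; the 200 MB file is a made-up example and `block_sizes` is an illustrative helper.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def block_sizes(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Sizes in bytes of the blocks a file of `file_size` bytes is split into."""
    full, rest = divmod(file_size, block_size)
    # Assumption: the last block only holds the leftover bytes.
    return [block_size] * full + ([rest] if rest else [])


sizes = block_sizes(200 * 1024 * 1024)  # a hypothetical 200 MB file
print(len(sizes))                        # number of blocks
print(sizes[-1] // (1024 * 1024))        # size of the last block in MB
```

Each of these blocks would then be replicated independently across DataNodes.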
38 HDFS Client Read and write HDFS data Create checksum files for files in HDFS. - Recompute the checksum when a file has been read. Caches file data to reduce load on NameNode
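The create-then-recompute checksum idea can be illustrated in a few lines. This is a sketch under assumptions: `zlib.crc32` from the Python standard library stands in for the checksum algorithm, and the 512-byte chunk size is chosen for the example; neither is taken from the slides.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum; assumed for this sketch


def checksums(data: bytes, chunk: int = CHUNK) -> list:
    """CRC32 of each fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data: bytes, expected: list) -> bool:
    """Recompute the checksums after a read and compare with the stored ones."""
    return checksums(data) == expected


original = b"x" * 1300
stored = checksums(original)  # created when the file is written

# Simulate a single corrupted byte in the second chunk.
corrupted = original[:700] + b"y" + original[701:]

print(verify(original, stored))
print(verify(corrupted, stored))
```

Checksumming per chunk rather than per file means a mismatch also tells the reader which part of the data is bad, so it can fetch that part from another replica.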
39 Related Filesystems Google File System (GFS). Kosmos File System (KFS): open source; scales better than HDFS; C++ implementation; master polls data nodes; supports writing to multiple arbitrary positions in files as well as file appending.
40 Introduction to Cloud Computing
41 Democratization of Large-Scale Computing Cloud computing is the delivery of hosting services that are provided to a client over the Internet. - Enable large-scale services without up-front investment. New programming tools, databases and systems have enabled the low-cost construction of large-scale services.
42 NIST Definition of Cloud Computing "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction."
43 Supporting Technologies Enormous computer data-centres containing commodity hardware. Virtualization of computation, storage, and communication. - Turn hardware and networking into software! Achieve economies of scale. - Reduce costs of electricity, bandwidth, hardware, software and use low-cost locations. - Lower-cost than provisioning own hardware. NoSQL datastores have enabled storage scalability to much higher levels than relational databases.
44 Cloud Computing Essentials Cloud computing is Utility Computing - Cloud services are controlled and monitored by the cloud provider, typically through a pay-per-use business model. An ideal cloud computing platform is: - efficient in its use of resources - scalable and elastic - self-managing - highly available and accessible - inter-operable and portable
45 Cloud Properties Resource efficiency: computing and network resources are pooled to provide services to multiple users. Resource allocation is dynamically adapted according to user demand. Elasticity: computing resources can be rapidly and elastically provisioned to scale up, and released to scale down, based on the consumer's demand.
46 Cloud Properties Self-managing services: a consumer can provision cloud services, such as web applications, server time, processing, storage and network, as needed and automatically, without requiring human interaction with each service's provider. Accessible and highly available: cloud resources are available over the network anytime and anywhere and are accessed through standard mechanisms that promote use by different types of platform (e.g., mobile phones, laptops, and PDAs).
47 IaaS, PaaS and SaaS [Figure: three stacks. Infrastructure as a Service (IaaS): Infrastructure (servers, storage, network). Platform as a Service (PaaS): Platform (OS & application stack) on top of Infrastructure. Software as a Service (SaaS): Applications (packaged software) on top of Platform and Infrastructure.]
48 Spectrum of Cloud Users Image credit:
49 Infrastructure as a Service (IaaS) Virtualization - Virtualization is the abstraction of logical resources away from underlying physical resources. A hypervisor virtualizes a platform's operating system. The hypervisor manages operating systems as virtual machines (VMs), enabling multiple operating systems to share the same physical hardware. [Figure: VM1, VM2, VM3]
50 Virtualizing the Network and Storage
51 KVM (Kernel-based Virtual Machine) VMware and Xen are the best-known virtualization platforms. KVM (Kernel-based Virtual Machine) is an open-source virtualization platform - Linux host OS. Run multiple virtual machines (Windows, Mac, etc.) on your Linux box - IO is virtualized using a device model in KVM. KVM requires a modified QEMU (open-source processor emulator) for its IO virtualization framework.
52 Virtualization using KVM in Linux KVM is a loadable kernel module - kvm.ko provides the core virtualization infrastructure - kvm-intel.ko / kvm-amd.ko processor specific modules
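Which processor-specific module applies depends on the CPU's hardware virtualization support, which Linux reports as the `vmx` (Intel VT-x) or `svm` (AMD-V) flag in `/proc/cpuinfo`. The sketch below parses text in that format; `kvm_module_for` is a hypothetical helper and the sample text is an invented excerpt, so treat it as an illustration rather than a supported detection method.

```python
def kvm_module_for(cpuinfo_text: str) -> str:
    """Pick the processor-specific KVM module from /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    if "vmx" in flags:  # Intel VT-x
        return "kvm-intel"
    if "svm" in flags:  # AMD-V
        return "kvm-amd"
    return ""  # no hardware virtualization flag found


# Hypothetical excerpt; on a real Linux host, read /proc/cpuinfo instead.
sample = "processor\t: 0\nflags\t\t: fpu vme sse2 vmx\n"
print(kvm_module_for(sample))
```

On a real host the input would come from `open("/proc/cpuinfo").read()`.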
53 IO Device Model in KVM Original approach with full-virtualization - Guest hardware accesses are intercepted by KVM - QEMU emulates hardware behavior of common devices Video cards PCI Input devices (mouse, keyboard) NICs
54 IaaS is Not Enough IaaS provides virtual machines, but it cannot provide elastic computing by itself, where services scale up and down to meet user demand. - Dynamic provisioning Existing IaaS offerings do not provide support for the sharing of middleware platforms among different VMs - Multi-tenancy
55 From IaaS to PaaS [Figure: three stacks compared. Traditional Stack: Applications, Data, Runtime, Middleware, OS, Servers, Storage, Networking, all under "You Manage". IaaS and PaaS: the same layers plus Virtualization, split between "You Manage" (upper layers) and "Provider Manages" (lower layers).]
56 PaaS Platform as a Service (PaaS) is a computing platform that abstracts the infrastructure, OS, and middleware to drive developer productivity. PaaS leverages dynamic provisioning. PaaS also leverages multi-tenancy.
57 Under-Provisioning In traditional computing, underestimating system utilization results first in lost revenue, then in lost customers. [Figure: three plots of resources against time (days), each comparing capacity with demand; demand above capacity leads first to lost revenue, then to lost users.]
58 Over-Provisioning Overestimating system utilization results in higher than necessary infrastructure costs. [Figure: resources against time; the gap between capacity and demand is unused resources.] Dynamically provisioning resources solves the under-/over-provisioning problem.
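The trade-off between under- and over-provisioning can be put into numbers with a toy demand curve. Everything in this sketch is hypothetical: the demand values, the unit (servers per hour) and the `static_cost` helper are chosen only to make the two failure modes visible.

```python
# Hypothetical demand, in servers needed per hour.
demand = [2, 2, 3, 5, 9, 12, 9, 4]


def static_cost(demand, capacity):
    """Return (unused, unmet) server-hours for a fixed capacity."""
    unused = sum(max(capacity - d, 0) for d in demand)
    unmet = sum(max(d - capacity, 0) for d in demand)
    return unused, unmet


# Under-provisioned: capacity 5 leaves demand unmet during the peak.
print(static_cost(demand, 5))

# Over-provisioned: capacity 12 meets the peak but idles most of the time.
print(static_cost(demand, 12))

# Dynamic provisioning, idealised: capacity follows demand hour by hour.
print([static_cost([d], d) for d in demand])
```

The idealised dynamic case has neither idle capacity nor unmet demand; real provisioning lags behind demand, so it only approximates this.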
59 Dynamic Provisioning Cloud computing enables server computing instances to be provisioned or deployed from an administrative console or client application by the server administrator, network administrator, or any other enabled user. Self-managing systems perform dynamic provisioning on behalf of a user or administrator in order to ensure quality of service (QoS) contracts are not broken and/or to meet some policy objectives.
60 How do we reuse middleware services? Imagine a single physical machine that is currently running 10 virtual machines (VMs), where each VM has 5 active Java programs. Assuming no virtualized application server, how many JVM processes are running on the physical machine?
61 Multi-Tenancy Multi-tenancy is where a single instance of the software runs on a server, serving multiple clients. - Think multiple users in a MySQL database - Java 8 will support multi-tenancy (many Java programs running in the same JVM) The software should be able to provide a single service to all customers by setting configurations - More efficient use of server resources
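The question has a short arithmetic answer, worked through below. The first figure follows from each Java program running in its own JVM process; the second is a hypothetical comparison that assumes one shared JVM per VM could host all of that VM's programs.

```python
vms = 10
programs_per_vm = 5

# Without a shared application server, each Java program is its own JVM process.
jvms_without_sharing = vms * programs_per_vm

# Hypothetical comparison: one shared JVM per VM hosting all five programs.
jvms_with_sharing = vms * 1

print(jvms_without_sharing)
print(jvms_with_sharing)
```

Every one of those processes carries its own copy of the runtime, which is the overhead that sharing middleware is meant to remove.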
62 PaaS
63 Software as a Service (SaaS)
64 Deployment Model There are four primary cloud deployment models : - Public Cloud - Private Cloud - Community Cloud - Hybrid Cloud
65 Public Clouds Public clouds are owned by cloud service providers who charge for the use of cloud resources. Basic characteristics: - Homogeneous infrastructure, Common policies - Shared resources and multi-tenancy - Leased or rented infrastructure - Economies of scale EC2 (Amazon) Elastic Compute Cloud. General purpose computing. Azure (Microsoft) General purpose computing on a Microsoft platform. AppEngine (Google) Build scalable web applications fast. Not general purpose.
66 Amazon, Microsoft, Google Cloud Offerings
Computation model (VM). Amazon Web Services: x86 ISA via Xen VM. Microsoft Azure: Microsoft CLR VM. Google AppEngine: predefined application structure and framework.
Storage model. Amazon Web Services: SimpleDB, S3. Microsoft Azure: SQL Data Services. Google AppEngine: MegaStore/BigTable.
Networking model. Amazon Web Services: declarative specification of IP-level topology. Microsoft Azure: automatic, based on the programmer's declarative descriptions of app components. Google AppEngine: fixed topology to accommodate 3-tier Web app structure.
67 Private Clouds The cloud infrastructure belongs to and is operated by only one organization. Basic characteristics : - Heterogeneous infrastructure; Customized policies - Dedicated resources - In-house infrastructure; End-to-end control Examples include:
68 Public vs. Private
                 Public Cloud             Private Cloud
Infrastructure   Homogeneous              Heterogeneous
Policy Model     Common defined           Customized & Tailored
Resource Model   Shared & Multi-tenant    Dedicated
Cost Model       Operational expenditure  Capital expenditure
Economy Model    Large economy of scale   End-to-end control
69 Other types of Clouds Community cloud - The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). Hybrid cloud - The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability.
70 Obstacles To Cloud Computing Data lock-in: no standardized APIs (OpenStack API). Data confidentiality/auditability. Data transfer bottlenecks/costs. Performance unpredictability for systems apps.
71 Cloud Computing Summary
72 Challenges for Cloud computing Will all data migrate to the cloud? - The post-PC era. By 2015 some 6.3 exabytes of mobile data will be flowing each month*. How much will end up in the cloud? What will we do with all the new data generated by the Internet of Things and DNA sequencing machines? How will we manage security, ownership and migration of data stored in the cloud? *
73 References NIST (National Institute of Standards and Technology). Dean et al., MapReduce: simplified data processing on large clusters, Comms of ACM, vol 51(1). Armbrust et al., Above the Clouds: A Berkeley View of Cloud Computing.
74 References Cloud Computing: Principles and Paradigms, R. Buyya et al. (eds.), Wiley. Cloud Computing: Principles, Systems and Applications, L. Gillam et al. (eds.), Springer. Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, in OSDI 2004. Sanjay Ghemawat et al.: The Google File System. SIGOPS Operating Systems Review 37(5), 2003. M. Isard et al.: Dryad: Distributed Data-parallel Programs from Sequential Building Blocks, in EuroSys 2007.
Intro to Cloud Computing. ID2210 Jim Dowling
Intro to Cloud Computing ID2210 Jim Dowling What Cloud Computing is Not CLOUDS OF CONFUSION AS GALWAY COUNCILLOR TELLS ANOTHER TO GO F..K HIMSELF A GALWAY councillor has refused to apologise for swearing
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationData-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationINTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
More informationHadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
More informationLecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationDistributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms
Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationSriram Krishnan, Ph.D. sriram@sdsc.edu
Sriram Krishnan, Ph.D. sriram@sdsc.edu (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce
More informationDistributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationIntroduction to Cloud Computing
Discovery 2015: Cloud Computing Workshop June 20-24, 2011 Berkeley, CA Introduction to Cloud Computing Keith R. Jackson Lawrence Berkeley National Lab What is it? NIST Definition Cloud computing is a model
More informationApache Hadoop FileSystem and its Usage in Facebook
Apache Hadoop FileSystem and its Usage in Facebook Dhruba Borthakur Project Lead, Apache Hadoop Distributed File System dhruba@apache.org Presented at Indian Institute of Technology November, 2010 http://www.facebook.com/hadoopfs
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationDesign and Evolution of the Apache Hadoop File System(HDFS)
Design and Evolution of the Apache Hadoop File System(HDFS) Dhruba Borthakur Engineer@Facebook Committer@Apache HDFS SDC, Sept 19 2011 Outline Introduction Yet another file-system, why? Goals of Hadoop
More informationTHE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
More informationCloud Computing using MapReduce, Hadoop, Spark
Cloud Computing using MapReduce, Hadoop, Spark Benjamin Hindman benh@cs.berkeley.edu Why this talk? At some point, you ll have enough data to run your parallel algorithms on multiple computers SPMD (e.g.,
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationWhere We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationNoSQL Data Base Basics
NoSQL Data Base Basics Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu HDFS
More informationWhat Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming