Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Similar documents

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Chase Wu New Jersey Ins0tute of Technology

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Accelerating and Simplifying Apache

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Hadoop. Sunday, November 25, 12

Application Development. A Paradigm Shift

E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Hadoop implementation of MapReduce computational model. Ján Vaňo

Apache Hadoop. Alexandru Costan

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Hadoop IST 734 SS CHUNG

How To Scale Out Of A Nosql Database

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Large scale processing using Hadoop. Ján Vaňo

Big Data and Apache Hadoop s MapReduce

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Hadoop Ecosystem B Y R A H I M A.

BIG DATA TRENDS AND TECHNOLOGIES

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Open source Google-style large scale data analysis with Hadoop

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

THE HADOOP DISTRIBUTED FILE SYSTEM

Big Data Management and NoSQL Databases

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

CSE-E5430 Scalable Cloud Computing Lecture 2

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

BBM467 Data Intensive ApplicaAons

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Introduction to Hadoop

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Lecture 2 (08/31, 09/02, 09/09): Hadoop. Decisions, Operations & Information Technologies Robert H. Smith School of Business Fall, 2015

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Big Data, Why All the Buzz? (Abridged) Anita Luthra, February 20, 2014

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Distributed File Systems

Distributed File Systems

Introduction to Hadoop

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

HDFS. Hadoop Distributed File System

Introduction to HDFS. Prasanth Kothuri, CERN

MapReduce with Apache Hadoop Analysing Big Data

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Snapshots in Hadoop Distributed File System

A Survey on Cloud Storage Systems

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Big Data With Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Open source large scale distributed data management with Google s MapReduce and Bigtable

HDFS Architecture Guide

Load Rebalancing for File System in Public Cloud Roopa R.L 1, Jyothi Patil 2

Apache Hadoop: Past, Present, and Future

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Distributed File Systems

Introduction to HDFS. Prasanth Kothuri, CERN

Hadoop Distributed File System (HDFS)

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

HDFS Space Consolidation

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Federated Cloud-based Big Data Platform in Telecommunications

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Architecture. Part 1

Introduction to Cloud Computing

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Application Programming

HDFS Under the Hood. Sanjay Radia. Grid Computing, Hadoop Yahoo Inc.

A Brief Outline on Bigdata Hadoop

Apache Hadoop FileSystem and its Usage in Facebook

Hadoop & its Usage at Facebook

Apriori-Map/Reduce Algorithm

The Google File System

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

BIG DATA What it is and how to use?

Transcription:

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul et. Al. Demetris Zeinalipour http://www.cs.ucy.ac.cy/~dzeina/courses/epl646 14-1

Lecture Outline Introduction to Cloud Computing Typical Datacenters, Cloud Stack and Buzzwords, Public / Private Clouds, Utility Computing, Killer Apps, Economic Model Distributed System Basics I/O Performance Replication Strategies The Hadoop Project (Core, HDFS, Map-Reduce, HBase, HIVE) The Hadoop Distributed File System (HDFS) HDFS vs. NFS (Network File System) HDFS Example Deployments (Yahoo, Facebook) 14-2

Cloud Computing αγαθά Google's Datacenter in Oregon Microsoft Azure in Chicago 14-3

Cloud Computing (Datacenters) Datacenter 14-4

Cloud Computing (Cloud Stack) 14-5

Cloud Computing (*aas Buzz Word!) 14-6

Cloud Computing (Public vs. Private) Based on: Above the Clouds: A Berkeley View of Cloud Computing, RAD Lab, UC Berkeley 14-7

Cloud Computing (Utility Computing Yπ. Ωφελείας) NewSQL-as-a-Service To Amazon RDS* (Relational Database Service) 963$ / year 27,165 $ / year (*essentially MySQL running on Amazon EC2 Elastic Computing Cloud) 14-8

Cloud Computing (Utility Computing Yπ. Ωφελείας) 14-9

Cloud Computing (Virtualization - Tεχνολογία εικονικών συστημάτων) 14-10

Cloud Computing (Economic Model) 14-11

Cloud Computing (Killer Apps: OLTP/OLAP) 14-12

Cloud Computing (Killer Apps: escience) 14-13

Our IaaS Private Cloud 14-14

Distributed Systems Basics (I/O Performance) Each Disk: 2TB 7,200 rpm 6Gb SAS (Serial Attach. SCSI) 3.5" HDD In RAID-5 configuration [10Gbps FCoE also available] 14-15

Distributed Systems Basics (I/O Performance) (throughput) 14-16

Distributed Systems Basics (I/O Performance) (Same as Parallel but more scalable) 14-17

Distributed Systems Basics (I/O Performance) 14-18

Distributed Systems Basics (Replication Strategies) NOSQL DBMSs SQL RDBMSs 14-19

Distributed Systems Basics (Replication Strategies) (Recall conflict resolution in CouchDB) 14-20

Terminology (MR => HADOOP => HBASE) Map-Reduce: a programming model for processing large data sets. Invented by Google! "MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation,San Francisco, CA, December, 2004." Can be implemented in any language (recall javascript Map- Reduce we used in the context of CouchDB). Hadoop: Apache's open-source software framework that supports data-intensive distributed applications Derived from Google's MapReduce + Google File System (GFS) papers. Enables applications to work with thousands of computationindependent computers and petabytes of data. Download: http://hadoop.apache.org/ 14-22

Terminology (MR => HADOOP => HBASE) Hadoop Project Modules: Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS ): A distributed file system that provides highthroughput access to application data. Hadoop YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management. Hadoop MapReduce (MapReduce v2.0): A YARN-based system for parallel processing of large data sets. (Next Lectures) Other Hadoop-related projects at Apache include: Avro : A data serialization system. Cassandra : A scalable multi-master database with no single points of failure. Chukwa : A data collection system for managing large distributed systems. HBase (Hadoop Database): A scalable, distributed database that supports structured data storage for large tables. (Next Lectures) Hive : A data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout : A Scalable machine learning and data mining library. Pig : A high-level data-flow language and execution framework for parallel computation. (Next Lectures) ZooKeeper : A high-performance coordination service for distributed applications. 14-23

Large-Scale File Systems (GFS => HDFS) Problem: if nodes can fail, how can we store data persistently? Answer: Distributed File System (global file namespace) The Google File System, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003. 14-24

Network File Systems (NFS=>GFS=>HDFS) UNIX NFS (Network File System): nfsd (deamon) mounts remote folders to a UNIX host (/etc/fstab). 14-25

NFS uses a Client/Server Architecture that is a single point of failure by default. Network File Systems (NFS=>GFS=>HDFS) 14-26

Large-Scale File Systems (Hadoop File System - HDFS) Chuck (Block) Size: 4KB-32KB NFS File Size Limit = 2GB Default Chunk Size: 64 MB Unlimited File Size (21PB by Facebook) 3x replicas per chunk 14-27

Large-Scale File Systems (Hadoop File System - HDFS) 2010 2010 HDFS scalability: the limits to growth http://static.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf 14-29

Large-Scale File Systems (Hadoop File System - HDFS) Namespace lookup are fast (1 Master enough!) [ 1GB Metadata = 1PB Data ] In NFS Metadata + Transfers going through same server => Not Scalable HDFS designed for unreliable hardware (2-3 failures / 1000 nodes / day) New Hardware: 3x more unreliable!!! 14-30

Large-Scale File Systems (Hadoop File System - HDFS) 14-31

User Interface API Java API C language wrapper (libhdfs) for the Java API is also avaiable POSIX like command hadoop dfs -mkdir /foodir hadoop dfs -cat /foodir/myfile.txt hadoop dfs -rm /foodir myfile.txt HDFS Admin bin/hadoop dfsadmin safemode bin/hadoop dfsadmin report bin/hadoop dfsadmin -refreshnodes Web Interface Ex: http://localhost:50070 14-32

Web Interface (http://172.16.203.136:50070) 14-33

Web Interface (http://localhost:50070) Browse the file system Classification 14-34

POSIX Like command 14-35

Latest API Java API http://hadoop.apache.org/core/docs/current/api/ 14-36