Distributed Computing Case Study

Aron Sharp
7 years ago

1 Distributed Computing Case Study

2 Outline
What is distributed computing
Case study: Hadoop (HDFS and MapReduce)
Case study: Gluster File System

3 What is Distributed Computing/System?
Distributed computing: a field of computer science that studies distributed systems; also, the use of distributed systems to solve computational problems.
Distributed system (Wikipedia): there are several autonomous computational entities, each of which has its own local memory. The entities communicate with each other by message passing.
Distributed system (Operating System Concepts): the processors communicate with one another through various communication lines, such as high-speed buses or telephone lines. Each processor has its own local memory.

4 What is Distributed Computing/System?
Distributed program: a computer program that runs in a distributed system.
Distributed programming: the process of writing distributed programs.

5 What is Distributed Computing/System?
Common properties
Fault tolerance: when one or more nodes fail, the system as a whole keeps working, although performance may degrade. This requires checking the status of each node.
Each node plays a partial role: each computer has only a limited, incomplete view of the system and may know only one part of the input.
Resource sharing: each user can share the computing power and storage resources of the system with other users.
Load sharing: dispatching tasks across the nodes spreads the load over the whole system.
Easy to expand: adding nodes should take little time, ideally none.
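The "check the status of each node" requirement is commonly met with heartbeats. Below is a minimal sketch, assuming a timeout-based rule; the class name `HeartbeatMonitor` and the timeout value are illustrative assumptions, not part of the slides.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of the last heartbeat

    def record_heartbeat(self, node, now=None):
        # 'now' can be passed explicitly to make the behaviour easy to test.
        self.last_seen[node] = time.monotonic() if now is None else now

    def live_nodes(self, now=None):
        now = time.monotonic() if now is None else now
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )

    def failed_nodes(self, now=None):
        now = time.monotonic() if now is None else now
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record_heartbeat("node-a", now=0)
monitor.record_heartbeat("node-b", now=0)
monitor.record_heartbeat("node-a", now=8)   # node-b stays silent
print(monitor.failed_nodes(now=12))
```

A node is treated as failed only because it has been silent for longer than the timeout, so a slow network can look like a failure; real systems tune the timeout to balance the two.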

6 CASE STUDY - HADOOP

7 Quick overview
Features
HDFS
Map-Reduce Framework
Paramount Q

8 Features
Large files: gigabytes to terabytes
Write once, read many
Commodity hardware

9 HDFS
NameNode: manages the file system namespace and regulates access to files by clients; determines the mapping of blocks to DataNodes; keeps the fsimage and the edit log.
DataNode: manages the storage attached to the node it runs on; saves CRC codes; sends heartbeats to the NameNode.
Each file is split into chunks, and each chunk is stored on some of the DataNodes.
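The chunking and CRC bookkeeping described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the round-robin placement is an assumption made for brevity (real HDFS placement is more involved).

```python
import zlib


def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, datanodes, replication):
    """Build a block map: block index -> checksum and the DataNodes holding it.

    Placement is plain round-robin, which is an assumption of this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("replication cannot exceed the number of DataNodes")
    block_map = {}
    for index, block in enumerate(blocks):
        replicas = [datanodes[(index + r) % len(datanodes)]
                    for r in range(replication)]
        block_map[index] = {"crc32": zlib.crc32(block), "datanodes": replicas}
    return block_map


def verify_block(block: bytes, expected_crc: int) -> bool:
    """Recompute the CRC on read and compare it with the stored value."""
    return zlib.crc32(block) == expected_crc


blocks = split_into_blocks(b"abcdefghij", block_size=4)
block_map = place_blocks(blocks, ["dn1", "dn2", "dn3"], replication=2)
print(block_map[2]["datanodes"])
```

The block map plays the role of the NameNode's block-to-DataNode mapping, while the stored checksum is what a DataNode would use to detect a corrupted chunk.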

10 HDFS
Secondary NameNode: responsible for merging the fsimage and the edit log. It is not a NameNode.

11 HDFS architecture

12 Secondary namenode
Edit log: the transaction log. It is updated before the in-memory content is changed, and it is written on every request sent to the NameNode.
fsimage: a persistent checkpoint of the file system metadata.
Secondary NameNode: responsible for merging the edit log into the fsimage.
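The relationship between the edit log and the fsimage can be sketched as a write-ahead log plus a checkpoint. The class and function names below are hypothetical, and the "metadata" is reduced to a path-to-size dictionary.

```python
class TinyNamespace:
    """Toy stand-in for NameNode metadata (not the real HDFS data structures)."""

    def __init__(self):
        self.fsimage = {}     # last persistent checkpoint: path -> size
        self.edit_log = []    # transactions recorded since that checkpoint
        self.in_memory = {}   # live namespace served to clients

    def create(self, path, size):
        self.edit_log.append(("create", path, size))  # log first
        self.in_memory[path] = size                   # then update memory

    def delete(self, path):
        self.edit_log.append(("delete", path, None))
        self.in_memory.pop(path, None)


def replay(fsimage, edit_log):
    """Apply logged transactions to a copy of the checkpoint."""
    image = dict(fsimage)
    for operation, path, size in edit_log:
        if operation == "create":
            image[path] = size
        elif operation == "delete":
            image.pop(path, None)
    return image


def checkpoint(namespace):
    """The merge step: fold the edit log into a new fsimage, then truncate the log."""
    namespace.fsimage = replay(namespace.fsimage, namespace.edit_log)
    namespace.edit_log = []


ns = TinyNamespace()
ns.create("/a", 1)
ns.create("/b", 2)
ns.delete("/a")
checkpoint(ns)
print(ns.fsimage, ns.edit_log)
```

Without the periodic merge the log would grow without bound and a restart would have to replay every transaction, which is why the merge is delegated to a separate process.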

13 Secondary namenode (figure from Hadoop: The Definitive Guide)

14 Map-Reduce Framework
JobTracker: responsible for dispatching jobs to the TaskTrackers and for job management, such as scheduling and removing jobs.
TaskTracker: responsible for executing the job. A TaskTracker usually launches a separate JVM to run it.

15 Map-Reduce Framework (figure from Hadoop: The Definitive Guide)

16 Summary - Hadoop
Hadoop provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
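To make the Map/Reduce paradigm concrete, here is a single-process word-count sketch in plain Python. It does not use Hadoop's API; the function names are illustrative, and the three phases run sequentially here, whereas a cluster would spread the map and reduce calls over many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["a b a", "B c"]))
```

Because each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, any fragment can be rerun on another node after a failure, which is the property the summary refers to.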

17 CASE STUDY - GLUSTER FILESYSTEM

18 Quick overview
Introduction
Gluster File System design
Example: 4 nodes GlusterFS

19 Introduction
GlusterFS is an open source clustered file system that runs on industry-standard hardware from any vendor. It is claimed to deliver multiple times the scalability and performance of conventional storage at a fraction of the cost.
Diagram: + + = N x Performance & Capacity

20 GlusterFS Overview From GlusterFS Datasheet


21 GlusterFS Design (architecture diagram)
Storage clients: a cluster of clients (supercomputer, data center), each running a GlusterFS client with a clustered volume manager and a clustered I/O scheduler.
Compatibility with MS Windows and other Unices: storage gateways run a GLFS client and re-export the volume over NFS/Samba (TCP/IP, GigE).
Transport: InfiniBand RDMA or TCP/IP.
Servers: GlusterFS clustered filesystem on the x86-64 platform; storage bricks 1 to 4, each running GLFSD and exporting a GlusterFS volume.

22 Key Design Considerations
Capacity scaling: scalable beyond petabytes
I/O throughput scaling: pluggable clustered I/O schedulers; advantage of RDMA transport
Reliability: non-stop storage
Ease of manageability: self-heal; NFS-like disk layout
Elegance in design: stackable modules; not tied to I/O profiles, hardware, or OS

23 Translators
Performance translators: 1. Read Ahead 2. Write Behind 3. Threaded I/O 4. IO-Cache
Clustering translators: 1. Automatic File Replication (AFR) 2. Stripe 3. Unify
Scheduling translators: 1. Adaptive Least Usage (ALU) 2. Non-uniform filesystem architecture (NUFA) 3. Random 4. Round-Robin
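A scheduling translator decides which brick receives a new file. The sketch below shows the idea behind two of the listed policies. It is not GlusterFS code: the class names are hypothetical, and `LeastUsedScheduler` is a deliberately simplified stand-in that looks at used bytes only, whereas the real ALU scheduler is more elaborate.

```python
import itertools


class RoundRobinScheduler:
    """Hands out bricks in a fixed rotation."""

    def __init__(self, bricks):
        self._cycle = itertools.cycle(bricks)

    def pick(self):
        return next(self._cycle)


class LeastUsedScheduler:
    """Picks the brick with the fewest used bytes (simplified assumption)."""

    def __init__(self, used_bytes):
        self.used_bytes = dict(used_bytes)  # brick name -> bytes in use

    def pick(self, file_size):
        brick = min(self.used_bytes, key=self.used_bytes.get)
        self.used_bytes[brick] += file_size  # account for the new file
        return brick


round_robin = RoundRobinScheduler(["brick1", "brick2"])
print([round_robin.pick() for _ in range(3)])

least_used = LeastUsedScheduler({"brick1": 10, "brick2": 5})
print(least_used.pick(file_size=10), least_used.pick(file_size=1))
```

Both classes expose the same `pick` idea, which is what lets a policy be swapped without touching the layers stacked above it.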

24 FUSE
What's FUSE? It stands for Filesystem in USErspace.
It makes it easy to write new filesystems: 1. without knowing how the kernel works 2. without breaking unrelated things 3. more quickly and easily than traditional file systems built as kernel modules

25 FUSE structure From

26 How FUSE Works
1. The application makes a file-related syscall.
2. The kernel figures out that the file is in a mounted FUSE filesystem.
3. The FUSE kernel module forwards the request to your userspace FUSE app.
4. Your app tells FUSE how to reply.
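The request flow above can be mimicked in a few lines of plain Python. This is a simulation only: it does not use the FUSE library or any kernel interface, and `SimulatedKernel` and `InMemoryFs` are hypothetical names standing in for the kernel module and the userspace app.

```python
class InMemoryFs:
    """Stands in for the userspace FUSE app: it decides how to reply."""

    def __init__(self, files):
        self.files = files  # path inside the mount -> file content

    def handle(self, operation, relative_path):
        if operation == "read":
            return self.files.get(relative_path, b"")
        if operation == "list":
            return sorted(self.files)
        raise ValueError(f"unsupported operation: {operation}")


class SimulatedKernel:
    """Stands in for the kernel plus the FUSE module: it only routes requests."""

    def __init__(self):
        self.mounts = {}  # mount point -> userspace handler

    def mount(self, mount_point, handler):
        self.mounts[mount_point.rstrip("/")] = handler

    def syscall(self, operation, path):
        # Work out whether the path lives inside a mounted filesystem.
        for mount_point, handler in self.mounts.items():
            if path == mount_point or path.startswith(mount_point + "/"):
                relative = path[len(mount_point):] or "/"
                # Forward the request and return whatever the app replies.
                return handler.handle(operation, relative)
        raise FileNotFoundError(path)


kernel = SimulatedKernel()
kernel.mount("/mnt/demo", InMemoryFs({"/hello.txt": b"hi"}))
print(kernel.syscall("read", "/mnt/demo/hello.txt"))
```

The point of the split is that all filesystem logic lives in `InMemoryFs`, so it can be changed or restarted without touching the routing layer.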

27 Example: 4 nodes GlusterFS (diagram)
Clients: users, virtual machines (Xen + KVM), a web application, and MySQL, all using the global namespace /mnt/glusterfs.
Transport: TCP/IP over GigE.
Servers: four GlusterFS servers (vlab01, vlab02, vlab03, vlab04), each exporting a POSIX volume on a local filesystem (Ext4, Ext3, XFS, Ext3).
Storage virtualization: GlusterFS (AFR + Unify), about 1.8 TB.

28 The view of GlusterFS client
$ df -h
Filesystem                     Size  Used  Avail  Use%  Mounted on
/dev/sda1                      901G  115G  740G   14%   /
tmpfs                          4.0G  0     4.0G   0%    /dev/shm
/etc/glusterfs/glusterfs.vol   1.8T  243G  1.6T   13%   /mnt/glusterfs
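One way to read the 1.8T figure is as raw capacity divided by the number of copies. The arithmetic below is a sketch under two assumptions that the slides do not state: each of the four servers contributes a brick about the size of the 901G disk shown by `df`, and AFR keeps two copies of every file. The function name is hypothetical.

```python
def usable_capacity_gib(brick_sizes_gib, replica_count):
    """Usable space of equally sized bricks when every file is stored replica_count times."""
    if len(brick_sizes_gib) % replica_count != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return sum(brick_sizes_gib) / replica_count


bricks = [901, 901, 901, 901]          # assumed: one 901G brick per server
usable = usable_capacity_gib(bricks, replica_count=2)
print(usable, "GiB =", round(usable / 1024, 1), "TiB")
```

Under those assumptions the result is 1802 GiB, which `df -h` would display as 1.8T, consistent with the mounted volume size.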

29 The view of GlusterFS server
Unify Volume (each file is stored whole on one brick)
BRICK1: work.ods, corporate.odp, driver.c
BRICK2: benchmark.pdf, test.ogg, test.m4a, initcore.c
BRICK3: mylogo.xcf, driver.c, ether.c
Mirror Volume (every brick holds a full copy)
BRICK1: accounts-2007.db, backup.db.zip, accounts-2006.db
BRICK2: accounts-2007.db, backup.db.zip, accounts-2006.db
BRICK3: accounts-2007.db, backup.db.zip, accounts-2006.db
Stripe Volume (every brick holds a piece of each file)
BRICK1: north-pole-map, dvd-1.iso, xen-image
BRICK2: north-pole-map, dvd1.iso, xen-image
BRICK3: north-pole-map, dvd1.iso, xen-image
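The three layouts in the server view differ only in how files map to bricks. The sketch below reproduces that difference with hypothetical helper functions; the round-robin choice in `unify_layout` is an assumption, since the real choice is made by a scheduling translator.

```python
def unify_layout(file_names, bricks):
    """Unify: each file lands whole on exactly one brick."""
    layout = {brick: [] for brick in bricks}
    for index, name in enumerate(file_names):
        layout[bricks[index % len(bricks)]].append(name)
    return layout


def mirror_layout(file_names, bricks):
    """Mirror: every brick holds a copy of every file."""
    return {brick: list(file_names) for brick in bricks}


def stripe_layout(data, bricks, stripe_size):
    """Stripe: one file is cut into pieces dealt across the bricks in turn."""
    layout = {brick: [] for brick in bricks}
    for offset in range(0, len(data), stripe_size):
        brick = bricks[(offset // stripe_size) % len(bricks)]
        layout[brick].append(data[offset:offset + stripe_size])
    return layout


bricks = ["brick1", "brick2", "brick3"]
print(unify_layout(["work.ods", "test.ogg", "ether.c", "driver.c"], bricks))
print(mirror_layout(["backup.db.zip"], bricks))
print(stripe_layout(b"0123456789", bricks, stripe_size=4))
```

Unify adds capacity, mirror adds redundancy, and stripe spreads the I/O of a single large file over several bricks, which is why the same file names appear on every brick in the mirror and stripe rows.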

30 Summary - GlusterFS
GlusterFS clusters together storage building blocks, aggregating disk and memory resources and managing your data in a single global namespace.
GlusterFS is based on a stackable architecture that can be optimized for specific application profiles with simple plug-in modules, optimizing performance for a wide range of workloads.

31 Reference
Tom White, Hadoop: The Definitive Guide
Silberschatz and Galvin, Operating System Concepts
