Big Data Analytics. Lucas Rego Drumond

Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science
University of Hildesheim, Germany

Outline
1. Distributed File Systems

What is Big Data?
Big Data is about:
- Storing and accessing large amounts of (unstructured) data
- Processing high volume data streams
- Making sense of the data
- Predictive technologies

Overview
- Part III: Machine Learning Algorithms
- Part II: Large Scale Computational Models
- Part I: Distributed Database, Distributed File System

Why do we need a Distributed File System?

Read?
- Whole file?
- Specific part?

Write?
- Append to the end of the file?
- Insert content in the middle?

We want to:
- Perform multiple parallel reads and writes
- Have the files available even if one computer crashes (replication)
- Hide parallelization and distribution details

What is a Distributed File System?

File Namespace:
  /
  /home
  /home/lucas
  /home/lucas/big_file

File Namespace:
  /
  /home
  /home/john
  /home/john/big_file

Examples
- GFS (Google Inc.)
- HDFS (Apache Software Foundation)
- Ceph (Inktank, Red Hat)
- MooseFS (Core Technology / Gemius)
- Windows Distributed File System (DFS) (Microsoft)
- FhGFS (Fraunhofer)
- GlusterFS (Red Hat)
- Lustre
- Ibrix

Components
A typical distributed file system contains the following components:
- Clients: provide the interface to the user
- Chunk nodes: store chunks of files
- Master node: stores which parts of each file are on which chunk node

Distributed File Systems: The Google File System Architecture
[Figure: architecture diagram]

Distributed File Systems - Storing files

File: /home/john/big_file, split into Chunk 1, Chunk 2, Chunk 3, Chunk 4

Master node
File namespace:
  /
  /home
  /home/john
  /home/john/big_file

Chunk locations for /home/john/big_file:
  Chunk      Chunk nodes
  Chunk 1    C1, C7
  Chunk 2    C3, C5
  Chunk 3    C4, C6
  Chunk 4    C2, C8

Chunk nodes: C1, C2, C3, C4, C5, C6, C7, C8
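The master's bookkeeping can be pictured as a nested lookup table. The following is only a minimal sketch: the names `CHUNK_LOCATIONS`, `locate` and `live_replicas` are hypothetical, and only the chunk placement itself is taken from the slide.

```python
# Hypothetical in-memory model of the master node's metadata.
# The placement (chunk index -> chunk nodes) is copied from the slide.
CHUNK_LOCATIONS = {
    "/home/john/big_file": {
        1: ["C1", "C7"],
        2: ["C3", "C5"],
        3: ["C4", "C6"],
        4: ["C2", "C8"],
    }
}


def locate(path, chunk_index):
    """Return the chunk nodes that hold the given chunk of a file."""
    return CHUNK_LOCATIONS[path][chunk_index]


def live_replicas(path, chunk_index, failed_nodes):
    """Return the replicas that are still reachable if some nodes crashed."""
    return [n for n in locate(path, chunk_index) if n not in failed_nodes]


print(locate("/home/john/big_file", 3))                 # ['C4', 'C6']
print(live_replicas("/home/john/big_file", 1, {"C1"}))  # ['C7']
```

Because every chunk is listed on two nodes, a single crashed node still leaves one replica of each chunk available.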

Read Example
1. Client application to master node: read(/home/john/big_file, chunk 1)
2. Master node to client: (Chunk 1 handle, {C1, C7})
3. Client to one of the chunk nodes holding chunk 1: (Chunk 1 handle, byte range)
4. Chunk node to client: Chunk 1 data

Master node lookup used in step 2: /home/john/big_file, Chunk 1 is stored on C1 and C7
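The four read steps can be sketched as a small in-process simulation. This is an illustration under assumptions, not GFS code: the classes `Master` and `ChunkNode`, the function `client_read`, the handle string `"handle-1"` and the stored bytes are all made up for the example.

```python
class ChunkNode:
    """Hypothetical chunk node: stores chunk data keyed by chunk handle."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # handle -> bytes

    def read(self, handle, start, end):
        # Steps 3-4: serve the requested byte range of the chunk.
        return self.chunks[handle][start:end]


class Master:
    """Hypothetical master: maps (path, chunk index) to handle and replicas."""

    def __init__(self):
        self.table = {}

    def lookup(self, path, chunk_index):
        # Steps 1-2: answer with the chunk handle and replica locations.
        return self.table[(path, chunk_index)]


def client_read(master, nodes, path, chunk_index, start, end):
    handle, replicas = master.lookup(path, chunk_index)
    node = nodes[replicas[0]]  # any replica would do; take the first
    return node.read(handle, start, end)


nodes = {name: ChunkNode(name) for name in ("C1", "C7")}
for node in nodes.values():
    node.chunks["handle-1"] = b"hello big data"  # made-up chunk content

master = Master()
master.table[("/home/john/big_file", 1)] = ("handle-1", ["C1", "C7"])

data = client_read(master, nodes, "/home/john/big_file", 1, 0, 5)
print(data)  # b'hello'
```

Note that the master only answers the metadata question; the chunk data itself travels directly between the chunk node and the client.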

Write Example
- Make sure each replica contains the same data at all times
- One replica is designated as the primary replica
- Master pings the nodes to make sure they are alive

1. Client application to master node: write(/home/john/big_file, chunk 1)
2. Master node to client: (Chunk 1 handle, {C1, C7})
3. Client to chunk nodes: (Chunk 1 handle, data)
4. (Chunk 1 handle, offset)
5. Return status (success or failure)
6. done
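A rough sketch of the write steps is given below. It rests on assumptions that the slide does not spell out: the first replica in the list acts as primary, the primary picks the end of its own copy as the write offset (an append), and every replica applies the buffered data at that offset. The names `Replica` and `client_write` are hypothetical.

```python
class Replica:
    """Hypothetical chunk node taking part in a replicated write."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.pending = {}  # handle -> data pushed by the client (step 3)
        self.chunks = {}   # handle -> bytearray with the chunk content

    def push(self, handle, data):
        self.pending[handle] = data

    def apply(self, handle, offset):
        # Step 4: write the buffered data at the given offset.
        if not self.alive:
            return False  # step 5: report failure
        data = self.pending.pop(handle)
        chunk = self.chunks.setdefault(handle, bytearray())
        chunk[offset:offset + len(data)] = data
        return True       # step 5: report success


def client_write(replicas, handle, data):
    primary = replicas[0]  # assumption: first replica is the primary
    for r in replicas:
        r.push(handle, data)
    # Assumption: the primary chooses the offset (here: end of its copy).
    offset = len(primary.chunks.get(handle, b""))
    statuses = [r.apply(handle, offset) for r in replicas]
    return all(statuses)   # step 6: done only if every replica succeeded


c1, c7 = Replica("C1"), Replica("C7")
ok1 = client_write([c1, c7], "handle-1", b"abc")
ok2 = client_write([c1, c7], "handle-1", b"def")
same_after_two = c1.chunks == c7.chunks

c7.alive = False  # simulate a crashed chunk node
ok3 = client_write([c1, c7], "handle-1", b"ghi")
```

In this sketch a crashed replica makes the whole write report failure, which is one simple way to read "Return status (success or failure)"; how a real system recovers from that state is outside the scope of the example.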

Considerations
- Reads are very efficient operations
- Writes are efficient if they are appends to the end of the file
- Writes in the middle of a file can be problematic
- The primary replica decides the order in which writes are applied, so the data is always consistent across all replicas
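Why a single write order matters can be seen with two overlapping writes in the middle of a chunk. The sketch below uses made-up data and a hypothetical helper `apply_writes`; it only illustrates the ordering argument.

```python
def apply_writes(writes):
    """Apply (offset, data) writes in the given order to a 4-byte chunk."""
    chunk = bytearray(b"....")
    for offset, data in writes:
        chunk[offset:offset + len(data)] = data
    return bytes(chunk)


w1 = (0, b"AA")
w2 = (1, b"BB")  # overlaps w1 at position 1

# Without coordination each replica applies writes as they arrive,
# and the network may deliver them in different orders.
uncoordinated_c1 = apply_writes([w1, w2])
uncoordinated_c7 = apply_writes([w2, w1])
print(uncoordinated_c1, uncoordinated_c7)  # b'ABB.' b'AAB.'

# With a primary, every replica applies the primary's order.
primary_order = [w1, w2]
coordinated_c1 = apply_writes(primary_order)
coordinated_c7 = apply_writes(primary_order)
print(coordinated_c1 == coordinated_c7)    # True
```

Appends avoid the overlap problem in this example, since each write lands in its own region at the end of the file.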

GFS vs. HDFS

                    HDFS                                   GFS
Chunk size          128 MB                                 64 MB
Default replicas    3 DataNodes; each replica is stored    3 chunk nodes
                    as 2 files (data and generation
                    stamp)
Master              NameNode                               GFS Master
Chunk nodes         DataNode                               Chunk Server
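The effect of the two chunk sizes can be checked with a little arithmetic. In this sketch the 1 GiB file size is an arbitrary example, the helper `num_chunks` is hypothetical, and 1 MB is taken as 1024 * 1024 bytes.

```python
MB = 1024 * 1024


def num_chunks(file_size, chunk_size):
    """Number of chunks needed for a file (the last chunk may be partial)."""
    return -(-file_size // chunk_size)  # integer ceiling division


file_size = 1024 * MB  # example: a 1 GiB file

gfs_chunks = num_chunks(file_size, 64 * MB)
hdfs_chunks = num_chunks(file_size, 128 * MB)
print(gfs_chunks, hdfs_chunks)  # 16 8

# With 3 replicas per chunk, the cluster stores this many chunk copies:
print(gfs_chunks * 3, hdfs_chunks * 3)  # 48 24
```

Larger chunks mean fewer entries for the master to track per file, at the cost of coarser units for distribution.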

Google File System

Hadoop Distributed File System