With Saurabh Singh, singh.903@osu.edu, The Ohio State University, February 11, 2016
Overview
1 Spark Requirements
2 Spark Ecosystem
3 Resilient Distributed Datasets (RDDs)
4 Spark Example Code
5 Spark vs MapReduce
Source: [Tutorials Point] Saurabh Singh CSE 5194, Spring 2016
Continuous scaling of traditional database servers fails to meet these requirements! Source: [Tutorials Point]
Types of Big Data
Structured data: relational data
Semi-structured data: XML data
Unstructured data: Word, PDF, text, media logs
Challenges with Big Data
Capturing data, storage, searching, sharing, transfer, analysis, presentation
What is Hadoop?
Hadoop is an open-source library for data-intensive distributed applications, based on the MapReduce framework
Allows distributed processing of large datasets across clusters of computers using simple programming models
Designed to scale up from a single server to thousands of machines, each offering local computation and storage
Started in 2005
Currently a registered trademark of the Apache Software Foundation
HDFS Overview
HDFS stands for Hadoop Distributed File System
Designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware
Complex issues
Chances of data loss due to machine failures
Complication of network programming, because it is a network-based file system
HDFS Master/Slave architecture Source: [Hadoop Guru]
NameNode Block Placement
One replica on the local node, the second replica on a remote rack, the third replica on the same remote rack; additional replicas (if replication factor > 3) are placed randomly
Clients read from the nearest replica
NameNode Failure
A single point of failure
Transaction logs (EditLogs) are stored in multiple directories:
A directory on the local file system
A directory on a remote file system (NFS/CIFS)
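The placement rule above can be sketched in Python. This is an illustrative simulation only (the rack/node names and the helper function are hypothetical, not HDFS's actual placement code):

```python
import random

def place_replicas(local_node, local_rack, remote_racks, replication_factor=3):
    """Simulate HDFS default block placement: first replica on the
    local node, second on a node in one remote rack, third on a
    different node in that same remote rack; any extra replicas
    (replication factor > 3) are placed randomly."""
    rack, nodes = random.choice(list(remote_racks.items()))
    second, third = random.sample(nodes, 2)      # two distinct nodes, same rack
    replicas = [(local_rack, local_node), (rack, second), (rack, third)]
    all_nodes = [(r, n) for r, ns in remote_racks.items() for n in ns]
    while len(replicas) < replication_factor:
        replicas.append(random.choice(all_nodes))
    return replicas

placement = place_replicas("node1", "rack1",
                           {"rack2": ["node2", "node3"],
                            "rack3": ["node4", "node5"]})
# placement[0] is on the local node; placement[1] and placement[2]
# share one remote rack but sit on different nodes
```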
NameNode EditLogs
The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage
The FsImage and EditLog are stored as files in the NameNode's local file system
Secondary NameNode
Copies or merges the FsImage and transaction log from the NameNode into a temporary directory
Uploads the new FsImage to the NameNode, after which the transaction log on the NameNode is purged
Heartbeats
DataNodes send heartbeats to the NameNode every 3 seconds
The NameNode uses heartbeats to detect DataNode failure
Replication Engine
The NameNode detects DataNode failures
The NameNode:
Chooses new DataNodes for new replicas
Balances disk usage
Balances communication traffic to DataNodes
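The failure-detection idea can be sketched as follows. The 30-second timeout and node names are illustrative assumptions (HDFS's actual stale/dead thresholds differ):

```python
FAILURE_THRESHOLD = 30.0   # seconds without a heartbeat before a node is presumed dead

def dead_datanodes(last_heartbeat, now):
    """Return the DataNodes whose most recent heartbeat is too old.
    last_heartbeat maps node name -> timestamp of its last heartbeat."""
    return [node for node, ts in last_heartbeat.items()
            if now - ts > FAILURE_THRESHOLD]

# DataNodes heartbeat every 3 s; node2 has been silent for 40 s,
# so the NameNode would choose new DataNodes for its replicas
heartbeats = {"node1": 100.0, "node2": 65.0, "node3": 102.0}
print(dead_datanodes(heartbeats, now=105.0))   # ['node2']
```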
Storing & Querying in HDFS Source: [Hadoop Guru]
Good use cases for HDFS
Storing large datasets, which may be TBs or PBs or even more
Storing a wide variety of data: structured, unstructured, or semi-structured
Storing data on commodity hardware (economical)
Bad use cases for HDFS
Low-latency data access
Huge numbers of small files
Random file access
What is MapReduce?
MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner
Requirements
Provide a simple and powerful programming model: the map and reduce paradigm
Use large clusters of commodity machines: scale horizontally instead of scaling vertically
Isolate the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance: through redundancy and re-execution
Programming model
The computation takes a set of key/value pairs as input and produces a set of key/value pairs as output
The user of the framework expresses the computation using two functions: Map and Reduce
The Map function takes an input pair and produces a set of intermediate key/value pairs
The framework groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function
The Reduce function receives an intermediate key I with its set of values and merges them together
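As a concrete (if toy) illustration of this model, here is the classic word count expressed with user-defined Map and Reduce functions, run entirely in one Python process rather than on a cluster:

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit an intermediate (word, 1) pair for each word."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    """Reduce: merge all counts for one intermediate key."""
    return (key, sum(values))

def run_job(documents):
    # Shuffle/group: collect all intermediate values sharing the same key I
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run_job(["the quick fox", "the lazy dog", "the fox"]))
# {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The framework, not the user, supplies the grouping step in the middle; the user writes only `map_fn` and `reduce_fn`.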
Execution details Source: [Jeffrey Dean and Sanjay Ghemawat]
1 The input data is split into M pieces and many instances of the program are started
2 One of the instances is the master copy, while the rest are workers. In particular, there are M map tasks and R reduce tasks to assign
3 A worker assigned a map task processes the contents of the corresponding input split, generates key/value pairs from the input data, and passes each pair to the user-defined Map function
4 Periodically, the buffered pairs are written to local disk and partitioned into R regions by the partitioning function
5 A reduce worker reads the buffered data from the local disks of the map workers; the data is then sorted by the intermediate keys so that all occurrences of the same key are grouped together
6 The reduce worker passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition
7 When all map tasks and reduce tasks have been completed, the master returns control to the user program
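The partitioning function in step 4 is typically just a hash of the intermediate key modulo R, so every mapper routes the same key to the same reduce region. A minimal sketch (the byte-sum stand-in below is illustrative; it replaces a real hash only because Python's built-in `hash()` is randomized per process for strings):

```python
def partition(key, R):
    """Assign an intermediate key to one of R reduce regions.
    Deterministic across processes, unlike Python's str hash()."""
    return sum(key.encode()) % R   # illustrative stand-in for hash(key) mod R

R = 4
pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
regions = {r: [] for r in range(R)}
for key, value in pairs:
    regions[partition(key, R)].append((key, value))
# both ("apple", 1) pairs land in the same reduce region
```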
Hadoop Cluster Source: [Edureka!]
JobTracker
Responsible for taking in requests from a client and assigning TaskTrackers tasks to be performed
Tries to assign each task to the TaskTracker on the DataNode where the data is locally present
If that node fails, assigns the task to another TaskTracker where a replica of the data exists
TaskTracker
Accepts tasks (Map, Reduce, and Shuffle) from the JobTracker
Sends a heartbeat message periodically to the JobTracker to notify that it is alive, along with the number of free slots it has available to process tasks
Starts and monitors the tasks and sends progress/status information back to the JobTracker
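The locality preference can be sketched as follows (hypothetical node names; the real JobTracker also weighs rack locality and free slots, which this toy version ignores):

```python
def assign_task(split_replicas, alive_trackers):
    """Choose a TaskTracker for a map task: prefer a TaskTracker on a
    node that stores the input split locally; if that node is down,
    fall back to a node holding one of the replicas."""
    for node in split_replicas:        # nodes storing this split's replicas
        if node in alive_trackers:
            return node
    return None                        # no replica host alive

replicas = ["node1", "node3", "node4"]
print(assign_task(replicas, {"node1", "node2"}))   # node1 (data-local)
print(assign_task(replicas, {"node2", "node3"}))   # node3 (node1 failed)
```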
MapReduce Example Code Source: [Pietro Michiardi]
MapReduce Example Execution Source: [Xiaochong Zhang]
Key Features
Low-cost, unreliable commodity hardware
Extremely scalable
Fault tolerant
Easy to administer
Highly parallel
JobTracker-related issues in Hadoop 1.0
Limits scalability: the JobTracker is responsible for all of
Resource management
Job and task scheduling
Monitoring
Availability: if the JobTracker fails, all jobs must restart
Problem with resource utilization: DataNodes may be reserved for Reduce slots even when there is immediate need for those resources to be used as Mapper slots
Limitations in running non-MapReduce applications
Problems performing real-time analysis
Problems running a message-passing approach
Problems running ad-hoc queries
Source: [Saphana Tutorial]
Source: [Saphana Tutorial]
Improvements with YARN
YARN utilizes resources efficiently
There are no more fixed map-reduce slots
Hadoop can now run multiple applications, all sharing a common resource pool
It can even run applications that do not follow the MapReduce model
Backward compatibility
No more JobTracker and TaskTracker needed in Hadoop 2.0
Instead, we have two daemons: a cluster-wide ResourceManager and a node-specific NodeManager
What is HBase?
HBase is a distributed, column-oriented database built on top of the Hadoop file system
Horizontally scalable
HBase's data model is similar to Google's BigTable
Designed to provide quick random access to huge amounts of structured data
Leverages the fault tolerance provided by HDFS
Provides random real-time read/write access to data in HDFS
Reading/Writing Data
One can store data in HDFS either directly or through HBase
A data consumer reads/accesses the data in HDFS randomly using HBase
HBase sits on top of HDFS and provides read and write access
Source: [Tutorials Point]
HDFS vs HBase
HDFS
A distributed file system suitable for storing large files
Does not support fast individual record lookups
Provides high-latency batch processing
Provides only sequential access of data
HBase
A database built on top of HDFS
Provides fast lookups for larger tables
Provides low-latency access to single rows from billions of records (random access); no concept of batch processing
Internally uses hash tables and provides random access, storing the data in indexed HDFS files for faster lookups
What is Hive?
Hive is a data warehouse infrastructure tool to process structured data in Hadoop
It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy
Initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive
Features of Hive
It stores schema in a database and processed data into HDFS
It provides an SQL-type query language called HiveQL or HQL
It is familiar, fast, scalable, and extensible
It is not a relational database
It is not a language for real-time queries and row-level updates
Source: [Tutorials Point]
What is Pig?
Pig is used to analyze large sets of data by representing them as data flows
All data manipulation operations in Hadoop can be performed using Pig
Provides a high-level language known as Pig Latin
Various operators are available with which programmers can develop their own functions for reading, writing, and processing data
Originally created at Yahoo
Features of Pig
Pig Latin is a procedural language
The data model in Apache Pig is nested relational
Allows splits in the pipeline
Allows developers to store data anywhere in the pipeline
Provides operators to perform ETL (Extract, Transform, and Load) functions
Provides limited opportunity for query optimization
Schema is optional; we can store data without designing a schema
Source: [Tutorials Point]
Pig vs Hive
Apache Pig uses a language called Pig Latin
Pig Latin is a data flow language
Pig Latin is a procedural language and fits the pipeline paradigm
Apache Pig can handle structured, unstructured, and semi-structured data
Hive uses a language called HiveQL
HiveQL is a query processing language
HiveQL is a declarative language
Hive is mostly for structured data
Spark Requirements
Support applications that need to reuse a working set of data across multiple parallel operations, while retaining the scalability and fault tolerance of MapReduce
Iterative machine learning algorithms
Interactive data analytics
Be compatible with Hadoop, HDFS, and any Hadoop-supported storage system
Source: [Lisa Hua]
Spark Ecosystem Source: [Lisa Hua]
Resilient Distributed Datasets (RDDs)
What are RDDs?
A read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost
Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations
RDDs achieve fault tolerance through a notion of lineage: each RDD object contains a pointer to its parent and information about how the parent was transformed. Hence, if a partition of an RDD is lost, the RDD has sufficient information about how it was derived from other RDDs to rebuild just that partition
Construction of RDDs
From a file in a shared file system
By parallelizing a collection (e.g., an array)
By transforming an existing RDD
By changing the persistence (via cache or save) of an existing RDD
Parallel operations on RDDs
The reduce operation combines dataset elements using an associative function to produce a result
The collect operation sends all elements of the dataset to the driver program
The foreach operation passes each element through a user-provided function
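The lineage mechanism can be sketched with a toy class. This is illustrative only, not Spark's actual API: each derived dataset keeps a pointer to its parent and the transformation applied, so a lost partition is recomputed rather than restored from a replica:

```python
class ToyRDD:
    """A toy read-only, partitioned dataset that records its lineage."""
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions      # list of lists (one per machine)
        self.parent = parent              # pointer to the parent RDD
        self.transform = transform        # how the parent was transformed

    def map(self, fn):
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, transform=fn)

    def rebuild_partition(self, i):
        """Recompute just partition i from the parent via the lineage."""
        return [self.transform(x) for x in self.parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None              # simulate losing one partition
doubled.partitions[1] = doubled.rebuild_partition(1)
print(doubled.partitions)                 # [[2, 4], [6, 8]]
```

Only the lost partition is recomputed; the surviving partition is untouched, which is the key difference from replication-based recovery.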
Spark Example Code Source: [Lisa Hua]
Spark vs MapReduce Source: [Tudor Lapusan]
Hadoop Distributions
Amazon Web Services https://aws.amazon.com/elasticmapreduce/
Cloudera https://cloudera.com/products/apache-hadoop.html
Hortonworks http://hortonworks.com/hdp/
IBM http://www.ibm.com/analytics/us/en/technology/hadoop/
MapR https://www.mapr.com/products/apache-hadoop
References
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters http://static.googleusercontent.com/media/research.google.com/zh-cn/us/archive/mapreduce-osdi04.pdf
Lisa Hua, Spark Overview http://web.cse.ohio-state.edu/~panda/5194/papers/4o_spark_overview.pdf
Saphana Tutorial, How YARN Overcomes MapReduce Limitations in Hadoop 2.0 http://saphanatutorial.com/how-yarn-overcomes-mapreduce-limitations-in-hadoop-2-0/
Tutorials Point, HBase Overview http://www.tutorialspoint.com/hbase/hbase_overview.htm
Hadoop Guru, Hadoop Distributed File System (HDFS) http://hadoopguru.blogspot.com/2013/02/hadoop-distributed-file-system-hdfs.html
Core Servlets, Hadoop Tutorial http://www.coreservlets.com/hadoop-tutorial/
Tutorials Point, Apache Pig Overview http://www.tutorialspoint.com/apache_pig/apache_pig_overview.htm
Tutorials Point, Hive Introduction http://www.tutorialspoint.com/hive/hive_introduction.htm
Tutorials Point, Hadoop Big Data Overview http://www.tutorialspoint.com/hadoop/hadoop_big_data_overview.htm
Xiaochong Zhang, MapReduce Work Structure http://xiaochongzhang.me/blog/wp-content/uploads/2013/05/_Work_Structure.png
Tudor Lapusan, MapReduce vs Spark http://www.slideshare.net/tudorlapusan/map-reduce-vs-spark
Pietro Michiardi, Scalable Algorithm Design with MapReduce http://www.slideshare.net/michiard/scalable-algorithm-design-with-mapreduce
Edureka!, Hadoop Cluster http://www.slideshare.net/edurekain/hadoop-20-architecture-hdfs-federation-namenode-high-availability
The End