Hacettepe University, Department of Computer Engineering. BBM467 Data Intensive Applications. Dr. Fuat Akal, akal@hacettepe.edu.tr
Problem: How do you scale up applications? Run jobs processing 100s of terabytes of data; it takes 11 days just to read that much data on one computer. You need lots of cheap computers. That fixes the speed problem (15 minutes on 1000 computers), but creates reliability problems: in large clusters, computers fail every day, and the cluster size is not fixed. You need common infrastructure, and it must be efficient and reliable.
Solution: Hadoop, an open-source Apache project. It is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop Core includes the Distributed File System, which distributes the data, and Map/Reduce, which distributes the application. It runs on Linux, Mac OS X, Windows, and Solaris, on commodity hardware.
Recall: Commodity Hardware Cluster. Typically a 2-level architecture: nodes are commodity PCs, 40 nodes per rack; the uplink from each rack is 8 gigabit, while rack-internal links are 1 gigabit.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. "Being a guy in the software business, we're always looking for names," Cutting said. "I'd been saving it for the right time."
Example Hadoop Cluster: ~20,000 machines running Hadoop; the largest clusters are currently 2,000 nodes; several petabytes of user data; hundreds of thousands of jobs every month. These numbers are probably outdated and still growing.
Who Uses Hadoop? Amazon, AOL, Facebook, Fox Interactive Media, Google, IBM, the New York Times, Yahoo!, and more at http://wiki.apache.org/hadoop/poweredby
Hadoop Ecosystem:
Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
Avro: a data serialization system.
Cassandra: a scalable multi-master database with no single points of failure.
Chukwa: a data collection system for managing large distributed systems.
HBase: a scalable, distributed database that supports structured data storage for large tables.
Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: a scalable machine learning and data mining library.
Pig: a high-level data-flow language and execution framework for parallel computation.
ZooKeeper: a high-performance coordination service for distributed applications.
Hadoop Components: The Hadoop Distributed File System (HDFS) provides a single namespace for the entire cluster and replicates data for fault tolerance. The MapReduce framework executes user jobs specified as map and reduce functions and manages work distribution and fault tolerance.
HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
Assumptions and Goals: Detection of hardware faults and quick, automatic recovery from them. Designed for batch processing rather than interactive use. Support for large files. A write-once-read-many access model for files. Interfaces for applications to move themselves closer to where the data is located. Easy portability from one platform to another.
HDFS follows a master-slave model. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories, and determines the mapping of blocks to DataNodes. The DataNodes manage the storage attached to the nodes they run on. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The DataNodes serve read and write requests from the file system's clients, and also perform block creation, deletion, and replication upon instruction from the NameNode.
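A minimal sketch of how a client exercises this division of labor through the HDFS Java API (the path below is hypothetical, and a cluster address is assumed to be configured in core-site.xml): the FileSystem handle asks the NameNode for metadata and block locations, while the file contents are streamed from the DataNodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // client handle; metadata ops go to the NameNode
        Path path = new Path("/user/demo/input.txt");    // hypothetical file

        // open() asks the NameNode for block locations; the stream then
        // reads the blocks directly from the DataNodes that hold them.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```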
HDFS Architecture: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfsdesign.html
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system.
Data Replication: HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; the blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file; the replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly; a Blockreport contains a list of all blocks on a DataNode.
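A sketch of controlling replication per file through the Java API (the path and contents are hypothetical): the replication factor is passed at creation time and later changed with setReplication.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/demo/report.csv");   // hypothetical file

        // Create the file with a replication factor of 2 instead of the cluster default.
        try (FSDataOutputStream out = fs.create(path, (short) 2)) {
            out.writeUTF("hello HDFS\n");
        }

        // Later, raise the replication factor to 3; the NameNode schedules
        // the extra copies asynchronously.
        fs.setReplication(path, (short) 3);
    }
}
```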
Block Placement: Files are split into fixed-size blocks (default 64 MB) and stored on DataNodes. Data blocks are replicated for fault tolerance and fast access (the default is 3). Where to put a given block by default? The first copy is written to the node creating the file (write affinity); the second copy is written to a DataNode within the same rack; the third copy is written to a DataNode in a different rack.
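These defaults are ordinary configuration values; a sketch of overriding them for a single client, assuming the Hadoop 2.x property names (dfs.blocksize, dfs.replication; older releases use dfs.block.size):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks for files written by this client
        conf.setInt("dfs.replication", 3);                 // three copies per block
        FileSystem fs = FileSystem.get(conf);

        // Files created through this FileSystem handle use the overridden defaults;
        // the NameNode still decides where each replica goes (write affinity,
        // same rack, different rack).
        fs.create(new Path("/user/demo/big.log")).close();  // hypothetical path
    }
}
```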
Replica Selection: To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, a replica that is resident in the local data center is preferred over any remote replica.
Disk Failures: Each DataNode periodically sends a Heartbeat message to the NameNode. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new I/O requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. DataNode death may cause the replication factor of some blocks to fall below their specified value; the NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary.
Re-replication: The need for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.
Cluster Rebalancing: The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
The Workflow: 1) Load data into the cluster (HDFS writes). 2) Analyze the data (MapReduce). 3) Store results in the cluster (HDFS). 4) Read the results from the cluster (HDFS reads).
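A sketch of the HDFS ends of this workflow, with hypothetical local and HDFS paths; steps 2 and 3 stand in for the MapReduce job covered next.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WorkflowExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // 1. Load data into the cluster (HDFS write).
        fs.copyFromLocalFile(new Path("/tmp/logs.txt"),       // local source
                             new Path("/user/demo/input/"));  // HDFS destination

        // Steps 2 and 3: a MapReduce job reading /user/demo/input and
        // writing its results to /user/demo/output would run here.

        // 4. Read the results back from the cluster (HDFS read).
        fs.copyToLocalFile(new Path("/user/demo/output/part-r-00000"), // typical reducer output name
                           new Path("/tmp/wordcount-result.txt"));
    }
}
```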
Motivation for MapReduce: Scalability to large data volumes: scanning 100 TB on 1 node @ 50 MB/s takes 24 days; scanning on a 1000-node cluster takes 35 minutes. Cost-efficiency: commodity nodes (cheap, but unreliable), commodity network, automatic fault tolerance (fewer admins), easy to use (fewer programmers). The MapReduce architecture provides automatic parallelization and distribution, fault tolerance, I/O scheduling, and monitoring and status updates.
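A quick check of the scan-time arithmetic, assuming 1 TB = 10^12 bytes:

\[
\frac{100 \times 10^{12}\ \text{B}}{50 \times 10^{6}\ \text{B/s}} = 2 \times 10^{6}\ \text{s} \approx 23\ \text{days},
\qquad
\frac{2 \times 10^{6}\ \text{s}}{1000\ \text{nodes}} = 2000\ \text{s} \approx 33\ \text{minutes},
\]

roughly the 24 days and 35 minutes quoted on the slide.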
Distributed Grep (figure): the very big input is split; each split is processed by a grep task that produces its matches; the per-split matches are concatenated (cat) into the complete set of matches.
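A minimal sketch of the grep step as a map-only Hadoop job (the pattern and class name are illustrative; with zero reduce tasks, the emitted lines are written directly as the job output, and the final "cat" is just a concatenation of the output files):

```java
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only grep: emit every input line that matches the pattern.
public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Pattern pattern = Pattern.compile("ERROR"); // hypothetical search pattern

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        if (pattern.matcher(line.toString()).find()) {
            context.write(line, NullWritable.get());
        }
    }
}
```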
Distributed Word Count (figure): the very big input is split; each split is processed by a count task; the per-split counts are merged into the final merged count.
Map + Reduce (figure): the very big input flows through MAP, a partitioning function, and REDUCE to produce the result. Map accepts an input key/value pair and emits intermediate key/value pairs. Reduce accepts an intermediate key together with the list of values for that key (key/value*) and emits output key/value pairs.
MapReduce Model: Two primitive operations:
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(k3, v3)
Each map operation processes one input key/value pair and produces a set of key/value pairs. Each reduce operation merges all intermediate values (produced by map operations) for a particular key and produces final key/value pairs. Operations are organized into tasks: map tasks apply the map operation to a set of key/value pairs; reduce tasks apply the reduce operation to intermediate key/value pairs. Each MapReduce job comprises a set of map tasks and (optional) reduce tasks.
MapReduce Model
Word Count Example
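As a concrete version of the word count example, here is a sketch along the lines of the classic Hadoop WordCount (new MapReduce API; input and output paths are taken from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each word in the line, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts emitted for a given word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```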
An Optimization: The Combiner. A local aggregation function for repeated keys produced by the same map. It applies to associative operators like sum, count, and max, and decreases the size of the intermediate data.
Word Count Example (Combiner) (figure): the combiner merges repeated keys from a single map locally, e.g. (Apple, 1) and (Apple, 1) become (Apple, 2) before the shuffle, so records such as (Apple, 2) and (Plum, 1) are what reach the reducers.
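Wiring a combiner into the word count sketch above is a single extra line in the driver; the reducer can double as the combiner because summing is associative:

```java
// In WordCount.main(), after setReducerClass(): aggregate repeated keys
// locally on each map's output before the data is shuffled to the reducers.
job.setCombinerClass(IntSumReducer.class);
```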
Joining Two Large Datasets
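A common MapReduce implementation of a join between two large datasets is the reduce-side (repartition) join. A minimal sketch, assuming two comma-separated inputs with the join key in the first field; the "orders"/"customers" file names and tags are hypothetical:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

    // Map: tag every record with its source and emit it under its join key.
    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text value, Context context)
                throws IOException, InterruptedException {
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = file.startsWith("orders") ? "O" : "C";   // hypothetical input file names
            String[] fields = value.toString().split(",", 2);
            if (fields.length == 2) {
                context.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
            }
        }
    }

    // Reduce: all records sharing a join key arrive together; pair them across sources.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> orders = new ArrayList<>();
            List<String> customers = new ArrayList<>();
            for (Text v : values) {
                String[] parts = v.toString().split("\t", 2);
                if ("O".equals(parts[0])) orders.add(parts[1]);
                else customers.add(parts[1]);
            }
            for (String customer : customers) {
                for (String order : orders) {
                    context.write(key, new Text(customer + "," + order)); // one joined record per pair
                }
            }
        }
    }
}
```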
Joining Large and Small Datasets
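When one of the two inputs fits in memory, a map-side (broadcast) join avoids the shuffle entirely: the small file is shipped to every mapper via the distributed cache and loaded into a hash map in setup(). A sketch, assuming a small comma-separated key,value file registered with job.addCacheFile(...):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side (broadcast) join: the small dataset is loaded once per mapper in setup(),
// then each record of the large dataset is joined locally, with no shuffle needed.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> small = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes the small file was registered with job.addCacheFile(...) and is
        // symlinked into the task's working directory (the default under YARN).
        Path cached = new Path(context.getCacheFiles()[0].getPath());
        try (BufferedReader reader = new BufferedReader(new FileReader(cached.getName()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] kv = line.split(",", 2);
                if (kv.length == 2) {
                    small.put(kv[0], kv[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",", 2);
        if (fields.length == 2) {
            String match = small.get(fields[0]);   // look up the join key in the in-memory table
            if (match != null) {
                context.write(new Text(fields[0]), new Text(fields[1] + "," + match));
            }
        }
    }
}
```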
Fault Tolerance: Intermediate data between mappers and reducers is materialized, which allows simple and straightforward fault tolerance. What if a task (map or reduce) fails? The TaskTracker detects the failure, sends a message to the JobTracker, and the JobTracker re-schedules the task. What if a DataNode fails? Both the NameNode and the JobTracker detect the failure; all tasks on the failed node are re-scheduled, and the NameNode replicates the user's data to another node. What if the NameNode or JobTracker fails? The entire cluster is down.
Fault Tolerance: If a task is going slowly (a straggler), launch a second copy of the task on another node, take the output of whichever copy finishes first, and kill the other one. This speculative execution is critical for performance in large clusters ("everything that can go wrong will").
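Speculative execution of this kind is switched on or off per job; a sketch using the Hadoop 2.x property names (older releases use the mapred.* equivalents):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);     // launch backup copies of slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", false); // but not of reduce tasks
        Job job = Job.getInstance(conf, "speculation demo");
        // ... set mapper, reducer, input, and output as usual, then submit the job.
    }
}
```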
Oracle Big Data Appliance: http://www.cloudera.com
Teradata Aster Big Analytics Appliance: http://hortonworks.com/
A Good Starting Point for Hadoop: Hortonworks provides a sandbox as a self-contained virtual machine. No data center, no cloud service, and no internet connection needed! http://hortonworks.com/products/hortonworks-sandbox/
Acknowledgement - 1: The course material used for this class is mostly taken and/or adapted* from the course materials of the Big Data class given by Nesime Tatbul and Donald Kossmann at ETH Zurich (http://www.systems.ethz.ch/). (*) The original course material has been reduced to fit the needs of BBM467; the original slides were not used as they are.
Acknowledgement - 2: Some material used for this lecture is taken and/or adapted from the course materials of Matei Zaharia (UC Berkeley RAD Lab), Owen O'Malley (Yahoo!), Jerome Mitchell (Indiana University), Xuanhua Shi, Dr. Bing Chen, and Shadi Ibrahim.