Linux Clusters Institute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Computing Environment (PACE) OIT/ART, Georgia Tech

Transcription

1 Linux Clusters Institute: Turning HPC cluster into a Big Data Cluster
Fang (Cherry) Liu, PhD
fang.liu@oit.gatech.edu
A Partnership for an Advanced Computing Environment (PACE)
OIT/ART, Georgia Tech

2 Targets for this session
Target audience: HPC system admins who want to support a Hadoop cluster
Points of interest:
- Big Data is Common
- Challenges and Tools
- Hadoop vs HPC
- Hadoop Core
- Hadoop EcoSystem
  - Core parts: HDFS and MapReduce
  - Projects
- Hadoop Basics
- Configure Hadoop Cluster on HPC cluster
- PACE Hadoop Cluster
- Hadoop Advance
May

3 Big Data is Common
- The rendering of Avatar reportedly required over 1 petabyte of storage space according to BBC's Clickbits, which is the equivalent of 500 hard drives of 2 TB each. That's equal to a 32-year-long MP3 file. The computing core (34 racks, each with four chassis of 32 machines) adds up to some 40,000 processors and 104 terabytes of RAM. http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storage-space-equivalent-32-year-long-mp3/
- Facebook's data warehouses grow by over half a petabyte every 24 hours (2012). http:// facebook_open_sources_corona/
- An average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day (2009). http://dl.acm.org/citation.cfm? id= &dl=acm&coll=dl&cfid= &cftoken=
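
The Avatar figures quoted on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, assuming decimal units (1 PB = 1000 TB); the variable names are illustrative and not taken from the slides.

```python
# Sanity-check the storage and machine counts quoted for the Avatar rendering.
# Assumption: decimal units (1 PB = 1000 TB), as drive vendors use.

PETABYTE_IN_TB = 1000        # 1 PB expressed in TB
DRIVE_SIZE_TB = 2            # each hard drive holds 2 TB

# Number of 2 TB drives needed to hold 1 PB.
drives_needed = PETABYTE_IN_TB // DRIVE_SIZE_TB

# Machine count: 34 racks, four chassis per rack, 32 machines per chassis.
RACKS = 34
CHASSIS_PER_RACK = 4
MACHINES_PER_CHASSIS = 32
machines = RACKS * CHASSIS_PER_RACK * MACHINES_PER_CHASSIS

# The slide quotes "some 40,000 processors" for the whole computing core.
PROCESSORS = 40_000
processors_per_machine = PROCESSORS / machines

print(f"2 TB drives for 1 PB: {drives_needed}")
print(f"machines in the computing core: {machines}")
print(f"processors per machine: {processors_per_machine:.1f}")
```

The drive count matches the 500 drives quoted on the slide, and the rack description works out to 4,352 machines, or roughly nine processors per machine.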

4 Big Data: Three challenges
- Volume (Storage): the size of the data
- Velocity (Processing speed): the latency of data processing relative to the growing demand
- Variety (structured vs. unstructured data): the diversity of sources, formats, quality, structures

5 What Tools to Use?
Hadoop is open-source software for reliable, scalable, distributed computing.
- Created by Doug Cutting and Michael Cafarella while at Yahoo, named after Doug's son's toy elephant
- Written in Java
- Scales to thousands of machines with linear scalability
- Uses a simple programming model (MapReduce)
- Fault tolerant (HDFS)
Spark is a fast and general engine for large-scale data processing.
- It claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- It provides high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
- It runs on Hadoop, Mesos, standalone, or in the cloud.
- It can access data sources including HDFS, Cassandra, HBase, Hive, and Amazon S3.

6 Hadoop vs. Conventional HPC
Hadoop:
- A cluster of machines collocates the data with the compute nodes
- Moves computation to data
- MapReduce operates at a higher level; data flow is implicit
- Fault tolerance through data replication; easier to rerun the Map or Reduce tasks
- Restricted to data-processing problems
HPC:
- A cluster of machines accesses a shared filesystem
- Moves data to computation; network bandwidth is the bottleneck
- Message Passing Interface (MPI) explicitly handles the mechanics of the data flow
- Needs to explicitly manage checkpointing and recovery
- Can solve more complex algorithms
4-8 August

7 Hadoop Core Characteristics
- Distribution: instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.
- Horizontal scalability: it is easy to extend a Hadoop cluster by just adding new machines. Every new machine increases the total storage and processing power of the Hadoop cluster.
- Fault tolerance: Hadoop continues to operate even when a few hardware or software components fail to work properly.
- Cost optimization: Hadoop runs on standard hardware; it does not require expensive servers.
- Programming abstraction: Hadoop takes care of all the messy details related to distributed computing. Using a high-level API, users can focus on implementing business logic that solves their real-world problems.
- Data locality: don't move large datasets to where the application is running, but run the application where the data already is.

8 Hadoop EcoSystem
[Diagram: layered ecosystem stack. Workflow: Azkaban, Hue, Oozie. Analysis: Pig, Scalding, Mahout, Hive, Impala. Processing and storage: MapReduce and HBase on top of HDFS, which sits on the nodes' local disks.]

9 Hadoop EcoSystem - Core parts
- HDFS: the Hadoop Distributed File System (HDFS) is a distributed file system designed to run on a cluster of commodity machines. HDFS is highly fault tolerant and is useful for processing large data sets.
- MapReduce: MapReduce is a software framework for processing large data sets, up to petabyte scale, on a cluster of commodity hardware. When a MapReduce job is run, Hadoop splits the input and locates the nodes in the cluster that hold it. The actual tasks are then run at or close to the node where the data resides, so that the data is as close to the computation as possible. This avoids transferring huge amounts of data across the network, so that the network does not become a bottleneck or get flooded.

10 Hadoop EcoSystem - Distributions
- Apache: purely open source distribution of Hadoop maintained by the community at the Apache Software Foundation.
- Cloudera: Cloudera's distribution of Hadoop, built on top of Apache Hadoop. The distribution includes capabilities such as management, security, high availability and integration with a wide variety of hardware and software solutions. Cloudera is the leading distributor of Hadoop.
- Hortonworks: also builds on the open source Apache Hadoop, with claims to enterprise readiness. It also claims to be the only distribution that is available for Windows servers.
- MapR: Hadoop distribution with some unique features, most notably the ability to mount the Hadoop cluster over NFS.
- Amazon EMR: Amazon's hosted version of MapReduce is called Elastic MapReduce. It is part of Amazon Web Services (AWS). EMR allows a Hadoop cluster to be deployed and MapReduce jobs to be run in the AWS cloud with just a few clicks.

11 Hadoop EcoSystem - Related Projects
- Pig: a high-level language for analyzing large data sets that eases the development of MapReduce jobs. What would take hundreds of lines of code can be written in just a few lines of Pig. At Yahoo, more than 60% of Hadoop usage is on Pig.
- Hive: a data warehouse framework that supports querying of large data sets stored in Hadoop. Hive provides a high-level SQL-like language called HiveQL.
- HBase: a distributed, scalable data store based on Hadoop. HBase is a distributed, versioned, column-oriented database modeled after Google's BigTable.
- Mahout: a scalable machine learning library. Mahout utilizes Hadoop to achieve massive scalability.
- YARN: the next generation of MapReduce, a.k.a. MapReduce 2. The MapReduce framework was overhauled using YARN to overcome the scalability bottlenecks in earlier versions of MapReduce when run over a very large cluster (thousands of nodes).
- Oozie: a workflow scheduler system that eases the creation and management of sequences of MapReduce jobs.
- Flume: a distributed, reliable and available service for collecting, aggregating and moving log data to HDFS. This is typically useful in systems where log data needs to be moved to HDFS periodically for processing.

12 Hadoop Basics - HDFS
- HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
- HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
- HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.
- HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
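
The first two goals above (spreading data across machines, and staying available when a machine fails) can be illustrated with a toy model. This is a minimal sketch, not HDFS code: the 128 MB block size and replication factor of 3 are assumed Hadoop 2.x defaults that the slide does not state, the round-robin placement is a simplification of the real rack-aware policy, and the function and node names are hypothetical.

```python
BLOCK_SIZE_MB = 128   # assumption: Hadoop 2.x default block size
REPLICATION = 3       # assumption: default replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes, round-robin.

    A simplified stand-in for the real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [
        block
        for block, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


# Hypothetical five-node cluster storing one 300 MB file.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), nodes)

print("block sizes (MB):", blocks)
for block, holders in placement.items():
    print(f"block {block} -> {holders}")
print("blocks readable after node2 fails:", surviving_blocks(placement, "node2"))
```

Because every block lives on three different nodes in this model, losing any single node leaves all blocks readable, which is the reliability property the slide describes.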

13 Hadoop Basics - HDFS (Cont.)

14 Hadoop Basics - MapReduce
Mapping and reducing tasks run on nodes where individual records of data are already present.

15 Example: Word Count
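
The slide's word count figure did not survive transcription. As a stand-in, the sketch below simulates the three MapReduce stages (map, shuffle, reduce) for word count inside a single Python process. It does not use the Hadoop API; the function names and the sample input are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Illustrative input; on a real cluster each line could live on a different node.
sample = ["the quick brown fox", "the lazy dog", "The end"]
result = word_count(sample)
for word in sorted(result):
    print(word, result[word])
```

On a cluster, the map calls would run in parallel on the nodes that hold each input split, and the shuffle would move the grouped pairs across the network to the reducers.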

16 Basic HDFS Commands
- Create a directory in HDFS:
    hadoop fs -mkdir <paths>
    (e.g. hadoop fs -mkdir /user/hadoop/dir1)
- List files:
    hadoop fs -ls <args>
    (e.g. hadoop fs -ls /user/hadoop/dir1)
- Upload data from the local system to HDFS:
    hadoop fs -put <localsrc>... <HDFS_dest_Path>
    (e.g. hadoop fs -put ~/foo.txt /user/hadoop/dir1/foo.txt)
- Download a file from HDFS:
    hadoop fs -get <hdfs_src> <localdst>
    (e.g. hadoop fs -get /user/hadoop/dir1/foo.txt /home/)
- Check space utilization in an HDFS directory:
    hadoop fs -du URI
    (e.g. hadoop fs -du /user/hadoop)
- Get help:
    hadoop fs -help

17 Case Study: PACE Hadoop cluster
- Consists of 5 x 24-core Altus 2704 servers, each with 128 GB of RAM
- Each node has a 3 TB local disk RAID (6 x 500 GB)
- All nodes are connected via a 40 GB/sec IB connection
- One node, named hadoop-nn1, serves as NameNode + JobTracker + DataNode
- The remaining four nodes serve as DataNodes: hadoop-dn1, hadoop-dn2, hadoop-dn3, hadoop-dn4
- Check the cluster status and running jobs at:
    Overview: http://<namenode Fully Qualified Name>:50070
    JobTracker: http://<namenode Fully Qualified Name>:
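
The hardware figures on this slide imply the cluster's aggregate resources. The sketch below works them out under two assumptions that the slide does not state: the RAID set exposes the full 6 x 500 GB with no parity overhead (consistent with the quoted 3 TB), and HDFS uses a replication factor of 3. All five nodes are counted as DataNodes because hadoop-nn1 also serves as one.

```python
# Figures taken from the slide.
NODES = 5
CORES_PER_NODE = 24
RAM_PER_NODE_GB = 128
DISKS_PER_NODE = 6
DISK_SIZE_GB = 500

# Assumption: default HDFS replication factor, not stated on the slide.
REPLICATION = 3

# Per-node local disk, assuming no RAID parity overhead (matches the quoted 3 TB).
disk_per_node_tb = DISKS_PER_NODE * DISK_SIZE_GB / 1000

# Raw HDFS capacity across all DataNodes, and what remains after replication.
raw_hdfs_tb = NODES * disk_per_node_tb
usable_hdfs_tb = raw_hdfs_tb / REPLICATION

total_cores = NODES * CORES_PER_NODE
total_ram_gb = NODES * RAM_PER_NODE_GB

print(f"local disk per node: {disk_per_node_tb} TB")
print(f"raw HDFS capacity:   {raw_hdfs_tb} TB")
print(f"usable at {REPLICATION}x replication: {usable_hdfs_tb} TB")
print(f"total cores: {total_cores}, total RAM: {total_ram_gb} GB")
```

Under these assumptions the cluster offers 15 TB of raw HDFS space, about 5 TB of which is available for unique data. Space reserved for logs, temporary files and the operating system would reduce that further.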

18 Installing Hadoop on academic cluster
- Download a release from the official website, http://hadoop.apache.org/ (the most recent release is dated April 21).
- Put the binaries on NFS so that all nodes can access them.
- Add a local disk to each node to serve as the HDFS file system.
- Configure the nodes as name nodes and data nodes.

19 Configuring HDFS
All configuration files are in ${HADOOP_PREFIX}/etc/hadoop (e.g. /usr/local/packages/hadoop/2.6/etc/hadoop)
In core-site.xml:
    Key              Value                        Example
    fs.defaultFS     protocol://servername:port   hdfs:// :9000
    hadoop.tmp.dir   pathname                     /dfs/hadoop/tmp
In hdfs-site.xml:
    Key              Value                        Example
    dfs.name.dir     pathname                     /dfs/hadoop/name
    dfs.data.dir     pathname                     /dfs/hadoop/data
    Number of
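
Hadoop reads these key/value pairs from XML files made of `<property>` elements inside a `<configuration>` root. The sketch below shows one way the tables above could be rendered into that layout. The helper `render_site_xml` is hypothetical, and `NAMENODE_HOST` is a placeholder because the slide's example omits the server name.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop settings as a <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Keys and paths from the slide; NAMENODE_HOST is a placeholder, not a real host.
core_site = {
    "fs.defaultFS": "hdfs://NAMENODE_HOST:9000",
    "hadoop.tmp.dir": "/dfs/hadoop/tmp",
}
hdfs_site = {
    "dfs.name.dir": "/dfs/hadoop/name",
    "dfs.data.dir": "/dfs/hadoop/data",
}

core_xml = render_site_xml(core_site)
hdfs_xml = render_site_xml(hdfs_site)
print(core_xml)
print(hdfs_xml)
```

Generating the files from one dictionary per file keeps the settings in a single place, which can help when the same configuration has to be copied to every node.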

20 Configuring Yarn
In ${HADOOP_PREFIX}/yarn-site.xml:
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>hdfs://<hostname>:8025</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>hdfs://<hostname>:8030</value>
    </property>
    <property>
      <name>yarn.resourcemanager.address</name>
      <value>hdfs://<hostname>:8040</value>
    </property>

21 Configure Slaves file
File: ${HADOOP_HOME}/etc/hadoop/slaves
Contents of the file on each node:
    hadoop-nn1: hadoop-nn1 hadoop-dn1 hadoop-dn2 hadoop-dn3 hadoop-dn4
    hadoop-dn1: hadoop-dn1
    hadoop-dn2: hadoop-dn2
    hadoop-dn3: hadoop-dn3
    hadoop-dn4: hadoop-dn4
[Diagram: NameNode hadoop-nn1 with DataNodes hadoop-dn1, hadoop-dn2, hadoop-dn3, hadoop-dn4]

22 Configure environment
- Add the following two lines to ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh:
    export JAVA_HOME=/usr/local/packages/java/1.7.0
    export HADOOP_LOG_DIR=/nv/ap2/logs/hadoop
- Add the following line to ${HADOOP_PREFIX}/etc/hadoop/yarn-env.sh:
    export YARN_LOG_DIR=/nv/ap2/logs/hadoop
- Add the following line to ${HADOOP_PREFIX}/etc/hadoop/mapred-env.sh:
    export HADOOP_MAPRED_LOG_DIR=/nv/ap2/logs/hadoop

23 Starting HDFS
- Create a user named hadoop, and start all Hadoop services as the hadoop user to avoid security issues.
- Format the file system based on the above configuration:
    ${HADOOP_HOME}/bin/hdfs namenode -format <cluster_name>
- Start HDFS:
    start-dfs.sh
- Make sure it is running correctly:
    ps -ef | grep -i hadoop
- Start YARN:
    start-yarn.sh
- Make sure it is running correctly:
    ps -ef | grep -i resourcemanager

24 User management on Hadoop Cluster
- Log in as the hadoop user:
    ssh nn1
- Create a directory for the given user:
    hadoop fs -mkdir /user/userid
- Set the directory ownership to the given user:
    hadoop fs -chown -R userid:groupid /user/userid
- Change the permissions to user only:
    hadoop fs -chmod -R 700 /user/userid
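
When several accounts need to be set up, the three commands above can be generated rather than typed by hand. The sketch below only builds the command lines and prints them; it does not contact a cluster. The helper name `provisioning_commands` and the example user and group are hypothetical.

```python
import shlex


def provisioning_commands(userid, groupid, mode="700"):
    """Build the mkdir, chown and chmod commands for one user's HDFS home."""
    home = f"/user/{userid}"
    return [
        ["hadoop", "fs", "-mkdir", home],
        ["hadoop", "fs", "-chown", "-R", f"{userid}:{groupid}", home],
        ["hadoop", "fs", "-chmod", "-R", mode, home],
    ]


def as_shell(command):
    """Quote an argument list so it can be pasted into a shell safely."""
    return " ".join(shlex.quote(arg) for arg in command)


# Hypothetical accounts; replace with real user and group IDs.
accounts = [("alice", "research"), ("bob", "research")]
for userid, groupid in accounts:
    for command in provisioning_commands(userid, groupid):
        print(as_shell(command))
```

Each inner list could also be passed to `subprocess.run` on the NameNode while logged in as the hadoop user; printing first gives a chance to review the commands before anything is changed.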

25 User runs a job
- The user logs in to the Hadoop NameNode and submits the job from there:
    ssh userid@hadoop-nn1
- Load the Hadoop environment variables, such as Java, Python, HADOOP_PREFIX, HADOOP_YARN_HOME:
    module load hadoop/2.6.0
- Upload the input file from the local file system to Hadoop:
    hadoop fs -put example.txt /user/userid
- Run wordcount on example.txt:
    hadoop jar /usr/local/packages/hadoop/2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples jar wordcount /user/userid/ /user/userid/testout
- Check the result:
    hadoop fs -cat /user/userid/testout > output

26 Stopping HDFS
- Stop the resource manager:
    stop-yarn.sh
- Stop the name nodes and data nodes:
    stop-dfs.sh
Note: this may require a manual shutdown on each individual data node.

27 Hadoop Cluster overview

28 Hadoop DataNodes

29 Hadoop Job Tracker History

30 References:
- Yahoo Hadoop Tutorial, https://developer.yahoo.com/hadoop/tutorial/index.html
- Introduction to Data Science, Bill Howe, University of Washington