Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer

1 Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer Glenn Lockwood and Mahidhar Tatineni User Services Group San Diego Supercomputer Center XSEDE14, Atlanta July 14, 2014

2 Overview Hadoop framework extensively used for scalable distributed processing of large datasets. Hadoop is built to process data on the order of several hundred gigabytes to several terabytes (and even petabytes at the extreme end). Data sizes are much bigger than the capacities (both disk and memory) of individual nodes. Under the Hadoop Distributed Filesystem (HDFS), data is split into chunks which are managed by different nodes. Data chunks are replicated across several machines to provide redundancy in case of an individual node failure. Processing must conform to the MapReduce programming model. Processes are scheduled close to the location of the data chunks being accessed.
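As a rough illustration of the chunking and replication idea (a conceptual sketch only, not Hadoop's actual placement policy; the node names, block size, and file size are made up for the example):

import itertools

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB, the HDFS default discussed later
REPLICATION = 3                        # default HDFS replication factor
nodes = ["node1", "node2", "node3", "node4"]

def place_blocks(file_size_bytes):
    """Split a file into fixed-size blocks and pick 3 nodes to hold each block."""
    n_blocks = -(-file_size_bytes // BLOCK_SIZE)       # ceiling division
    node_cycle = itertools.cycle(nodes)
    return {block: [next(node_cycle) for _ in range(REPLICATION)]
            for block in range(n_blocks)}

# A 200 MB file becomes 4 blocks, each stored on 3 different nodes.
print(place_blocks(200 * 1024 * 1024))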

3 Hadoop: Application Areas Hadoop is widely used in data intensive analysis. Some application areas include:
- Log aggregation and processing
- Video and image analysis
- Data mining, machine learning
- Indexing
- Recommendation systems
Data intensive scientific applications can make use of the Hadoop MapReduce framework. Application areas include:
- Bioinformatics and computational biology
- Astronomical image processing
- Natural language processing
- Geospatial data processing
Some example projects:
- Genetic algorithms, particle swarm optimization, ant colony optimization
- Big data for business analytics (class)
- Hadoop for remote sensing analysis
Extensive list online at:

4 Hadoop Architecture [Diagram: a master node (Jobtracker, Namenode) connected through the network/switching fabric to compute/datanodes, each running a Tasktracker.] Map/Reduce framework: software to enable distributed computation; the Jobtracker schedules and manages map/reduce tasks, and the Tasktracker executes the tasks on the nodes. HDFS distributed filesystem: metadata is handled by the Namenode; files are split up and stored on datanodes (typically local disk). Scalable and fault tolerant. Replication is done asynchronously.

5 Hadoop Workflow

6 Map Reduce Basics The MapReduce algorithm is a scalable (thousands of nodes) approach to processing large volumes of data. The most important aspect is to restrict arbitrary sharing of data. This minimizes the communication overhead which typically constrains scalability. The MapReduce algorithm has two components: mapping and reducing. The first step takes each individual input element and maps it to an output data element using some function. The second, reducing, step aggregates the values from step 1 using a reducer function. In the Hadoop framework both functions receive (key, value) pairs and output (key, value) pairs. A simple word count example illustrates this process in the following slides.
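The data flow can be sketched in a few lines of plain Python (a single-process illustration only; Hadoop distributes each phase across nodes). It uses the same two input lines as the wordcount example on the next slides:

from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word."""
    for word in text.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group all values belonging to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    for key, values in groups:
        yield (key, sum(values))

pairs = list(map_phase("Hello World Bye World")) + list(map_phase("Hello Hadoop Goodbye Hadoop"))
print(list(reduce_phase(shuffle(pairs))))
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]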

7 Simple Example From Apache Site* Simple wordcount example. Code details: Functions defined - Wordcount map class : reads file and isolates each word - Reduce class : counts words and sums up Call sequence - Mapper class - Combiner class (Reduce locally) - Reduce class - Output *

8 Simple Example From Apache Site* Simple wordcount example. Two input files. File 1 contains: Hello World Bye World File 2 contains: Hello Hadoop Goodbye Hadoop Assuming we use two map tasks (one for each file). Step 1: Map read/parse tasks are complete. Result: Task 1 <Hello, 1> <World, 1> <Bye, 1> <World, 1> Task 2 < Hello, 1> < Hadoop, 1> < Goodbye, 1> <Hadoop, 1> *

9 Simple Example (Contd) Step 2 : Combine on each node, sorted: Task 1 <Bye, 1> <Hello, 1> <World, 2> Task 2 < Goodbye, 1> < Hadoop, 2> <Hello, 1> Step 3 : Global reduce: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

10 Map/Reduce Execution Process Components Input / Map () / Shuffle / Sort / Reduce () / Output Jobtracker determines number of splits (configurable). Jobtracker selects compute nodes for tasks based on network proximity to data sources. Tasktracker on each compute node manages the tasks assigned and reports back to jobtracker when task is complete. As map tasks complete jobtracker notifies selected task trackers for reduce phase. Job is completed once reduce phase is complete.

11 Hadoop MapReduce Data pipeline Ref: "Hadoop Tutorial from Yahoo!" by Yahoo! Inc. is licensed under a Creative Commons Attribution 3.0 Unported License

12 HDFS Architecture [Diagram: HDFS clients performing writes and reads; the NameNode holds the metadata (name, replicas), e.g. /home/user1/file1, 3 and /home/user1/file2, 3; blocks are stored and replicated across the DataNodes.]

13 HDFS Overview HDFS is a block-structured filesystem. Files are split up into blocks of a fixed size (configurable) and stored across the cluster on several nodes. The metadata for the filesystem is stored on the NameNode. Typically the metadata info is cached in the NameNode s memory for fast access. The NameNode provides the list of locations (DataNodes) of blocks for a file and the clients can read directly from the location (without going through the NameNode). The HDFS namespace is completely separate from the local filesystems on the nodes. HDFS has its own tools to list, copy, move, and remove files.

14 HDFS Performance Envelope HDFS is optimized for large sequential streaming reads. The default block size is 64MB (in comparison, typical block-structured filesystems use 4-8KB, and Lustre uses ~1MB). The default replication is 3. Typical HDFS applications have a write-once-read-many access model. This is taken into account to simplify the data coherency model used. The large block size also means poor random seek times and poor performance with small reads/writes. Additionally, given the large file sizes, there is no mechanism for local caching (it is faster to re-read the data from HDFS). Given these constraints, HDFS should not be used as a general-purpose distributed filesystem for a wide application base.
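To make the block-size discussion concrete, a quick back-of-the-envelope calculation (the 10 GB file size is just an illustrative assumption) shows how the block size determines the number of blocks the NameNode must track:

GiB = 1024 ** 3

def n_blocks(file_size, block_size):
    """Number of blocks needed to store a file (ceiling division)."""
    return -(-file_size // block_size)

for block_size, label in [(4 * 1024, "4 KB"), (1024 * 1024, "1 MB"), (64 * 1024 * 1024, "64 MB")]:
    print("%6s blocks for a 10 GB file: %s" % (label, n_blocks(10 * GiB, block_size)))
# 4 KB: 2,621,440 blocks; 1 MB: 10,240 blocks; 64 MB: 160 blocks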

15 HDFS Configuration Configuration files are in the main hadoop config directory. The file core-site.xml or hdfs-site.xml can be used to change settings. The config directory can either be replicated across a cluster or placed in a shared filesystem (can be an issue if the cluster gets big). Configuration settings are a set of key-value pairs: <property> <name>property-name</name> <value>property-value</value> </property> Adding the line <final>true</final> prevents user override.

16 HDFS Configuration Parameters
fs.default.name - The address which describes the NameNode for the cluster. For example: hdfs://gcn-4-31.ibnet0:54310
dfs.data.dir - The path on the local filesystem in which the DataNode stores data. Defaults to the hadoop tmp location.
dfs.name.dir - Where the NameNode metadata is stored. Again defaults to the hadoop tmp location.
dfs.replication - Default replication for each block of data. Default value is 3.
dfs.blocksize - Default block size. Default value is 64MB.
Lots of other options. See the Apache site:

17 HDFS: Listing Files Reminder: Local filesystem commands don't work on HDFS. A simple ls will just end up listing your local files. Need to use HDFS commands. For example:
$ hadoop dfs -ls /
Warning: $HADOOP_HOME is deprecated.
Found 1 items
drwxr-xr-x - mahidhar supergroup :00 /scratch

18 Copying files into HDFS First make some directories in HDFS:
$ hadoop dfs -mkdir /user
$ hadoop dfs -mkdir /user/mahidhar
Do a local listing (in this case we have a large 50+ GB file):
$ ls -lt
total
-rw-r--r-- 1 mahidhar hpss Nov 6 14:48 test.file
Copy it into HDFS:
$ hadoop dfs -put test.file /user/mahidhar/
$ hadoop dfs -ls /user/mahidhar/
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r-- 3 mahidhar supergroup :10 /user/mahidhar/test.file

19 DFS Command List
-ls path : Lists contents of directory
-lsr path : Recursive display of contents
-du path : Shows disk usage in bytes
-dus path : Summary of disk usage
-mv src dest : Move files or directories within HDFS
-cp src dest : Copy files or directories within HDFS
-rm path : Removes the file or empty directory in HDFS
-rmr path : Recursively removes file or directory
-put localsrc dest (also -copyFromLocal) : Copy file from local filesystem into HDFS
-get src localdest : Copy from HDFS to local filesystem
-getmerge src localdest : Copy from HDFS and merge into a single local file
-cat filename : Display contents of HDFS file
-tail file : Shows the last 1KB of HDFS file on stdout
-chmod [-R] : Change file permissions in HDFS
-chown [-R] : Change ownership in HDFS
-chgrp [-R] : Change group in HDFS
-help : Returns usage info

20 HDFS dfsadmin options
hadoop dfsadmin -report : Generate status report for HDFS
hadoop dfsadmin -metasave filename : Metadata info is saved to file. Cannot be used for restore.
hadoop dfsadmin -safemode options : Move HDFS to read-only mode
hadoop dfsadmin -refreshNodes : Change HDFS membership. Useful if nodes are being removed from the cluster.
Upgrade options : Status/finalization
hadoop dfsadmin -help : Help info

21 HDFS fsck
$ hadoop fsck -blocks /user/mahidhar/test.file.4gb
Warning: $HADOOP_HOME is deprecated.
FSCK started by mahidhar from / for path /user/mahidhar/test.file.4gb at Wed Nov 06 19:32:30 PST 2013
Status: HEALTHY
Total size: B
Total dirs: 0
Total files: 1
Total blocks (validated): 64 (avg. block size B)
Minimally replicated blocks: 64 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Wed Nov 06 19:32:30 PST 2013 in 36 milliseconds
The filesystem under path '/user/mahidhar/test.file.4gb' is HEALTHY

22 HDFS Distributed Copy Useful to migrate large numbers of files. Syntax: hadoop distcp src dest Sources/destinations can be: S3 URLs, HDFS URLs, a filesystem [useful if it's a parallel filesystem like Lustre]

23 HDFS Architecture: Summary HDFS optimized for large block I/O. Replication(3) turned on by default. HDFS specific commands to manage files and directories. Local filesystem commands will not work in HDFS. HDFS parameters set in xml configuration files. DFS parameters can also be passed with commands. fsck function available for verification. Block rebalancing function available. Java based HDFS API can be used for access within applications. Library available for functionality in C/C++ applications.

24 Useful Links Yahoo tutorial on HDFS: Apache HDFS guide: Brad Hedlund's article on Hadoop architecture:

25 Hadoop and HPC PROBLEM: domain scientists/researchers aren't using Hadoop
- Hadoop is commonly used in the data analytics industry but not so common in domain science academic areas.
- Java is not always high-performance.
- Hurdles for domain scientists to learn Java and Hadoop tools.
SOLUTION: make Hadoop easier for HPC users
- use existing HPC clusters and software
- use Perl/Python/C/C++/Fortran instead of Java
- make starting Hadoop as easy as possible

26 Compute: Traditional vs. Data-Intensive Traditional HPC: CPU-bound problems. Solution: OpenMP- and MPI-based parallelism. Data-Intensive: IO-bound problems. Solution: Map/reduce-based parallelism.

27 Architecture for Both Workloads PROs: high-speed interconnect; complementary object storage; fast CPUs, RAM; less faulty. CONs: nodes aren't storage-rich; transferring data between HDFS and object storage* (* unless using Lustre, S3, etc. backends)

28 Add Data Analysis to Existing Compute Infrastructure

29 Add Data Analysis to Existing Compute Infrastructure

30 Add Data Analysis to Existing Compute Infrastructure

31 Add Data Analysis to Existing Compute Infrastructure

32 myhadoop 3-step Install (not reqd. for today's workshop)
1. Download Apache Hadoop 1.x and myhadoop 0.30
$ wget hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
$ wget myhadoop-0.30.tar.gz
2. Unpack both Hadoop and myhadoop
$ tar zxvf hadoop-1.2.1-bin.tar.gz
$ tar zxvf myhadoop-0.30.tar.gz
3. Apply the myhadoop patch to Hadoop
$ cd hadoop-1.2.1/conf
$ patch < ../myhadoop-0.30/myhadoop-1.2.1.patch

33 myhadoop 3-step Cluster
1. Set a few environment variables
# sets HADOOP_HOME, JAVA_HOME, and PATH
$ module load hadoop
$ export HADOOP_CONF_DIR=$HOME/mycluster.conf
2. Run myhadoop-configure.sh to set up Hadoop
$ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID
3. Start cluster with Hadoop's start-all.sh
$ start-all.sh

34 Advanced Features - Usability System-wide default configurations in myhadoop-0.30/conf/myhadoop.conf: MH_SCRATCH_DIR - specify location of node-local storage for all users; MH_IPOIB_TRANSFORM - specify a regex to transform node hostnames into IP over InfiniBand hostnames. Users can remain totally ignorant of scratch disks and InfiniBand: literally define HADOOP_CONF_DIR and run myhadoop-configure.sh with no parameters; myhadoop figures out everything else.
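For example, what a transform like MH_IPOIB_TRANSFORM='s/$/.ibnet0/' does can be reproduced in a couple of lines (myhadoop itself applies it with sed; the hostnames here are made-up placeholders):

import re

hostnames = ["gcn-13-60", "gcn-13-61"]          # example compute-node names
ipoib = [re.sub(r"$", ".ibnet0", h, count=1) for h in hostnames]
print(ipoib)                                    # ['gcn-13-60.ibnet0', 'gcn-13-61.ibnet0']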

35 Advanced Features - Usability Parallel filesystem support: HDFS on Lustre via myhadoop persistent mode (-p); direct Lustre support (IDH); no performance loss at smaller scales for HDFS on Lustre. Resource managers supported in a unified framework: Torque 2.x and 4.x (tested on SDSC Gordon), SLURM 2.6 (tested on TACC Stampede), Grid Engine. Can support LSF, PBSpro, Condor easily (need testbeds).

36 Hadoop-RDMA Network-Based Computing Lab (Ohio State University) HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE). Based on Apache Hadoop Can use myhadoop to partially set up configuration. Need to add to some of the configuration files. Location on Gordon: /home/diag/opt/hadoop-rdma/hadoop-rdma More details :

37 Data-Intensive Performance Scaling 8-node Hadoop cluster on Gordon 8 GB VCF (Variant Calling Format) file 9x speedup using 2 mappers/node

38 Data-Intensive Performance Scaling 7.7 GB of chemical evolution data, 270 MB/s processing rate

39 Hadoop Installation & Configuration USING GORDON

40 Logging into Gordon Mac/Linux: ssh <username>@gordon.sdsc.edu Windows (PuTTY): connect to gordon.sdsc.edu

41 Example Locations Today s examples are located in the training account home directories: ~/XSEDE14 Following subdirectories should be visible: ls ~/XSEDE14 ANAGRAM GUTENBERG mahout Pig2 streaming

42 What is Gordon?
1024 compute nodes, 64 IO nodes
16 cores of Intel Xeon E5 per node, DDR3 DRAM
300 GB enterprise SSD
QDR InfiniBand: 40 Gbps QDR 4X (cf. 10 Gb Ethernet), two rails, 3D torus topology
Torque batch system

43 Request Nodes with Torque
Rocks 6.1 (Emerald Boa) Profile built 13: Aug Kickstarted 13: Aug [ASCII-art login banner omitted]
gordon-ln1:~$ qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal
qsub: waiting for job gordon-fe2.local to start
qsub: job gordon-fe2.local ready
gcn-3-15:~$

44 Request Nodes with Torque
qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal -v Catalina_maxhops=None,QOS=0
-I : request Interactive access to your nodes
-l nodes=2 : request two nodes
ppn=16 : request 16 processors per node
native : request native nodes (not vSMP VMs)
flash : request nodes with SSDs
walltime=4:00:00 : reserve these nodes for four hours
-q normal : not requesting any special queues
Use the idev alias so you don't have to memorize this!

45 Hadoop Installation & Configuration GENERAL INSTALLATION AND CONFIGURATION

46 Installing Hadoop on Linux
1. Download Hadoop tarball from Apache
$ wget hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
...
Length: (36M) [application/x-gzip]
Saving to: hadoop-1.2.1-bin.tar.gz.2
85% [================================> ] 32,631, M/s eta 2s
2. Unpack tarball
$ tar zxvf hadoop-1.2.1-bin.tar.gz
hadoop-1.2.1/...
hadoop-1.2.1/src/contrib/ec2/bin/list-hadoop-clusters
hadoop-1.2.1/src/contrib/ec2/bin/terminate-hadoop-cluster

47 Configure Your First Hadoop Cluster (on Gordon) Only necessary once per qsub gcn- 3-15:~$ myhadoop- configure.sh myhadoop: Keeping HADOOP_HOME=/home/diag/opt/hadoop/hadoop from user environment myhadoop: Setting MH_IPOIB_TRANSFORM='s/$/.ibnet0/' from myhadoop.conf myhadoop: Setting MH_SCRATCH_DIR=/scratch/$USER/$PBS_JOBID from myhadoop.conf myhadoop: Keeping HADOOP_CONF_DIR=/home/glock/mycluster from user environment myhadoop: Using HADOOP_HOME=/home/diag/opt/hadoop/hadoop myhadoop: Using MH_SCRATCH_DIR=/scratch/glock/ gordon- fe2.local myhadoop: Using JAVA_HOME=/usr/java/latest myhadoop: Generating Hadoop configuration in directory in /home/glock/mycluster... myhadoop: Designating gcn ibnet0 as master node myhadoop: The following nodes will be slaves: gcn ibnet0 gcn ibnet0... /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at gcn sdsc.edu/ ************************************************************/

48 myhadoop Interface between the batch system (qsub) and Hadoop. myhadoop-configure.sh is not a part of Hadoop: it gets node names from the batch environment, configures InfiniBand support, generates Hadoop config files, and formats the namenode so HDFS is ready to go.

49 Configuring Hadoop $HADOOP_CONF_DIR is the cluster: all of your cluster's global configurations, all of your cluster's nodes, all of your cluster's node configurations** echo $HADOOP_CONF_DIR to see your cluster's configuration location
$ cd $HADOOP_CONF_DIR
$ ls
capacity-scheduler.xml hadoop-policy.xml myhadoop.conf
configuration.xsl hdfs-site.xml slaves
core-site.xml log4j.properties ssl-client.xml.example
fair-scheduler.xml mapred-queue-acls.xml ssl-server.xml.example
hadoop-env.sh mapred-site.xml taskcontroller.cfg
hadoop-metrics2.properties masters

50 core-site.xml <property> <name>hadoop.tmp.dir</name> <value>/scratch/glock/ gordon- fe2.local/tmp</value> <description>a base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://gcn ibnet0:54310</value> <description>the name of the default file system. </description> </property>

51 core-site.xml <property> <name>io.file.buffer.size</name> <value>131072</value> <description>the size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations. </description> </property> More parameters and their default values:

52 hdfs-site.xml <property> <name>dfs.name.dir</name> <value>/scratch/glock/ gordon- fe2.local/namenode_data </value> <final>true</final> </property> <property> <name>dfs.data.dir</name> <value>/scratch/glock/ gordon- fe2.local/hdfs_data</value> <final>true</final> </property>

53 hdfs-site.xml <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.block.size</name> <value> </value> </property> More parameters and their default values:

54 mapred-site.xml <property> <name>mapred.job.tracker</name> <value>gcn ibnet0:54311</value> <description>the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in- process as a single map and reduce task. </description> </property> <property> <name>mapred.local.dir</name> <value>/scratch/glock/ gordon- fe2.local/mapred_scratch </value> <final>true</final> </property>

55 mapred-site.xml <property> <name>io.sort.mb</name> <value>650</value> <description>higher memory- limit while sorting data.</description> </property> <property> <name>mapred.map.tasks</name> <value>4</value> <description>the default number of map tasks per job.</description> </property> <property> <name>mapred.reduce.tasks</name> <value>4</value> <description>the default number of reduce tasks per job. </description> </property>

56 mapred-site.xml <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> <description>the maximum number of map tasks that will be run simultaneously by a task tracker.</description> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>8</value> <description>the maximum number of reduce tasks that will be run simultaneously by a task tracker.</description> </property> More parameters and their default values:

57 hadoop-env.sh Establishes environment variables for all Hadoop components Essentials: HADOOP_LOG_DIR location of Hadoop logs HADOOP_PID_DIR location of Hadoop PID files JAVA_HOME location of Java that Hadoop should use Other common additions LD_LIBRARY_PATH - for mappers/reducers HADOOP_CLASSPATH - for mappers/reducers _JAVA_OPTIONS - to hack global Java options

58 Hadoop Installation & Configuration HANDS-ON CONFIGURATION

59 masters / slaves Used by the start-all.sh and stop-all.sh Hadoop control scripts
$ cat masters
gcn ibnet0
$ cat slaves
gcn ibnet0
gcn ibnet0
masters - secondary namenodes
slaves - datanodes + tasktrackers
namenode and jobtracker are launched on localhost

60 Launch Your First Hadoop Cluster (on Gordon)
$ start-all.sh
starting namenode, logging to /scratch/glock/ gordon-fe2.local/logs/hadoop-gloc
gcn ibnet0: starting datanode, logging to /scratch/glock/ gordon-fe2.loca
gcn ibnet0: starting datanode, logging to /scratch/glock/ gordon-fe2.loca
gcn ibnet0: starting secondarynamenode, logging to /scratch/glock/ gordon
starting jobtracker, logging to /scratch/glock/ gordon-fe2.local/logs/hadoop-gl
gcn ibnet0: starting tasktracker, logging to /scratch/glock/ gordon-fe2.l
gcn ibnet0: starting tasktracker, logging to /scratch/glock/ gordon-fe2.l
Verify that namenode and datanodes came up:
$ hadoop dfsadmin -report
Configured Capacity: ( GB)
Present Capacity: ( GB)
DFS Remaining: ( GB)
DFS Used: (16.03 KB)
DFS Used%: 0%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
...

61 Try it Yourself: Change Replication
1. Shut down the cluster to change its configuration
$ stop-all.sh
gcn ibnet0: stopping tasktracker
gcn ibnet0: stopping tasktracker
stopping namenode
gcn ibnet0: stopping datanode
gcn ibnet0: stopping datanode
gcn ibnet0: stopping secondarynamenode
2. Change the dfs.replication setting in hdfs-site.xml by adding the following:
<property> <name>dfs.replication</name> <value>2</value> </property>
3. Restart (start-all.sh) and re-check

62 Try it Yourself: Change Replication Are the under-replicated blocks gone? What does this mean for actual runtime behavior?

63 Run Some Simple Commands Get a small set of data (a bunch of classic texts) and copy it into HDFS:
$ ls -lh gutenberg.txt
-rw-r--r-- 1 glock sdsc 407M Mar 16 14:09 gutenberg.txt
$ hadoop dfs -mkdir data
$ hadoop dfs -put gutenberg.txt data/
$ hadoop dfs -ls data
Found 1 items
-rw-r--r-- 3 glock supergroup :21 /user/glock/data/gutenberg.txt

64 Run Your First Map/Reduce Job
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount data wordcount-output
14/03/16 14:24:18 INFO input.FileInputFormat: Total input paths to process : 1
14/03/16 14:24:18 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/16 14:24:18 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/16 14:24:18 INFO mapred.JobClient: Running job: job_ _
14/03/16 14:24:19 INFO mapred.JobClient: map 0% reduce 0%
14/03/16 14:24:35 INFO mapred.JobClient: map 15% reduce 0%
...
14/03/16 14:25:23 INFO mapred.JobClient: map 100% reduce 28%
14/03/16 14:25:32 INFO mapred.JobClient: map 100% reduce 100%
14/03/16 14:25:37 INFO mapred.JobClient: Job complete: job_ _
14/03/16 14:25:37 INFO mapred.JobClient: Counters: 29
14/03/16 14:25:37 INFO mapred.JobClient: Job Counters
14/03/16 14:25:37 INFO mapred.JobClient: Launched reduce tasks=
14/03/16 14:25:37 INFO mapred.JobClient: Launched map tasks=7
14/03/16 14:25:37 INFO mapred.JobClient: Data-local map tasks=7

65 Running Hadoop Jobs
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount data wordcount-output
hadoop jar : launch a map/reduce job
$HADOOP_HOME/hadoop-examples-1.2.1.jar : map/reduce application jar
wordcount : command-line argument for the jarfile
data : location of input data (file or directory)
wordcount-output : location to dump job output

66 Check Job Output
$ hadoop dfs -ls wordcount-output
Found 3 items
... glock supergroup ... /user/glock/wordcount-output/_SUCCESS
... glock supergroup ... /user/glock/wordcount-output/_logs
... glock supergroup ... /user/glock/wordcount-output/part-r-00000
... glock supergroup ... /user/glock/wordcount-output/part-r-00001
_SUCCESS : if it exists, the job finished successfully
_logs : directory containing job history
part-r-00000 : output of reducer #0
part-r-00001 : output of reducer #1
part-r-0000n : output of nth reducer

67 Check Job Output
$ hadoop job -history wordcount-output
Hadoop job: 0010_ _glock
=====================================
...
Submitted At: 16-Mar :24:18
Launched At: 16-Mar :24:18 (0sec)
Finished At: 16-Mar :25:36 (1mins, 18sec)
...
Task Summary
============================
Kind Total Successful Failed Killed StartTime FinishTime
Setup Mar :24: Mar :24:23 (4sec)
Map Mar :24: Mar :25:14 (49sec)
Reduce Mar :24: Mar :25:29 (40sec)
Cleanup Mar :25: Mar :25:35 (4sec)
============================

68 Retrieve Job Output
$ hadoop dfs -get wordcount-output/part* .
$ ls -lrtph
...
-rw-r--r-- 1 glock sdsc 24M Mar 16 14:31 part-r-00000
$ hadoop dfs -rmr wordcount-output
Deleted hdfs://gcn ibnet0:54310/user/glock/wordcount-output
$ hadoop dfs -ls
Found 1 items
... glock supergroup ... /user/glock/data

69 Silly First Example 1. How many mappers were used? 2. How many reducers were used? 3. If Map and Reduce phases are real calculations, and Setup and Cleanup are "job overhead," what fraction of time was spent on overhead?

70 User-Level Configuration Changes Bigger HDFS block size; more reducers
$ hadoop dfs -Ddfs.block.size=134217728 -put ./gutenberg.txt data/gutenberg-128M.txt
$ hadoop fsck -blocks /user/glock/data/gutenberg-128M.txt
...
Total blocks (validated): 4 (avg. block size B)
...
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount -Dmapred.reduce.tasks=4 data/gutenberg-128M.txt output-128M
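A quick sanity check on the mapper counts, assuming one map task per HDFS block and the 407 MB gutenberg.txt shown earlier:

def n_blocks(size_mb, block_mb):
    """Ceiling division: how many blocks a file of size_mb needs."""
    return -(-size_mb // block_mb)

print(n_blocks(407, 64))    # 7  -> matches the 7 map tasks launched earlier
print(n_blocks(407, 128))   # 4  -> matches the 4 blocks reported by fsck above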

71 User-Level Configuration Changes 1. How many mappers were used before and after specifying the user-level tweaks? What caused this change? 2. How much speedup did these tweaks give? Any ideas why? 3. What happens if you use mapred.reduce.tasks=8 or =16? Any ideas why? 4. hadoop dfs -ls output-128M How many output files are there now? Why?

72 TestDFSIO Test to verify HDFS read/write performance. Generates its own input via the -write test. Make 8 files, each 1024 MB large:
$ hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 8 -fileSize 1024
14/03/16 16:17:20 INFO fs.TestDFSIO: nrFiles = 8
14/03/16 16:17:20 INFO fs.TestDFSIO: fileSize (MB) = 1024
14/03/16 16:17:20 INFO fs.TestDFSIO: bufferSize =
14/03/16 16:17:20 INFO fs.TestDFSIO: creating control file: 1024 mega bytes, 8 files
14/03/16 16:17:20 INFO fs.TestDFSIO: created control files for: 8 files
...
14/03/16 16:19:04 INFO fs.TestDFSIO: TestDFSIO : write
14/03/16 16:19:04 INFO fs.TestDFSIO: Date & time: Sun Mar 16 16:19:04 PDT 2014
14/03/16 16:19:04 INFO fs.TestDFSIO: Number of files: 8
14/03/16 16:19:04 INFO fs.TestDFSIO: Total MBytes processed:
14/03/16 16:19:04 INFO fs.TestDFSIO: Throughput mb/sec:
14/03/16 16:19:04 INFO fs.TestDFSIO: Average IO rate mb/sec:
14/03/16 16:19:04 INFO fs.TestDFSIO: IO rate std deviation:
14/03/16 16:19:04 INFO fs.TestDFSIO: Test exec time sec:

73 TestDFSIO Test the read performance: use -read instead of -write. Why such a big performance difference?
$ hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -read -nrFiles 8 -fileSize 1024
14/03/16 16:19:59 INFO fs.TestDFSIO: nrFiles = 8
14/03/16 16:19:59 INFO fs.TestDFSIO: fileSize (MB) = 1024
14/03/16 16:19:59 INFO fs.TestDFSIO: bufferSize =
14/03/16 16:19:59 INFO fs.TestDFSIO: creating control file: 1024 mega bytes, 8 files
14/03/16 16:20:00 INFO fs.TestDFSIO: created control files for: 8 files
...
14/03/16 16:20:40 INFO fs.TestDFSIO: TestDFSIO : read
14/03/16 16:20:40 INFO fs.TestDFSIO: Date & time: Sun Mar 16 16:20:40 PDT 2014
14/03/16 16:20:40 INFO fs.TestDFSIO: Number of files: 8
14/03/16 16:20:40 INFO fs.TestDFSIO: Total MBytes processed:
14/03/16 16:20:40 INFO fs.TestDFSIO: Throughput mb/sec:
14/03/16 16:20:40 INFO fs.TestDFSIO: Average IO rate mb/sec:
14/03/16 16:20:40 INFO fs.TestDFSIO: IO rate std deviation:
14/03/16 16:20:40 INFO fs.TestDFSIO: Test exec time sec:

74 HDFS Read/Write Performance 1. What part of the HDFS architecture might explain the difference in how long it takes to read 8x1 GB files vs. write 8x1 GB files? 2. What configuration have we covered that might alter this performance gap? 3. Adjust this parameter, restart the cluster, and re-try. How close can you get the read/write performance gap?

75 TeraSort Heavy-Duty Map/Reduce Job Standard benchmark to assess the performance of Hadoop's MapReduce and HDFS components.
teragen : generate a lot of random input data
terasort : sort teragen's output
teravalidate : verify that terasort's output is sorted

76 Running TeraGen
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 100000000 terasort-input
teragen : launch the teragen application
100000000 : how many 100-byte records to generate (100,000,000 ~ 9.3 GB)
terasort-input : location to store output data to be fed into terasort

77 Running TeraSort
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 100000000 terasort-input
Generating using 2 maps with step of
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar terasort terasort-input terasort-output
14/03/16 16:43:00 INFO terasort.TeraSort: starting
14/03/16 16:43:00 INFO mapred.FileInputFormat: Total input paths to process : 2
...

78 TeraSort Default Behavior Default settings are suboptimal: teragen uses 2 maps, so the TeraSort input is split between 2 files; terasort uses 1 reducer, so the reduction is serial. Try optimizing TeraSort: double dfs.block.size to 134217728 (128MB); increase mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to increase concurrency; launch with more reducers (-Dmapred.reduce.tasks=XX).

79 Summary of Commands
### Request two nodes for four hours on Gordon
$ qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal
### Configure $HADOOP_CONF_DIR on Gordon
$ myhadoop-configure.sh
### Hadoop control script to start all nodes
$ start-all.sh
### Verify HDFS is online
$ hadoop dfsadmin -report
### Copy file to HDFS
$ hadoop dfs -put somelocalfile hdfsdir/
### View file information on HDFS
$ hadoop fsck -blocks hdfsdir/somelocalfile
### Run a map/reduce job
$ hadoop jar somejarfile.jar -option1 -option2
### View job info after it completes
$ hadoop job -history hdfsdir/outputdir
### Shut down all Hadoop nodes
$ stop-all.sh
### Copy logfiles back from nodes
$ myhadoop-cleanup.sh

80 Anagram Example Source: Uses Map-Reduce approach to process a file with a list of words, and identify all the anagrams in the file Code is written in Java. Example has already been compiled and the resulting jar file is in the example directory.

81 Anagram Map Class (Detail)
String word = value.toString();              // Convert the input value to a string
char[] wordChars = word.toCharArray();       // Assign it to a character array
Arrays.sort(wordChars);                      // Sort the array of characters
String sortedWord = new String(wordChars);   // Create a new string with the sorted characters
sortedText.set(sortedWord);
originalText.set(word);
outputCollector.collect(sortedText, originalText);
// Prepare and output the sorted text string (serves as the key) and the original text string.

82 Anagram Map Class (Detail) Consider file with list of words : alpha, hills, shill, truck Alphabetically sorted words : aahlp, hills, hills, ckrtu Hence after the Map Step is done, the following key pairs would be generated: (aahlp,alpha) (hills,hills) (hills,shill) (ckrtu,truck)

83 Anagram Example Reducer (Detail)
while (anagramValues.hasNext()) {
    Text anagram = anagramValues.next();
    output = output + anagram.toString() + "~";
}
// Iterate over all the values for a key; hasNext() is the Java iterator method that allows this.
// We are also creating an output string in which all the words are separated with ~.
StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
// The StringTokenizer class stores this delimited string and can count the number of tokens.
if (outputTokenizer.countTokens() >= 2) {
    output = output.replace("~", ",");
    outputKey.set(anagramKey.toString());
    outputValue.set(output);
    results.collect(outputKey, outputValue);
}
// We output the anagram key and the word list if the number of tokens is >= 2 (i.e. we have an anagram pair).

84 Anagram Reducer Class (Detail) For our example set, the input to the Reducers is: (aahlp,alpha) (hills,hills) (hills,shill) (ckrtu,truck) The only key with #tokens >=2 is <hills>. Hence, the Reducer output will be: hills hills, shill,

85 Anagram Example Submit Script
#!/bin/bash
#PBS -N Anagram
#PBS -q normal
#PBS -l nodes=2:ppn=16:native
#PBS -l walltime=01:00:00
#PBS -m e
#PBS -M [email protected]
#PBS -V
#PBS -v Catalina_maxhops=None
cd $PBS_O_WORKDIR
myhadoop-configure.sh
start-all.sh
hadoop dfs -mkdir input
hadoop dfs -copyFromLocal $PBS_O_WORKDIR/SINGLE.TXT input/
hadoop jar $PBS_O_WORKDIR/AnagramJob.jar input/SINGLE.TXT output
hadoop dfs -copyToLocal output/part* $PBS_O_WORKDIR
stop-all.sh
myhadoop-cleanup.sh

86 Anagram Example Sample Output
$ cat part-r-00000
aabcdelmnu manducable,ambulanced,
aabcdeorrsst broadcasters,rebroadcasts,
aabcdeorrst rebroadcast,broadcaster,
aabcdkrsw drawbacks,backwards,
aabcdkrw drawback,backward,
aabceeehlnsst teachableness,cheatableness,
aabceeelnnrsstu uncreatableness,untraceableness,
aabceeelrrt recreatable,retraceable,
aabceehlt cheatable,teachable,
aabceellr lacerable,clearable,
aabceelnrtu uncreatable,untraceable,
aabceelorrrstv vertebrosacral,sacrovertebral,

87 Hadoop Streaming Programming Hadoop without Java

88 Hadoop and Python Hadoop streaming w/ Python mappers/reducers: portable; most difficult (or least difficult) to use; you are the glue between Python and Hadoop. mrjob (or others: hadoopy, dumbo, etc.): comprehensive integration; a Python interface to Hadoop streaming; can interface directly with Amazon. Analogous interface libraries exist in R and Perl.
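As a hedged illustration of the mrjob route (mrjob would need to be installed separately, e.g. with pip; it is not part of the Gordon setup described in these slides), the whole word count fits in one small class:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum all the 1s emitted for this word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Saved as, say, wordcount_mrjob.py (a hypothetical filename), it runs locally with "python wordcount_mrjob.py pg2701.txt" and against a Hadoop cluster with the -r hadoop runner.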

89 Wordcount Example

90 Hadoop Streaming with Python "Simplest" (most portable) method. Uses raw Python and Hadoop; you are the glue:
cat input.txt | mapper.py | sort | reducer.py > output.txt
You provide these two scripts; Hadoop does the rest. Generalizable to any language you want (Perl, R, etc.)

91 Spin Up a Cluster
$ idev
qsub: waiting for job gordon-fe2.local to start
qsub: job gordon-fe2.local ready
$ myhadoop-configure.sh
myhadoop: Keeping HADOOP_HOME=/home/diag/opt/hadoop/hadoop from user environment
myhadoop: Setting MH_IPOIB_TRANSFORM='s/$/.ibnet0/' from myhadoop.conf
...
$ start-all.sh
starting namenode, logging to /scratch/train18/ gordon-fe2.local/logs/hadoop-train18-namenode-gcn sdsc.edu.out
gcn ibnet0: starting datanode, logging to /scratch/train18/ gordon-fe2.local/logs/hadoop-train18-datanode-gcn sdsc.edu.out
$ cd PYTHON/streaming
$ ls
README mapper.py pg2701.txt pg996.txt reducer.py

92 What One Mapper Does
line = "Call me Ishmael. Some years ago--never mind how long"
keys = Call me Ishmael. Some years ago--never mind how long
emit.keyval(key, value)...
Call 1  me 1  Ishmael. 1  Some 1  years 1  ago--never 1  mind 1  how 1  long 1
... to the reducers

93 Wordcount: Hadoop streaming mapper
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( '%s\t%d' % (key, value) )

94 Reducer Loop Logic For each key/value pair...
If this key is the same as the previous key,
- add this key's value to our running total.
Otherwise,
- print out the previous key's name and the running total,
- reset our running total to 0,
- add this key's value to the running total, and
- "this key" is now considered the "previous key"

95 Wordcount: Streaming Reducer (1/2)
#!/usr/bin/env python
import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)
    (to be continued...)

96 Wordcount: Streaming Reducer (2/2)
    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )

97 Testing Mappers/Reducers Debugging Hadoop is not fun, so test the scripts at the command line first:
$ head -n100 pg2701.txt | ./mapper.py | sort | ./reducer.py
...
with 5
word, 1
world
you 3
The 1

98 Launching Hadoop Streaming
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=2 -mapper "$PWD/mapper.py" -reducer "$PWD/reducer.py" -input mobydick.txt -output output
.../hadoop-streaming-1.2.1.jar : Hadoop Streaming jarfile
-mapper $PWD/mapper.py : Mapper executable
-reducer $PWD/reducer.py : Reducer executable
-input mobydick.txt : location (on HDFS) of job input
-output output : location (on HDFS) of output dir

99 Tools/Applications w/ Hadoop Several open source projects utilize the Hadoop infrastructure. Examples: HIVE - a data warehouse infrastructure providing data summarization and querying. HBase - scalable distributed database. PIG - high-level data-flow and execution framework. ElephantDB - distributed database specializing in exporting key/value data from Hadoop. Some of these projects have been tried on Gordon by users.

100 Apache Mahout Apache project to implement scalable machine learning algorithms. Algorithms implemented: Collaborative Filtering User and Item based recommenders K-Means, Fuzzy K-Means clustering Mean Shift clustering Dirichlet process clustering Latent Dirichlet Allocation Singular value decomposition Parallel Frequent Pattern mining Complementary Naive Bayes classifier Random forest decision tree based classifier

101 Mahout Example Recommendation Mining Collaborative Filtering algorithms aim to solve the prediction problem where the task is to estimate the preference of a user towards an item which he/she has not yet seen. Item-based Collaborative Filtering estimates a user's preference towards an item by looking at his/her preferences towards similar items. Mahout has two Map/Reduce jobs to support item-based Collaborative Filtering: ItemSimilarityJob - computes similarity values based on input preference data. RecommenderJob - uses input preference data to determine recommended item ids and scores; can compute the top N recommendations for a user, for example. Collaborative Filtering is very popular (for example, used by Google and Amazon in their recommendations).

102 Mahout Recommendation Mining Once the preference data is collected, a user-item matrix is created. The task is to predict the missing entries based on the known data: estimate a user's preference for a particular item based on their preferences for similar items. [Example table: a small user-item matrix (Item 1, Item 2, Item 3, ... by User 1, User 2, ...) with known ratings such as 5 and 2 and a "?" marking the entry to predict.] More details and links to informative presentations at:
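The prediction step itself is easy to sketch with made-up numbers (the ratings and item-item similarities below are assumptions for illustration, not output from Mahout): predict a user's missing rating as a similarity-weighted average of that user's known ratings.

def predict(user_ratings, similarities):
    """user_ratings: {item: known rating}; similarities: {item: similarity to the target item}."""
    num = sum(similarities[item] * rating for item, rating in user_ratings.items())
    den = sum(abs(similarities[item]) for item in user_ratings)
    return num / den

known = {"Item 2": 2.0, "Item 3": 5.0}          # ratings the user has already given (assumed)
sims_to_item1 = {"Item 2": 0.3, "Item 3": 0.8}  # similarity of each rated item to Item 1 (assumed)
print(round(predict(known, sims_to_item1), 2))  # 4.18 -> Item 1 looks like a good recommendation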

103 Mahout Hands on Example Recommendation Job example: Mahout.cmd. qsub Mahout.cmd to submit the job. The script also illustrates the use of a custom hadoop configuration file: customized info is in the mapred-site.xml.stub file, which is copied into the configuration used.

104 Mahout Hands On Example The input file is a comma separated list of userids, itemids, and preference scores (test.data). As mentioned in an earlier slide, all users need not score all items. In this case there are 5 users and 4 items. The users.txt file provides the list of users for whom we need recommendations. The output from the Mahout recommender job is a file with userids and recommendations (with scores). The show_reco.py file takes that output and combines it with the user and item info from another input file (items.txt) to produce detailed recommendation info. This segment runs outside of the hadoop framework.
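A small helper in the same spirit as the pre/post-processing around the recommender (this is not the actual show_reco.py; it only relies on the comma-separated userid,itemid,score format described above) can list, per user, the items that have no rating yet, i.e. the entries the recommender is asked to fill in:

from collections import defaultdict

ratings = defaultdict(dict)                     # user -> {item: score}
all_items = set()
with open("test.data") as f:                    # "userid,itemid,score" per line
    for line in f:
        user, item, score = line.strip().split(",")
        ratings[user][item] = float(score)
        all_items.add(item)

for user in sorted(ratings):
    missing = sorted(all_items - set(ratings[user]))
    print("User %s rated %d items; unrated (candidates for recommendation): %s"
          % (user, len(ratings[user]), ", ".join(missing) or "none"))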

105 Mahout Sample Output
User ID : 2
Rated Items
Item 2, rating=2
Item 3, rating=5
Item 4, rating=
Recommended Items
Item 1, score=

106 PIG PIG - high-level programming on top of Hadoop map/reduce. SQL-like data flow operations. Essentially, keys and values are abstracted to fields and more complex data structures.

107 Hot off the press on Gordon! Hadoop2 (v2.2.0) installed and available via a module. Apache Spark installed and available via a module. Configured to run with Hadoop2. myhadoop extended to handle Hadoop2/YARN, and Spark.

108 Summary Hadoop framework extensively used for scalable distributed processing of large datasets. Several tools available that leverage the framework. HDFS provides scalable filesystem tailored for accelerating MapReduce performance. Hadoop Streaming can be used to write mapper/reduce functions in any language. Data warehouse (HIVE), scalable distributed database (HBASE), and execution framework (PIG) available. Mahout w/ Hadoop provides scalable machine learning tools.

109 Exercises: Log in to gordon PIG (Hands On)
> qsub -I -q normal -lnodes=2:ppn=1:native,walltime=2:00:00 -A ######
> cd Pig2
> ps axu | grep "p4rod"
> source Setup_pigtwt.cmd
> source Setup_Data4pig.cmd
> ps axu | grep "p4rod"
> pig
grunt>

110 PIG Hands On - Output > ls Cleanup_pigtwt.cmd Run_pigtwtsrc.qsub hadoophosts.txt pig_wds_grp_cnt.txt Doscript_pigtwt.cmd SetupData4pig.cmd pig_ log rawtw_dtmsg.csv GetDataFrompig.cmd Setup_pigtwt.cmd pig_out_mon_day_cnt.txt tw_script1.pig Setup_4twt gcn ibnet0: starting tasktracker, logging to /scratch/p4rodrig/ gordon-fe2.local/hadoop-p4rodrig/log/hadoop-p4rodrigtasktracker-gcn sdsc.edu.out Warning: $HADOOP_HOME is deprecated Set up Data :11:54,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.hexecutionengine - Connecting to hadoop file system at: hdfs://gcn ibnet0: :11:54,727 [main] INFO org.apache.pig.backend.hadoop.executionengine.hexecutionengine - Connecting to map-reduce job tracker at: gcn ibnet0:54311

111 PIG (Hands On)
# watch out for _=_ : keep spacing around the equal sign
raw = LOAD 'pigdata2load.txt' USING PigStorage(',') AS (mon_day,year,hour,msg,place,retwtd,uid,flers_cnt,fling_cnt);
tmp = LIMIT raw 5;
DUMP tmp;
describe raw
rawstr = LOAD 'pigdata2load.txt' USING PigStorage(',') AS (mon_day:chararray,year:int,hour:chararray,msg:chararray,place:chararray,retwted:chararray,uid:int,flers_cnt:int,fling_cnt:int);
describe rawstr

112 PIG Hands On (output) 1 dump tmp INFO org.apache.pig.backend.hadoop.executionengine.util.mapredutil - Total input paths to process : 1 (Apr 03,2012,09:11:56,havent you signe the petition yet hrd alkhawaja in his 55 day of hungerstrike well send it to obama httpstco8ww2sfzx bahrain,null,false, ,185,null ) (Apr 08,2012,07:02:08,rt azmoderate uc chuckgrassley pres obama has clearly demonstrated his intelligence you on the other hand look sound like you shou,null,false, ,904,null ) (Apr 12,2012,13:02:44,rt dcgretch 2 describe raw raw: {mon_day: bytearray,year: bytearray,hour: bytearray,msg: bytearray,place: bytearray,retwtd: bytearray,uid: bytearray,flers_cnt: bytearray,fling_cnt: bytearray} 3 describe rawstr rawstr: {mon_day: chararray,year: int,hour: chararray,msg: chararray,place: chararray,retwted: chararray,uid: int,flers_cnt: int,fling_cnt: int}

113 PIG (Hands On)
rawbymon_day = GROUP rawstr BY mon_day;
-- mon_day_cnt = FOREACH rawbymon_day GENERATE group,COUNT(*); -- returns an error about a cast
mon_day_cnt = FOREACH rawbymon_day GENERATE group,COUNT(rawstr);
tmpc = LIMIT mon_day_cnt 5;
DUMP tmpc;
STORE mon_day_cnt INTO 'pig_out_mon_day_cnt.txt';

114 PIG Hands On - output INFO org.apache.pig.tools.pigstats.simplepigstats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features p4rodrig :19: :20:35 GROUP_BY Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_ _ mon_day_cnt,rawbymon_day,rawstr GROUP_BY,COMBINER hdfs://gcn ibnet0:54310/user/ p4rodrig/pig_out_mon_day_cnt.txt, Input(s): Successfully read 109 records (13996 bytes) from: "hdfs://gcn ibnet0:54310/user/p4rodrig/ pigdata2load.txt"

115 PIG (Hands On)
raw_msg = FOREACH rawstr {mwds=TOKENIZE(msg); GENERATE $0,$1,$2,mwds;};
tmpd = LIMIT raw_msg 5;
DUMP tmpd;
raw_msgf = FOREACH rawstr {mwds=TOKENIZE(msg); GENERATE $0,$1,$2,FLATTEN(mwds);};
tmpd = LIMIT raw_msg 5;
DUMP tmpd;

116 PIG Hands On - output First dump (Apr 03,2012,09:11:56,{(havent),(you),(signe),(the),(petition),(yet),(hrd),(alkhawaja),(in),(his),(55),(day),(of), (hungerstrike),(well),(send),(it),(to),(obama),(httpstco8ww2sfzx),(bahrain)}) (Apr 08,2012,07:02:08,{(rt),(azmoderate),(uc),(chuckgrassley),(pres),(obama),(has),(clearly),(demonstrated), (his),(intelligence),(you),(on),(the),(other),(hand),(look),(sound),(like),(you),(shou)}) Second dump (Apr 03,2012,09:11:56,{(havent),(you),(signe),(the),(petition),(yet),(hrd),(alkhawaja),(in),(his),(55),(day), (of),(hungerstrike),(well),(send),(it),(to),(obama),(httpstco8ww2sfzx),(bahrain)}) (Apr 08,2012,07:02:08,{(rt),(azmoderate),(uc),(chuckgrassley),(pres),(obama),(has),(clearly), (demonstrated),(his),(intelligence),(you),(on),(the),(other),(hand),(look),(sound),(like),(you),(shou)})


Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved. Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - -

The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - - The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - - Hadoop Implementation on Riptide 2 Table of Contents Executive

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

5 HDFS - Hadoop Distributed System

5 HDFS - Hadoop Distributed System 5 HDFS - Hadoop Distributed System 5.1 Definition and Remarks HDFS is a file system designed for storing very large files with streaming data access patterns running on clusters of commoditive hardware.

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015 CS242 PROJECT Presented by Moloud Shahbazi Spring 2015 AGENDA Project Overview Data Collection Indexing Big Data Processing PROJECT- PART1 1.1 Data Collection: 5G < data size < 10G Deliverables: Document

More information

Installation and Configuration Documentation

Installation and Configuration Documentation Installation and Configuration Documentation Release 1.0.1 Oshin Prem October 08, 2015 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

Introduction to HDFS. Prasanth Kothuri, CERN

Introduction to HDFS. Prasanth Kothuri, CERN Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Extreme computing lab exercises Session one

Extreme computing lab exercises Session one Extreme computing lab exercises Session one Miles Osborne (original: Sasa Petrovic) October 23, 2012 1 Getting started First you need to access the machine where you will be doing all the work. Do this

More information

CSE-E5430 Scalable Cloud Computing. Lecture 4

CSE-E5430 Scalable Cloud Computing. Lecture 4 Lecture 4 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 5.10-2015 1/23 Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

A. Aiken & K. Olukotun PA3

A. Aiken & K. Olukotun PA3 Programming Assignment #3 Hadoop N-Gram Due Tue, Feb 18, 11:59PM In this programming assignment you will use Hadoop s implementation of MapReduce to search Wikipedia. This is not a course in search, so

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System The Hadoop Distributed File System, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo, 2010 Agenda Topic 1: Introduction Topic 2: Architecture

More information

Sriram Krishnan, Ph.D. [email protected]

Sriram Krishnan, Ph.D. sriram@sdsc.edu Sriram Krishnan, Ph.D. [email protected] (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce

More information

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Lab 9: Hadoop Development The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Introduction Hadoop can be run in one of three modes: Standalone

More information

Single Node Hadoop Cluster Setup

Single Node Hadoop Cluster Setup Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email [email protected] if you have questions or need more clarifications. Nilay

More information

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Hadoop Distributed File System. Dhruba Borthakur June, 2007 Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS

More information

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Can High-Performance Interconnects Benefit Memcached and Hadoop? Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions Hadoop Lab - Setting a 3 node Cluster Packages Hadoop Packages can be downloaded from: http://hadoop.apache.org/releases.html Java - http://wiki.apache.org/hadoop/hadoopjavaversions Note: I have tested

More information

Data Science Analytics & Research Centre

Data Science Analytics & Research Centre Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information