Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer

1 Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer Glenn Lockwood and Mahidhar Tatineni User Services Group San Diego Supercomputer Center XSEDE14, Atlanta July 14, 2014

2 Overview Hadoop framework extensively used for scalable distributed processing of large datasets. Hadoop is built to process data on the order of several hundred gigabytes to several terabytes (and even petabytes at the extreme end). Data sizes are much bigger than the capacities (both disk and memory) of individual nodes. Under the Hadoop Distributed Filesystem (HDFS), data is split into chunks which are managed by different nodes. Data chunks are replicated across several machines to provide redundancy in case of an individual node failure. Processing must conform to the MapReduce programming model. Processes are scheduled close to the location of the data chunks being accessed.
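As a rough illustration of the chunking and replication idea (a conceptual sketch only, not Hadoop's actual placement policy; the node names, block size, and file size are made up for the example):

import itertools

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB, the HDFS default discussed later
REPLICATION = 3                        # default HDFS replication factor
nodes = ["node1", "node2", "node3", "node4"]

def place_blocks(file_size_bytes):
    """Split a file into fixed-size blocks and pick 3 nodes to hold each block."""
    n_blocks = -(-file_size_bytes // BLOCK_SIZE)       # ceiling division
    node_cycle = itertools.cycle(nodes)
    return {block: [next(node_cycle) for _ in range(REPLICATION)]
            for block in range(n_blocks)}

# A 200 MB file becomes 4 blocks, each stored on 3 different nodes.
print(place_blocks(200 * 1024 * 1024))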

3 Hadoop: Application Areas Hadoop is widely used in data intensive analysis. Some application areas include:
- Log aggregation and processing
- Video and image analysis
- Data mining, machine learning
- Indexing
- Recommendation systems
Data intensive scientific applications can make use of the Hadoop MapReduce framework. Application areas include:
- Bioinformatics and computational biology
- Astronomical image processing
- Natural language processing
- Geospatial data processing
Some example projects:
- Genetic algorithms, particle swarm optimization, ant colony optimization
- Big data for business analytics (class)
- Hadoop for remote sensing analysis
Extensive list online at:

4 Hadoop Architecture [Diagram: a master node (Jobtracker, Namenode) connected through the network/switching fabric to compute/datanodes, each running a Tasktracker.] Map/Reduce framework: software to enable distributed computation; the Jobtracker schedules and manages map/reduce tasks, and the Tasktracker executes the tasks on the nodes. HDFS distributed filesystem: metadata is handled by the Namenode; files are split up and stored on datanodes (typically local disk). Scalable and fault tolerant. Replication is done asynchronously.

5 Hadoop Workflow

6 Map Reduce Basics The MapReduce algorithm is a scalable (thousands of nodes) approach to processing large volumes of data. The most important aspect is to restrict arbitrary sharing of data. This minimizes the communication overhead which typically constrains scalability. The MapReduce algorithm has two components: mapping and reducing. The first step takes each individual input element and maps it to an output data element using some function. The second, reducing, step aggregates the values from step 1 using a reducer function. In the Hadoop framework both functions receive (key, value) pairs and output (key, value) pairs. A simple word count example illustrates this process in the following slides.
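The data flow can be sketched in a few lines of plain Python (a single-process illustration only; Hadoop distributes each phase across nodes). It uses the same two input lines as the wordcount example on the next slides:

from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word."""
    for word in text.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort: group all values belonging to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    for key, values in groups:
        yield (key, sum(values))

pairs = list(map_phase("Hello World Bye World")) + list(map_phase("Hello Hadoop Goodbye Hadoop"))
print(list(reduce_phase(shuffle(pairs))))
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]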

7 Simple Example From Apache Site* Simple wordcount example. Code details: Functions defined - Wordcount map class : reads file and isolates each word - Reduce class : counts words and sums up Call sequence - Mapper class - Combiner class (Reduce locally) - Reduce class - Output *

8 Simple Example From Apache Site* Simple wordcount example. Two input files. File 1 contains: Hello World Bye World File 2 contains: Hello Hadoop Goodbye Hadoop Assuming we use two map tasks (one for each file). Step 1: Map read/parse tasks are complete. Result: Task 1 <Hello, 1> <World, 1> <Bye, 1> <World, 1> Task 2 < Hello, 1> < Hadoop, 1> < Goodbye, 1> <Hadoop, 1> *

9 Simple Example (Contd) Step 2 : Combine on each node, sorted: Task 1 <Bye, 1> <Hello, 1> <World, 2> Task 2 < Goodbye, 1> < Hadoop, 2> <Hello, 1> Step 3 : Global reduce: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

10 Map/Reduce Execution Process Components Input / Map () / Shuffle / Sort / Reduce () / Output Jobtracker determines number of splits (configurable). Jobtracker selects compute nodes for tasks based on network proximity to data sources. Tasktracker on each compute node manages the tasks assigned and reports back to jobtracker when task is complete. As map tasks complete jobtracker notifies selected task trackers for reduce phase. Job is completed once reduce phase is complete.

11 Hadoop MapReduce Data pipeline Ref: "Hadoop Tutorial from Yahoo!" by Yahoo! Inc. is licensed under a Creative Commons Attribution 3.0 Unported License

12 HDFS Architecture [Diagram: HDFS clients performing writes and reads; the NameNode holds the metadata (name, replicas), e.g. /home/user1/file1, 3 and /home/user1/file2, 3; blocks are stored and replicated across the DataNodes.]

13 HDFS Overview HDFS is a block-structured filesystem. Files are split up into blocks of a fixed size (configurable) and stored across the cluster on several nodes. The metadata for the filesystem is stored on the NameNode. Typically the metadata info is cached in the NameNode s memory for fast access. The NameNode provides the list of locations (DataNodes) of blocks for a file and the clients can read directly from the location (without going through the NameNode). The HDFS namespace is completely separate from the local filesystems on the nodes. HDFS has its own tools to list, copy, move, and remove files.

14 HDFS Performance Envelope HDFS is optimized for large sequential streaming reads. The default block size is 64MB (in comparison, typical block-structured filesystems use 4-8KB, and Lustre uses ~1MB). The default replication is 3. Typical HDFS applications have a write-once-read-many access model. This is taken into account to simplify the data coherency model used. The large block size also means poor random seek times and poor performance with small reads/writes. Additionally, given the large file sizes, there is no mechanism for local caching (it is faster to re-read the data from HDFS). Given these constraints, HDFS should not be used as a general-purpose distributed filesystem for a wide application base.
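To make the block-size discussion concrete, a quick back-of-the-envelope calculation (the 10 GB file size is just an illustrative assumption) shows how the block size determines the number of blocks the NameNode must track:

GiB = 1024 ** 3

def n_blocks(file_size, block_size):
    """Number of blocks needed to store a file (ceiling division)."""
    return -(-file_size // block_size)

for block_size, label in [(4 * 1024, "4 KB"), (1024 * 1024, "1 MB"), (64 * 1024 * 1024, "64 MB")]:
    print("%6s blocks for a 10 GB file: %s" % (label, n_blocks(10 * GiB, block_size)))
# 4 KB: 2,621,440 blocks; 1 MB: 10,240 blocks; 64 MB: 160 blocks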

15 HDFS Configuration Configuration files are in the main hadoop config directory. The file core-site.xml or hdfs-site.xml can be used to change settings. The config directory can either be replicated across a cluster or placed in a shared filesystem (can be an issue if the cluster gets big). Configuration settings are a set of key-value pairs: <property> <name>property-name</name> <value>property-value</value> </property> Adding the line <final>true</final> prevents user override.

16 HDFS Configuration Parameters
fs.default.name - The address which describes the NameNode for the cluster. For example: hdfs://gcn-4-31.ibnet0:54310
dfs.data.dir - The path on the local filesystem in which the DataNode stores data. Defaults to the hadoop tmp location.
dfs.name.dir - Where the NameNode metadata is stored. Again defaults to the hadoop tmp location.
dfs.replication - Default replication for each block of data. Default value is 3.
dfs.blocksize - Default block size. Default value is 64MB.
Lots of other options. See the Apache site:

17 HDFS: Listing Files Reminder: Local filesystem commands don't work on HDFS. A simple ls will just end up listing your local files. Need to use HDFS commands. For example:
$ hadoop dfs -ls /
Warning: $HADOOP_HOME is deprecated.
Found 1 items
drwxr-xr-x - mahidhar supergroup :00 /scratch

18 Copying files into HDFS First make some directories in HDFS:
$ hadoop dfs -mkdir /user
$ hadoop dfs -mkdir /user/mahidhar
Do a local listing (in this case we have a large 50+ GB file):
$ ls -lt
total
-rw-r--r-- 1 mahidhar hpss Nov 6 14:48 test.file
Copy it into HDFS:
$ hadoop dfs -put test.file /user/mahidhar/
$ hadoop dfs -ls /user/mahidhar/
Warning: $HADOOP_HOME is deprecated.
Found 1 items
-rw-r--r-- 3 mahidhar supergroup :10 /user/mahidhar/test.file

19 DFS Command List
-ls path : Lists contents of directory
-lsr path : Recursive display of contents
-du path : Shows disk usage in bytes
-dus path : Summary of disk usage
-mv src dest : Move files or directories within HDFS
-cp src dest : Copy files or directories within HDFS
-rm path : Removes the file or empty directory in HDFS
-rmr path : Recursively removes file or directory
-put localsrc dest (also -copyFromLocal) : Copy file from local filesystem into HDFS
-get src localdest : Copy from HDFS to local filesystem
-getmerge src localdest : Copy from HDFS and merge into a single local file
-cat filename : Display contents of HDFS file
-tail file : Shows the last 1KB of HDFS file on stdout
-chmod [-R] : Change file permissions in HDFS
-chown [-R] : Change ownership in HDFS
-chgrp [-R] : Change group in HDFS
-help : Returns usage info

20 HDFS dfsadmin options
hadoop dfsadmin -report : Generate status report for HDFS
hadoop dfsadmin -metasave filename : Metadata info is saved to file. Cannot be used for restore.
hadoop dfsadmin -safemode options : Move HDFS to read-only mode
hadoop dfsadmin -refreshNodes : Change HDFS membership. Useful if nodes are being removed from the cluster.
Upgrade options : Status/finalization
hadoop dfsadmin -help : Help info

21 HDFS fsck
$ hadoop fsck -blocks /user/mahidhar/test.file.4gb
Warning: $HADOOP_HOME is deprecated.
FSCK started by mahidhar from / for path /user/mahidhar/test.file.4gb at Wed Nov 06 19:32:30 PST 2013
Status: HEALTHY
Total size: B
Total dirs: 0
Total files: 1
Total blocks (validated): 64 (avg. block size B)
Minimally replicated blocks: 64 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
FSCK ended at Wed Nov 06 19:32:30 PST 2013 in 36 milliseconds
The filesystem under path '/user/mahidhar/test.file.4gb' is HEALTHY

22 HDFS Distributed Copy Useful to migrate large numbers of files. Syntax: hadoop distcp src dest Sources/destinations can be: S3 URLs, HDFS URLs, a filesystem [useful if it's a parallel filesystem like Lustre]

23 HDFS Architecture: Summary HDFS optimized for large block I/O. Replication(3) turned on by default. HDFS specific commands to manage files and directories. Local filesystem commands will not work in HDFS. HDFS parameters set in xml configuration files. DFS parameters can also be passed with commands. fsck function available for verification. Block rebalancing function available. Java based HDFS API can be used for access within applications. Library available for functionality in C/C++ applications.

24 Useful Links Yahoo tutorial on HDFS: Apache HDFS guide: Brad Hedlund's article on Hadoop architecture:

25 Hadoop and HPC PROBLEM: domain scientists/researchers aren't using Hadoop
- Hadoop is commonly used in the data analytics industry but not so common in domain science academic areas.
- Java is not always high-performance.
- Hurdles for domain scientists to learn Java and Hadoop tools.
SOLUTION: make Hadoop easier for HPC users
- use existing HPC clusters and software
- use Perl/Python/C/C++/Fortran instead of Java
- make starting Hadoop as easy as possible

26 Compute: Traditional vs. Data-Intensive Traditional HPC: CPU-bound problems. Solution: OpenMP- and MPI-based parallelism. Data-Intensive: IO-bound problems. Solution: Map/reduce-based parallelism.

27 Architecture for Both Workloads PROs: high-speed interconnect; complementary object storage; fast CPUs, RAM; less faulty. CONs: nodes aren't storage-rich; transferring data between HDFS and object storage* (* unless using Lustre, S3, etc. backends)

28 Add Data Analysis to Existing Compute Infrastructure

29 Add Data Analysis to Existing Compute Infrastructure

30 Add Data Analysis to Existing Compute Infrastructure

31 Add Data Analysis to Existing Compute Infrastructure

32 myhadoop 3-step Install (not reqd. for today's workshop)
1. Download Apache Hadoop 1.x and myhadoop 0.30
$ wget hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
$ wget myhadoop-0.30.tar.gz
2. Unpack both Hadoop and myhadoop
$ tar zxvf hadoop-1.2.1-bin.tar.gz
$ tar zxvf myhadoop-0.30.tar.gz
3. Apply the myhadoop patch to Hadoop
$ cd hadoop-1.2.1/conf
$ patch < ../myhadoop-0.30/myhadoop-1.2.1.patch

33 myhadoop 3-step Cluster
1. Set a few environment variables
# sets HADOOP_HOME, JAVA_HOME, and PATH
$ module load hadoop
$ export HADOOP_CONF_DIR=$HOME/mycluster.conf
2. Run myhadoop-configure.sh to set up Hadoop
$ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID
3. Start cluster with Hadoop's start-all.sh
$ start-all.sh

34 Advanced Features - Usability System-wide default configurations in myhadoop-0.30/conf/myhadoop.conf: MH_SCRATCH_DIR - specify location of node-local storage for all users; MH_IPOIB_TRANSFORM - specify a regex to transform node hostnames into IP over InfiniBand hostnames. Users can remain totally ignorant of scratch disks and InfiniBand: literally define HADOOP_CONF_DIR and run myhadoop-configure.sh with no parameters; myhadoop figures out everything else.
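For example, what a transform like MH_IPOIB_TRANSFORM='s/$/.ibnet0/' does can be reproduced in a couple of lines (myhadoop itself applies it with sed; the hostnames here are made-up placeholders):

import re

hostnames = ["gcn-13-60", "gcn-13-61"]          # example compute-node names
ipoib = [re.sub(r"$", ".ibnet0", h, count=1) for h in hostnames]
print(ipoib)                                    # ['gcn-13-60.ibnet0', 'gcn-13-61.ibnet0']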

35 Advanced Features - Usability Parallel filesystem support: HDFS on Lustre via myhadoop persistent mode (-p); direct Lustre support (IDH); no performance loss at smaller scales for HDFS on Lustre. Resource managers supported in a unified framework: Torque 2.x and 4.x (tested on SDSC Gordon), SLURM 2.6 (tested on TACC Stampede), Grid Engine. Can support LSF, PBSpro, Condor easily (need testbeds).

36 Hadoop-RDMA Network-Based Computing Lab (Ohio State University) HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE). Based on Apache Hadoop Can use myhadoop to partially set up configuration. Need to add to some of the configuration files. Location on Gordon: /home/diag/opt/hadoop-rdma/hadoop-rdma More details :

37 Data-Intensive Performance Scaling 8-node Hadoop cluster on Gordon 8 GB VCF (Variant Calling Format) file 9x speedup using 2 mappers/node

38 Data-Intensive Performance Scaling 7.7 GB of chemical evolution data, 270 MB/s processing rate

39 Hadoop Installation & Configuration USING GORDON

40 Logging into Gordon Mac/Linux: ssh <username>@gordon.sdsc.edu Windows (PuTTY): connect to gordon.sdsc.edu

41 Example Locations Today s examples are located in the training account home directories: ~/XSEDE14 Following subdirectories should be visible: ls ~/XSEDE14 ANAGRAM GUTENBERG mahout Pig2 streaming

42 What is Gordon?
1024 compute nodes, 64 IO nodes
16 cores of Intel Xeon E5 per node, DDR3 DRAM
300 GB enterprise SSD
QDR InfiniBand: 40 Gbps QDR 4X (cf. 10 Gb Ethernet), two rails, 3D torus topology
Torque batch system

43 Request Nodes with Torque
Rocks 6.1 (Emerald Boa) Profile built 13: Aug Kickstarted 13: Aug [ASCII-art login banner omitted]
gordon-ln1:~$ qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal
qsub: waiting for job gordon-fe2.local to start
qsub: job gordon-fe2.local ready
gcn-3-15:~$

44 Request Nodes with Torque
qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal -v Catalina_maxhops=None,QOS=0
-I : request Interactive access to your nodes
-l nodes=2 : request two nodes
ppn=16 : request 16 processors per node
native : request native nodes (not vSMP VMs)
flash : request nodes with SSDs
walltime=4:00:00 : reserve these nodes for four hours
-q normal : not requesting any special queues
Use the idev alias so you don't have to memorize this!

45 Hadoop Installation & Configuration GENERAL INSTALLATION AND CONFIGURATION

46 Installing Hadoop on Linux
1. Download Hadoop tarball from Apache
$ wget hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
...
Length: (36M) [application/x-gzip]
Saving to: hadoop-1.2.1-bin.tar.gz.2
85% [================================> ] 32,631, M/s eta 2s
2. Unpack tarball
$ tar zxvf hadoop-1.2.1-bin.tar.gz
hadoop-1.2.1/...
hadoop-1.2.1/src/contrib/ec2/bin/list-hadoop-clusters
hadoop-1.2.1/src/contrib/ec2/bin/terminate-hadoop-cluster

47 Configure Your First Hadoop Cluster (on Gordon) Only necessary once per qsub gcn- 3-15:~$ myhadoop- configure.sh myhadoop: Keeping HADOOP_HOME=/home/diag/opt/hadoop/hadoop from user environment myhadoop: Setting MH_IPOIB_TRANSFORM='s/$/.ibnet0/' from myhadoop.conf myhadoop: Setting MH_SCRATCH_DIR=/scratch/$USER/$PBS_JOBID from myhadoop.conf myhadoop: Keeping HADOOP_CONF_DIR=/home/glock/mycluster from user environment myhadoop: Using HADOOP_HOME=/home/diag/opt/hadoop/hadoop myhadoop: Using MH_SCRATCH_DIR=/scratch/glock/ gordon- fe2.local myhadoop: Using JAVA_HOME=/usr/java/latest myhadoop: Generating Hadoop configuration in directory in /home/glock/mycluster... myhadoop: Designating gcn ibnet0 as master node myhadoop: The following nodes will be slaves: gcn ibnet0 gcn ibnet0... /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at gcn sdsc.edu/ ************************************************************/

48 myhadoop Interface between the batch system (qsub) and Hadoop. myhadoop-configure.sh is not a part of Hadoop: it gets node names from the batch environment, configures InfiniBand support, generates Hadoop config files, and formats the namenode so HDFS is ready to go.

49 Configuring Hadoop $HADOOP_CONF_DIR is the cluster: all of your cluster's global configurations, all of your cluster's nodes, all of your cluster's node configurations** echo $HADOOP_CONF_DIR to see your cluster's configuration location
$ cd $HADOOP_CONF_DIR
$ ls
capacity-scheduler.xml hadoop-policy.xml myhadoop.conf
configuration.xsl hdfs-site.xml slaves
core-site.xml log4j.properties ssl-client.xml.example
fair-scheduler.xml mapred-queue-acls.xml ssl-server.xml.example
hadoop-env.sh mapred-site.xml taskcontroller.cfg
hadoop-metrics2.properties masters

50 core-site.xml <property> <name>hadoop.tmp.dir</name> <value>/scratch/glock/ gordon- fe2.local/tmp</value> <description>a base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://gcn ibnet0:54310</value> <description>the name of the default file system. </description> </property>

51 core-site.xml <property> <name>io.file.buffer.size</name> <value>131072</value> <description>the size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations. </description> </property> More parameters and their default values:

52 hdfs-site.xml <property> <name>dfs.name.dir</name> <value>/scratch/glock/ gordon- fe2.local/namenode_data </value> <final>true</final> </property> <property> <name>dfs.data.dir</name> <value>/scratch/glock/ gordon- fe2.local/hdfs_data</value> <final>true</final> </property>

53 hdfs-site.xml <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.block.size</name> <value> </value> </property> More parameters and their default values:

54 mapred-site.xml <property> <name>mapred.job.tracker</name> <value>gcn ibnet0:54311</value> <description>the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in- process as a single map and reduce task. </description> </property> <property> <name>mapred.local.dir</name> <value>/scratch/glock/ gordon- fe2.local/mapred_scratch </value> <final>true</final> </property>

55 mapred-site.xml <property> <name>io.sort.mb</name> <value>650</value> <description>higher memory- limit while sorting data.</description> </property> <property> <name>mapred.map.tasks</name> <value>4</value> <description>the default number of map tasks per job.</description> </property> <property> <name>mapred.reduce.tasks</name> <value>4</value> <description>the default number of reduce tasks per job. </description> </property>

56 mapred-site.xml <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> <description>the maximum number of map tasks that will be run simultaneously by a task tracker.</description> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>8</value> <description>the maximum number of reduce tasks that will be run simultaneously by a task tracker.</description> </property> More parameters and their default values:

57 hadoop-env.sh Establishes environment variables for all Hadoop components Essentials: HADOOP_LOG_DIR location of Hadoop logs HADOOP_PID_DIR location of Hadoop PID files JAVA_HOME location of Java that Hadoop should use Other common additions LD_LIBRARY_PATH - for mappers/reducers HADOOP_CLASSPATH - for mappers/reducers _JAVA_OPTIONS - to hack global Java options

58 Hadoop Installation & Configuration HANDS-ON CONFIGURATION

59 masters / slaves Used by the start-all.sh and stop-all.sh Hadoop control scripts
$ cat masters
gcn ibnet0
$ cat slaves
gcn ibnet0
gcn ibnet0
masters - secondary namenodes
slaves - datanodes + tasktrackers
namenode and jobtracker are launched on localhost

60 Launch Your First Hadoop Cluster (on Gordon)
$ start-all.sh
starting namenode, logging to /scratch/glock/ gordon-fe2.local/logs/hadoop-gloc
gcn ibnet0: starting datanode, logging to /scratch/glock/ gordon-fe2.loca
gcn ibnet0: starting datanode, logging to /scratch/glock/ gordon-fe2.loca
gcn ibnet0: starting secondarynamenode, logging to /scratch/glock/ gordon
starting jobtracker, logging to /scratch/glock/ gordon-fe2.local/logs/hadoop-gl
gcn ibnet0: starting tasktracker, logging to /scratch/glock/ gordon-fe2.l
gcn ibnet0: starting tasktracker, logging to /scratch/glock/ gordon-fe2.l
Verify that namenode and datanodes came up:
$ hadoop dfsadmin -report
Configured Capacity: ( GB)
Present Capacity: ( GB)
DFS Remaining: ( GB)
DFS Used: (16.03 KB)
DFS Used%: 0%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
...

61 Try it Yourself: Change Replication
1. Shut down the cluster to change its configuration
$ stop-all.sh
gcn ibnet0: stopping tasktracker
gcn ibnet0: stopping tasktracker
stopping namenode
gcn ibnet0: stopping datanode
gcn ibnet0: stopping datanode
gcn ibnet0: stopping secondarynamenode
2. Change the dfs.replication setting in hdfs-site.xml by adding the following:
<property> <name>dfs.replication</name> <value>2</value> </property>
3. Restart (start-all.sh) and re-check

62 Try it Yourself: Change Replication Are the under-replicated blocks gone? What does this mean for actual runtime behavior?

63 Run Some Simple Commands Get a small set of data (a bunch of classic texts) and copy it into HDFS:
$ ls -lh gutenberg.txt
-rw-r--r-- 1 glock sdsc 407M Mar 16 14:09 gutenberg.txt
$ hadoop dfs -mkdir data
$ hadoop dfs -put gutenberg.txt data/
$ hadoop dfs -ls data
Found 1 items
-rw-r--r-- 3 glock supergroup :21 /user/glock/data/gutenberg.txt

64 Run Your First Map/Reduce Job
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount data wordcount-output
14/03/16 14:24:18 INFO input.FileInputFormat: Total input paths to process : 1
14/03/16 14:24:18 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/03/16 14:24:18 WARN snappy.LoadSnappy: Snappy native library not loaded
14/03/16 14:24:18 INFO mapred.JobClient: Running job: job_ _
14/03/16 14:24:19 INFO mapred.JobClient: map 0% reduce 0%
14/03/16 14:24:35 INFO mapred.JobClient: map 15% reduce 0%
...
14/03/16 14:25:23 INFO mapred.JobClient: map 100% reduce 28%
14/03/16 14:25:32 INFO mapred.JobClient: map 100% reduce 100%
14/03/16 14:25:37 INFO mapred.JobClient: Job complete: job_ _
14/03/16 14:25:37 INFO mapred.JobClient: Counters: 29
14/03/16 14:25:37 INFO mapred.JobClient: Job Counters
14/03/16 14:25:37 INFO mapred.JobClient: Launched reduce tasks=
14/03/16 14:25:37 INFO mapred.JobClient: Launched map tasks=7
14/03/16 14:25:37 INFO mapred.JobClient: Data-local map tasks=7

65 Running Hadoop Jobs
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount data wordcount-output
hadoop jar : launch a map/reduce job
$HADOOP_HOME/hadoop-examples-1.2.1.jar : map/reduce application jar
wordcount : command-line argument for the jarfile
data : location of input data (file or directory)
wordcount-output : location to dump job output

66 Check Job Output
$ hadoop dfs -ls wordcount-output
Found 3 items
... glock supergroup ... /user/glock/wordcount-output/_SUCCESS
... glock supergroup ... /user/glock/wordcount-output/_logs
... glock supergroup ... /user/glock/wordcount-output/part-r-00000
... glock supergroup ... /user/glock/wordcount-output/part-r-00001
_SUCCESS : if it exists, the job finished successfully
_logs : directory containing job history
part-r-00000 : output of reducer #0
part-r-00001 : output of reducer #1
part-r-0000n : output of nth reducer

67 Check Job Output
$ hadoop job -history wordcount-output
Hadoop job: 0010_ _glock
=====================================
...
Submitted At: 16-Mar :24:18
Launched At: 16-Mar :24:18 (0sec)
Finished At: 16-Mar :25:36 (1mins, 18sec)
...
Task Summary
============================
Kind Total Successful Failed Killed StartTime FinishTime
Setup Mar :24: Mar :24:23 (4sec)
Map Mar :24: Mar :25:14 (49sec)
Reduce Mar :24: Mar :25:29 (40sec)
Cleanup Mar :25: Mar :25:35 (4sec)
============================

68 Retrieve Job Output
$ hadoop dfs -get wordcount-output/part* .
$ ls -lrtph
...
-rw-r--r-- 1 glock sdsc 24M Mar 16 14:31 part-r-00000
$ hadoop dfs -rmr wordcount-output
Deleted hdfs://gcn ibnet0:54310/user/glock/wordcount-output
$ hadoop dfs -ls
Found 1 items
... glock supergroup ... /user/glock/data

69 Silly First Example 1. How many mappers were used? 2. How many reducers were used? 3. If Map and Reduce phases are real calculations, and Setup and Cleanup are "job overhead," what fraction of time was spent on overhead?

70 User-Level Configuration Changes Bigger HDFS block size; more reducers
$ hadoop dfs -Ddfs.block.size=134217728 -put ./gutenberg.txt data/gutenberg-128M.txt
$ hadoop fsck -blocks /user/glock/data/gutenberg-128M.txt
...
Total blocks (validated): 4 (avg. block size B)
...
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount -Dmapred.reduce.tasks=4 data/gutenberg-128M.txt output-128M
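A quick sanity check on the mapper counts, assuming one map task per HDFS block and the 407 MB gutenberg.txt shown earlier:

def n_blocks(size_mb, block_mb):
    """Ceiling division: how many blocks a file of size_mb needs."""
    return -(-size_mb // block_mb)

print(n_blocks(407, 64))    # 7  -> matches the 7 map tasks launched earlier
print(n_blocks(407, 128))   # 4  -> matches the 4 blocks reported by fsck above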

71 User-Level Configuration Changes 1. How many mappers were used before and after specifying the user-level tweaks? What caused this change? 2. How much speedup did these tweaks give? Any ideas why? 3. What happens if you use mapred.reduce.tasks=8 or =16? Any ideas why? 4. hadoop dfs -ls output-128M How many output files are there now? Why?

72 TestDFSIO Test to verify HDFS read/write performance. Generates its own input via the -write test. Make 8 files, each 1024 MB large:
$ hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 8 -fileSize 1024
14/03/16 16:17:20 INFO fs.TestDFSIO: nrFiles = 8
14/03/16 16:17:20 INFO fs.TestDFSIO: fileSize (MB) = 1024
14/03/16 16:17:20 INFO fs.TestDFSIO: bufferSize =
14/03/16 16:17:20 INFO fs.TestDFSIO: creating control file: 1024 mega bytes, 8 files
14/03/16 16:17:20 INFO fs.TestDFSIO: created control files for: 8 files
...
14/03/16 16:19:04 INFO fs.TestDFSIO: TestDFSIO : write
14/03/16 16:19:04 INFO fs.TestDFSIO: Date & time: Sun Mar 16 16:19:04 PDT 2014
14/03/16 16:19:04 INFO fs.TestDFSIO: Number of files: 8
14/03/16 16:19:04 INFO fs.TestDFSIO: Total MBytes processed:
14/03/16 16:19:04 INFO fs.TestDFSIO: Throughput mb/sec:
14/03/16 16:19:04 INFO fs.TestDFSIO: Average IO rate mb/sec:
14/03/16 16:19:04 INFO fs.TestDFSIO: IO rate std deviation:
14/03/16 16:19:04 INFO fs.TestDFSIO: Test exec time sec:

73 TestDFSIO Test the read performance: use -read instead of -write. Why such a big performance difference?
$ hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -read -nrFiles 8 -fileSize 1024
14/03/16 16:19:59 INFO fs.TestDFSIO: nrFiles = 8
14/03/16 16:19:59 INFO fs.TestDFSIO: fileSize (MB) = 1024
14/03/16 16:19:59 INFO fs.TestDFSIO: bufferSize =
14/03/16 16:19:59 INFO fs.TestDFSIO: creating control file: 1024 mega bytes, 8 files
14/03/16 16:20:00 INFO fs.TestDFSIO: created control files for: 8 files
...
14/03/16 16:20:40 INFO fs.TestDFSIO: TestDFSIO : read
14/03/16 16:20:40 INFO fs.TestDFSIO: Date & time: Sun Mar 16 16:20:40 PDT 2014
14/03/16 16:20:40 INFO fs.TestDFSIO: Number of files: 8
14/03/16 16:20:40 INFO fs.TestDFSIO: Total MBytes processed:
14/03/16 16:20:40 INFO fs.TestDFSIO: Throughput mb/sec:
14/03/16 16:20:40 INFO fs.TestDFSIO: Average IO rate mb/sec:
14/03/16 16:20:40 INFO fs.TestDFSIO: IO rate std deviation:
14/03/16 16:20:40 INFO fs.TestDFSIO: Test exec time sec:

74 HDFS Read/Write Performance 1. What part of the HDFS architecture might explain the difference in how long it takes to read 8x1 GB files vs. write 8x1 GB files? 2. What configuration have we covered that might alter this performance gap? 3. Adjust this parameter, restart the cluster, and re-try. How close can you get the read/write performance gap?

75 TeraSort Heavy-Duty Map/Reduce Job Standard benchmark to assess the performance of Hadoop's MapReduce and HDFS components.
teragen : generate a lot of random input data
terasort : sort teragen's output
teravalidate : verify that terasort's output is sorted

76 Running TeraGen
hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 100000000 terasort-input
teragen : launch the teragen application
100000000 : how many 100-byte records to generate (100,000,000 ~ 9.3 GB)
terasort-input : location to store output data to be fed into terasort

77 Running TeraSort
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 100000000 terasort-input
Generating using 2 maps with step of
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar terasort terasort-input terasort-output
14/03/16 16:43:00 INFO terasort.TeraSort: starting
14/03/16 16:43:00 INFO mapred.FileInputFormat: Total input paths to process : 2
...

78 TeraSort Default Behavior Default settings are suboptimal: teragen uses 2 maps, so the TeraSort input is split between 2 files; terasort uses 1 reducer, so the reduction is serial. Try optimizing TeraSort: double dfs.block.size to 134217728 (128MB); increase mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to increase concurrency; launch with more reducers (-Dmapred.reduce.tasks=XX).

79 Summary of Commands
### Request two nodes for four hours on Gordon
$ qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal
### Configure $HADOOP_CONF_DIR on Gordon
$ myhadoop-configure.sh
### Hadoop control script to start all nodes
$ start-all.sh
### Verify HDFS is online
$ hadoop dfsadmin -report
### Copy file to HDFS
$ hadoop dfs -put somelocalfile hdfsdir/
### View file information on HDFS
$ hadoop fsck -blocks hdfsdir/somelocalfile
### Run a map/reduce job
$ hadoop jar somejarfile.jar -option1 -option2
### View job info after it completes
$ hadoop job -history hdfsdir/outputdir
### Shut down all Hadoop nodes
$ stop-all.sh
### Copy logfiles back from nodes
$ myhadoop-cleanup.sh

80 Anagram Example Source: Uses Map-Reduce approach to process a file with a list of words, and identify all the anagrams in the file Code is written in Java. Example has already been compiled and the resulting jar file is in the example directory.

81 Anagram Map Class (Detail)
String word = value.toString();              // Convert the input value to a string
char[] wordChars = word.toCharArray();       // Assign it to a character array
Arrays.sort(wordChars);                      // Sort the array of characters
String sortedWord = new String(wordChars);   // Create a new string with the sorted characters
sortedText.set(sortedWord);
originalText.set(word);
outputCollector.collect(sortedText, originalText);
// Prepare and output the sorted text string (serves as the key) and the original text string.

82 Anagram Map Class (Detail) Consider file with list of words : alpha, hills, shill, truck Alphabetically sorted words : aahlp, hills, hills, ckrtu Hence after the Map Step is done, the following key pairs would be generated: (aahlp,alpha) (hills,hills) (hills,shill) (ckrtu,truck)

83 Anagram Example Reducer (Detail)
while (anagramValues.hasNext()) {
    Text anagram = anagramValues.next();
    output = output + anagram.toString() + "~";
}
// Iterate over all the values for a key; hasNext() is the Java iterator method that allows this.
// We are also creating an output string in which all the words are separated with ~.
StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
// The StringTokenizer class stores this delimited string and can count the number of tokens.
if (outputTokenizer.countTokens() >= 2) {
    output = output.replace("~", ",");
    outputKey.set(anagramKey.toString());
    outputValue.set(output);
    results.collect(outputKey, outputValue);
}
// We output the anagram key and the word list if the number of tokens is >= 2 (i.e. we have an anagram pair).

84 Anagram Reducer Class (Detail) For our example set, the input to the Reducers is: (aahlp,alpha) (hills,hills) (hills,shill) (ckrtu,truck) The only key with #tokens >=2 is <hills>. Hence, the Reducer output will be: hills hills, shill,

85 Anagram Example Submit Script
#!/bin/bash
#PBS -N Anagram
#PBS -q normal
#PBS -l nodes=2:ppn=16:native
#PBS -l walltime=01:00:00
#PBS -m e
#PBS -M [email protected]
#PBS -V
#PBS -v Catalina_maxhops=None
cd $PBS_O_WORKDIR
myhadoop-configure.sh
start-all.sh
hadoop dfs -mkdir input
hadoop dfs -copyFromLocal $PBS_O_WORKDIR/SINGLE.TXT input/
hadoop jar $PBS_O_WORKDIR/AnagramJob.jar input/SINGLE.TXT output
hadoop dfs -copyToLocal output/part* $PBS_O_WORKDIR
stop-all.sh
myhadoop-cleanup.sh

86 Anagram Example Sample Output
$ cat part-r-00000
aabcdelmnu manducable,ambulanced,
aabcdeorrsst broadcasters,rebroadcasts,
aabcdeorrst rebroadcast,broadcaster,
aabcdkrsw drawbacks,backwards,
aabcdkrw drawback,backward,
aabceeehlnsst teachableness,cheatableness,
aabceeelnnrsstu uncreatableness,untraceableness,
aabceeelrrt recreatable,retraceable,
aabceehlt cheatable,teachable,
aabceellr lacerable,clearable,
aabceelnrtu uncreatable,untraceable,
aabceelorrrstv vertebrosacral,sacrovertebral,

87 Hadoop Streaming Programming Hadoop without Java

88 Hadoop and Python Hadoop streaming w/ Python mappers/reducers: portable; most difficult (or least difficult) to use; you are the glue between Python and Hadoop. mrjob (or others: hadoopy, dumbo, etc.): comprehensive integration; a Python interface to Hadoop streaming; can interface directly with Amazon. Analogous interface libraries exist in R and Perl.
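As a hedged illustration of the mrjob route (mrjob would need to be installed separately, e.g. with pip; it is not part of the Gordon setup described in these slides), the whole word count fits in one small class:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # sum all the 1s emitted for this word
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()

Saved as, say, wordcount_mrjob.py (a hypothetical filename), it runs locally with "python wordcount_mrjob.py pg2701.txt" and against a Hadoop cluster with the -r hadoop runner.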

89 Wordcount Example

90 Hadoop Streaming with Python "Simplest" (most portable) method. Uses raw Python and Hadoop; you are the glue:
cat input.txt | mapper.py | sort | reducer.py > output.txt
You provide these two scripts; Hadoop does the rest. Generalizable to any language you want (Perl, R, etc.)

91 Spin Up a Cluster
$ idev
qsub: waiting for job gordon-fe2.local to start
qsub: job gordon-fe2.local ready
$ myhadoop-configure.sh
myhadoop: Keeping HADOOP_HOME=/home/diag/opt/hadoop/hadoop from user environment
myhadoop: Setting MH_IPOIB_TRANSFORM='s/$/.ibnet0/' from myhadoop.conf
...
$ start-all.sh
starting namenode, logging to /scratch/train18/ gordon-fe2.local/logs/hadoop-train18-namenode-gcn sdsc.edu.out
gcn ibnet0: starting datanode, logging to /scratch/train18/ gordon-fe2.local/logs/hadoop-train18-datanode-gcn sdsc.edu.out
$ cd PYTHON/streaming
$ ls
README mapper.py pg2701.txt pg996.txt reducer.py

92 What One Mapper Does
line = "Call me Ishmael. Some years ago--never mind how long"
keys = Call me Ishmael. Some years ago--never mind how long
emit.keyval(key, value)...
Call 1  me 1  Ishmael. 1  Some 1  years 1  ago--never 1  mind 1  how 1  long 1
... to the reducers

93 Wordcount: Hadoop streaming mapper
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( '%s\t%d' % (key, value) )

94 Reducer Loop Logic For each key/value pair...
If this key is the same as the previous key,
- add this key's value to our running total.
Otherwise,
- print out the previous key's name and the running total,
- reset our running total to 0,
- add this key's value to the running total, and
- "this key" is now considered the "previous key"

95 Wordcount: Streaming Reducer (1/2)
#!/usr/bin/env python
import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)
    (to be continued...)

96 Wordcount: Streaming Reducer (2/2)
    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )

97 Testing Mappers/Reducers Debugging Hadoop is not fun, so test the scripts at the command line first:
$ head -n100 pg2701.txt | ./mapper.py | sort | ./reducer.py
...
with 5
word, 1
world
you 3
The 1

98 Launching Hadoop Streaming
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=2 -mapper "$PWD/mapper.py" -reducer "$PWD/reducer.py" -input mobydick.txt -output output
.../hadoop-streaming-1.2.1.jar : Hadoop Streaming jarfile
-mapper $PWD/mapper.py : Mapper executable
-reducer $PWD/reducer.py : Reducer executable
-input mobydick.txt : location (on HDFS) of job input
-output output : location (on HDFS) of output dir

99 Tools/Applications w/ Hadoop Several open source projects utilize the Hadoop infrastructure. Examples: HIVE - a data warehouse infrastructure providing data summarization and querying. HBase - scalable distributed database. PIG - high-level data-flow and execution framework. ElephantDB - distributed database specializing in exporting key/value data from Hadoop. Some of these projects have been tried on Gordon by users.

100 Apache Mahout Apache project to implement scalable machine learning algorithms. Algorithms implemented: Collaborative Filtering User and Item based recommenders K-Means, Fuzzy K-Means clustering Mean Shift clustering Dirichlet process clustering Latent Dirichlet Allocation Singular value decomposition Parallel Frequent Pattern mining Complementary Naive Bayes classifier Random forest decision tree based classifier

101 Mahout Example Recommendation Mining Collaborative Filtering algorithms aim to solve the prediction problem where the task is to estimate the preference of a user towards an item which he/she has not yet seen. Item-based Collaborative Filtering estimates a user's preference towards an item by looking at his/her preferences towards similar items. Mahout has two Map/Reduce jobs to support item-based Collaborative Filtering: ItemSimilarityJob - computes similarity values based on input preference data. RecommenderJob - uses input preference data to determine recommended item ids and scores; can compute the top N recommendations for a user, for example. Collaborative Filtering is very popular (for example, used by Google and Amazon in their recommendations).

102 Mahout Recommendation Mining Once the preference data is collected, a user-item matrix is created. The task is to predict the missing entries based on the known data: estimate a user's preference for a particular item based on their preferences for similar items. [Example table: a small user-item matrix (Item 1, Item 2, Item 3, ... by User 1, User 2, ...) with known ratings such as 5 and 2 and a "?" marking the entry to predict.] More details and links to informative presentations at:
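The prediction step itself is easy to sketch with made-up numbers (the ratings and item-item similarities below are assumptions for illustration, not output from Mahout): predict a user's missing rating as a similarity-weighted average of that user's known ratings.

def predict(user_ratings, similarities):
    """user_ratings: {item: known rating}; similarities: {item: similarity to the target item}."""
    num = sum(similarities[item] * rating for item, rating in user_ratings.items())
    den = sum(abs(similarities[item]) for item in user_ratings)
    return num / den

known = {"Item 2": 2.0, "Item 3": 5.0}          # ratings the user has already given (assumed)
sims_to_item1 = {"Item 2": 0.3, "Item 3": 0.8}  # similarity of each rated item to Item 1 (assumed)
print(round(predict(known, sims_to_item1), 2))  # 4.18 -> Item 1 looks like a good recommendation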

103 Mahout Hands on Example Recommendation Job example: Mahout.cmd. qsub Mahout.cmd to submit the job. The script also illustrates the use of a custom hadoop configuration file: customized info is in the mapred-site.xml.stub file, which is copied into the configuration used.

104 Mahout Hands On Example The input file is a comma separated list of userids, itemids, and preference scores (test.data). As mentioned in an earlier slide, all users need not score all items. In this case there are 5 users and 4 items. The users.txt file provides the list of users for whom we need recommendations. The output from the Mahout recommender job is a file with userids and recommendations (with scores). The show_reco.py file takes that output and combines it with the user and item info from another input file (items.txt) to produce detailed recommendation info. This segment runs outside of the hadoop framework.
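A small helper in the same spirit as the pre/post-processing around the recommender (this is not the actual show_reco.py; it only relies on the comma-separated userid,itemid,score format described above) can list, per user, the items that have no rating yet, i.e. the entries the recommender is asked to fill in:

from collections import defaultdict

ratings = defaultdict(dict)                     # user -> {item: score}
all_items = set()
with open("test.data") as f:                    # "userid,itemid,score" per line
    for line in f:
        user, item, score = line.strip().split(",")
        ratings[user][item] = float(score)
        all_items.add(item)

for user in sorted(ratings):
    missing = sorted(all_items - set(ratings[user]))
    print("User %s rated %d items; unrated (candidates for recommendation): %s"
          % (user, len(ratings[user]), ", ".join(missing) or "none"))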

105 Mahout Sample Output
User ID : 2
Rated Items
Item 2, rating=2
Item 3, rating=5
Item 4, rating=
Recommended Items
Item 1, score=

106 PIG PIG - high-level programming on top of Hadoop map/reduce. SQL-like data flow operations. Essentially, keys and values are abstracted to fields and more complex data structures.

107 Hot off the press on Gordon! Hadoop2 (v2.2.0) installed and available via a module. Apache Spark installed and available via a module. Configured to run with Hadoop2. myhadoop extended to handle Hadoop2/YARN, and Spark.

108 Summary Hadoop framework extensively used for scalable distributed processing of large datasets. Several tools available that leverage the framework. HDFS provides scalable filesystem tailored for accelerating MapReduce performance. Hadoop Streaming can be used to write mapper/reduce functions in any language. Data warehouse (HIVE), scalable distributed database (HBASE), and execution framework (PIG) available. Mahout w/ Hadoop provides scalable machine learning tools.

109 Exercises: Log in to gordon PIG (Hands On)
> qsub -I -q normal -lnodes=2:ppn=1:native,walltime=2:00:00 -A ######
> cd Pig2
> ps axu | grep "p4rod"
> source Setup_pigtwt.cmd
> source Setup_Data4pig.cmd
> ps axu | grep "p4rod"
> pig
grunt>

110 PIG Hands On - Output > ls Cleanup_pigtwt.cmd Run_pigtwtsrc.qsub hadoophosts.txt pig_wds_grp_cnt.txt Doscript_pigtwt.cmd SetupData4pig.cmd pig_ log rawtw_dtmsg.csv GetDataFrompig.cmd Setup_pigtwt.cmd pig_out_mon_day_cnt.txt tw_script1.pig Setup_4twt gcn ibnet0: starting tasktracker, logging to /scratch/p4rodrig/ gordon-fe2.local/hadoop-p4rodrig/log/hadoop-p4rodrigtasktracker-gcn sdsc.edu.out Warning: $HADOOP_HOME is deprecated Set up Data :11:54,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.hexecutionengine - Connecting to hadoop file system at: hdfs://gcn ibnet0: :11:54,727 [main] INFO org.apache.pig.backend.hadoop.executionengine.hexecutionengine - Connecting to map-reduce job tracker at: gcn ibnet0:54311

111 PIG (Hands On)
# watch out for _=_ : keep spacing around the equal sign
raw = LOAD 'pigdata2load.txt' USING PigStorage(',') AS (mon_day,year,hour,msg,place,retwtd,uid,flers_cnt,fling_cnt);
tmp = LIMIT raw 5;
DUMP tmp;
describe raw
rawstr = LOAD 'pigdata2load.txt' USING PigStorage(',') AS (mon_day:chararray,year:int,hour:chararray,msg:chararray,place:chararray,retwted:chararray,uid:int,flers_cnt:int,fling_cnt:int);
describe rawstr

112 PIG Hands On (output) 1 dump tmp INFO org.apache.pig.backend.hadoop.executionengine.util.mapredutil - Total input paths to process : 1 (Apr 03,2012,09:11:56,havent you signe the petition yet hrd alkhawaja in his 55 day of hungerstrike well send it to obama httpstco8ww2sfzx bahrain,null,false, ,185,null ) (Apr 08,2012,07:02:08,rt azmoderate uc chuckgrassley pres obama has clearly demonstrated his intelligence you on the other hand look sound like you shou,null,false, ,904,null ) (Apr 12,2012,13:02:44,rt dcgretch 2 describe raw raw: {mon_day: bytearray,year: bytearray,hour: bytearray,msg: bytearray,place: bytearray,retwtd: bytearray,uid: bytearray,flers_cnt: bytearray,fling_cnt: bytearray} 3 describe rawstr rawstr: {mon_day: chararray,year: int,hour: chararray,msg: chararray,place: chararray,retwted: chararray,uid: int,flers_cnt: int,fling_cnt: int}

113 PIG (Hands On)
rawbymon_day = GROUP rawstr BY mon_day;
-- mon_day_cnt = FOREACH rawbymon_day GENERATE group,COUNT(*); -- returns an error about a cast
mon_day_cnt = FOREACH rawbymon_day GENERATE group,COUNT(rawstr);
tmpc = LIMIT mon_day_cnt 5;
DUMP tmpc;
STORE mon_day_cnt INTO 'pig_out_mon_day_cnt.txt';

114 PIG Hands On - output INFO org.apache.pig.tools.pigstats.simplepigstats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features p4rodrig :19: :20:35 GROUP_BY Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_ _ mon_day_cnt,rawbymon_day,rawstr GROUP_BY,COMBINER hdfs://gcn ibnet0:54310/user/ p4rodrig/pig_out_mon_day_cnt.txt, Input(s): Successfully read 109 records (13996 bytes) from: "hdfs://gcn ibnet0:54310/user/p4rodrig/ pigdata2load.txt"

115 PIG (Hands On)
raw_msg = FOREACH rawstr {mwds=TOKENIZE(msg); GENERATE $0,$1,$2,mwds;};
tmpd = LIMIT raw_msg 5;
DUMP tmpd;
raw_msgf = FOREACH rawstr {mwds=TOKENIZE(msg); GENERATE $0,$1,$2,FLATTEN(mwds);};
tmpd = LIMIT raw_msg 5;
DUMP tmpd;

116 PIG Hands On - output First dump (Apr 03,2012,09:11:56,{(havent),(you),(signe),(the),(petition),(yet),(hrd),(alkhawaja),(in),(his),(55),(day),(of), (hungerstrike),(well),(send),(it),(to),(obama),(httpstco8ww2sfzx),(bahrain)}) (Apr 08,2012,07:02:08,{(rt),(azmoderate),(uc),(chuckgrassley),(pres),(obama),(has),(clearly),(demonstrated), (his),(intelligence),(you),(on),(the),(other),(hand),(look),(sound),(like),(you),(shou)}) Second dump (Apr 03,2012,09:11:56,{(havent),(you),(signe),(the),(petition),(yet),(hrd),(alkhawaja),(in),(his),(55),(day), (of),(hungerstrike),(well),(send),(it),(to),(obama),(httpstco8ww2sfzx),(bahrain)}) (Apr 08,2012,07:02:08,{(rt),(azmoderate),(uc),(chuckgrassley),(pres),(obama),(has),(clearly), (demonstrated),(his),(intelligence),(you),(on),(the),(other),(hand),(look),(sound),(like),(you),(shou)})


Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved. Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework

More information

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela

Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - -

The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - - The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - - Hadoop Implementation on Riptide 2 Table of Contents Executive

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Apache Hadoop new way for the company to store and analyze big data

Apache Hadoop new way for the company to store and analyze big data Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File

More information

5 HDFS - Hadoop Distributed System

5 HDFS - Hadoop Distributed System 5 HDFS - Hadoop Distributed System 5.1 Definition and Remarks HDFS is a file system designed for storing very large files with streaming data access patterns running on clusters of commoditive hardware.

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015 CS242 PROJECT Presented by Moloud Shahbazi Spring 2015 AGENDA Project Overview Data Collection Indexing Big Data Processing PROJECT- PART1 1.1 Data Collection: 5G < data size < 10G Deliverables: Document

More information

Installation and Configuration Documentation

Installation and Configuration Documentation Installation and Configuration Documentation Release 1.0.1 Oshin Prem October 08, 2015 Contents 1 HADOOP INSTALLATION 3 1.1 SINGLE-NODE INSTALLATION................................... 3 1.2 MULTI-NODE

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

HDFS. Hadoop Distributed File System

HDFS. Hadoop Distributed File System HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files

More information

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

Introduction to HDFS. Prasanth Kothuri, CERN

Introduction to HDFS. Prasanth Kothuri, CERN Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

!#$%&' ( )%#*'+,'-#.//0( !#$%&'()*$+()',!-+.'/', 4(5,67,!-+!89,:*$;'0+$.<.,&0$'09,&)/=+,!()<>'0, 3, Processing LARGE data sets !"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.

More information

Extreme computing lab exercises Session one

Extreme computing lab exercises Session one Extreme computing lab exercises Session one Miles Osborne (original: Sasa Petrovic) October 23, 2012 1 Getting started First you need to access the machine where you will be doing all the work. Do this

More information

CSE-E5430 Scalable Cloud Computing. Lecture 4

CSE-E5430 Scalable Cloud Computing. Lecture 4 Lecture 4 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 5.10-2015 1/23 Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

A. Aiken & K. Olukotun PA3

A. Aiken & K. Olukotun PA3 Programming Assignment #3 Hadoop N-Gram Due Tue, Feb 18, 11:59PM In this programming assignment you will use Hadoop s implementation of MapReduce to search Wikipedia. This is not a course in search, so

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到

More information

THE HADOOP DISTRIBUTED FILE SYSTEM

THE HADOOP DISTRIBUTED FILE SYSTEM THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

More information

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System The Hadoop Distributed File System, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo, 2010 Agenda Topic 1: Introduction Topic 2: Architecture

More information

Sriram Krishnan, Ph.D. [email protected]

Sriram Krishnan, Ph.D. sriram@sdsc.edu Sriram Krishnan, Ph.D. [email protected] (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce

More information

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.

The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Lab 9: Hadoop Development The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Introduction Hadoop can be run in one of three modes: Standalone

More information

Single Node Hadoop Cluster Setup

Single Node Hadoop Cluster Setup Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps

More information

HADOOP MOCK TEST HADOOP MOCK TEST I

HADOOP MOCK TEST HADOOP MOCK TEST I http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email [email protected] if you have questions or need more clarifications. Nilay

More information

Hadoop Distributed File System. Dhruba Borthakur June, 2007

Hadoop Distributed File System. Dhruba Borthakur June, 2007 Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle

More information

Chase Wu New Jersey Ins0tute of Technology

Chase Wu New Jersey Ins0tute of Technology CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at

More information

Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

The Hadoop Distributed File System

The Hadoop Distributed File System The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS

More information

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Can High-Performance Interconnects Benefit Memcached and Hadoop? Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions

Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions Hadoop Lab - Setting a 3 node Cluster Packages Hadoop Packages can be downloaded from: http://hadoop.apache.org/releases.html Java - http://wiki.apache.org/hadoop/hadoopjavaversions Note: I have tested

More information

Data Science Analytics & Research Centre

Data Science Analytics & Research Centre Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information