Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer
1 Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer. Glenn Lockwood and Mahidhar Tatineni, User Services Group, San Diego Supercomputer Center. XSEDE14, Atlanta, July 14, 2014
2 Overview The Hadoop framework is extensively used for scalable distributed processing of large datasets. Hadoop is built to process data on the order of several hundred gigabytes to several terabytes (and even petabytes at the extreme end). Data sizes are much bigger than the capacities (both disk and memory) of individual nodes. Under the Hadoop Distributed Filesystem (HDFS), data is split into chunks which are managed by different nodes. Data chunks are replicated across several machines to provide redundancy in case of an individual node failure. Processing must conform to the MapReduce programming model. Processes are scheduled close to the location of the data chunks being accessed.
3 Hadoop: Application Areas Hadoop is widely used in data intensive analysis. Some application areas include: log aggregation and processing; video and image analysis; data mining and machine learning; indexing; recommendation systems. Data intensive scientific applications can also make use of the Hadoop MapReduce framework, in areas such as bioinformatics and computational biology, astronomical image processing, natural language processing, and geospatial data processing. Some example projects: genetic algorithms, particle swarm optimization, ant colony optimization; big data for business analytics (class); Hadoop for remote sensing analysis. Extensive list online at:
4 Hadoop Architecture [Diagram: a master node running the Jobtracker and Namenode, connected through the network/switching fabric to compute/data nodes, each running a Tasktracker.] Map/Reduce framework: software to enable distributed computation; the Jobtracker schedules and manages map/reduce tasks, and the Tasktrackers execute the tasks on the nodes. HDFS distributed filesystem: metadata is handled by the Namenode; files are split up and stored on datanodes (typically local disk); scalable and fault tolerant; replication is done asynchronously.
5 Hadoop Workflow
6 Map Reduce Basics The MapReduce algorithm is a scalable (thousands of nodes) approach to processing large volumes of data. The most important aspect is to restrict arbitrary sharing of data; this minimizes the communication overhead which typically constrains scalability. The MapReduce algorithm has two components, mapping and reducing. The first step takes each individual element and maps it to an output data element using some function. The second, reducing, step aggregates the values from step 1 using a reducer function. In the Hadoop framework both of these functions receive key/value pairs and output key/value pairs. A simple word count example illustrates this process in the following slides.
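To make the two phases concrete before looking at the Hadoop example, here is a minimal plain-shell analogue of the same wordcount data flow (this is ordinary Unix tooling, not Hadoop itself, and input.txt is a hypothetical local text file): the first awk command plays the role of the map (emit <word, 1>), sort plays the role of the shuffle, and the second awk plays the role of the reduce (sum per key).

    awk '{for (i = 1; i <= NF; i++) print $i "\t" 1}' input.txt \
        | sort \
        | awk -F'\t' '{count[$1] += $2} END {for (w in count) print w "\t" count[w]}'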
7 Simple Example From Apache Site* Simple wordcount example. Code details: Functions defined - Wordcount map class : reads file and isolates each word - Reduce class : counts words and sums up Call sequence - Mapper class - Combiner class (Reduce locally) - Reduce class - Output *
8 Simple Example From Apache Site* Simple wordcount example with two input files. File 1 contains: Hello World Bye World. File 2 contains: Hello Hadoop Goodbye Hadoop. Assume we use two map tasks (one for each file). Step 1: the map read/parse tasks complete. Result: Task 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>; Task 2: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1> *
9 Simple Example (Contd) Step 2: combine on each node, sorted: Task 1: <Bye, 1> <Hello, 1> <World, 2>; Task 2: <Goodbye, 1> <Hadoop, 2> <Hello, 1>. Step 3: global reduce: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
10 Map/Reduce Execution Process Components Input / Map () / Shuffle / Sort / Reduce () / Output Jobtracker determines number of splits (configurable). Jobtracker selects compute nodes for tasks based on network proximity to data sources. Tasktracker on each compute node manages the tasks assigned and reports back to jobtracker when task is complete. As map tasks complete jobtracker notifies selected task trackers for reduce phase. Job is completed once reduce phase is complete.
11 Hadoop MapReduce Data pipeline Ref: "Hadoop Tutorial from Yahoo!" by Yahoo! Inc. is licensed under a Creative Commons Attribution 3.0 Unported License
12 HDFS Architecture [Diagram: HDFS clients perform writes and reads; the NameNode holds the metadata (name, replicas), e.g. /home/user1/file1, 3 and /home/user1/file2, 3; blocks are stored on DataNodes and replicated across them.]
13 HDFS Overview HDFS is a block-structured filesystem. Files are split up into blocks of a fixed (configurable) size and stored across the cluster on several nodes. The metadata for the filesystem is stored on the NameNode. Typically the metadata info is cached in the NameNode's memory for fast access. The NameNode provides the list of locations (DataNodes) of the blocks for a file, and clients can read directly from those locations (without going through the NameNode). The HDFS namespace is completely separate from the local filesystems on the nodes. HDFS has its own tools to list, copy, move, and remove files.
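To see this layout for a particular file, the HDFS fsck tool (covered again later in this tutorial) can report the blocks and the DataNodes holding each replica; the path below is illustrative:

    hadoop fsck /user/mahidhar/test.file -files -blocks -locations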
14 HDFS Performance Envelope HDFS is optimized for large sequential streaming reads. The default block size is 64 MB (in comparison, typical block-structured filesystems use 4-8 KB, and Lustre uses ~1 MB). The default replication is 3. Typical HDFS applications have a write-once-read-many access model, which is taken into account to simplify the data coherency model used. The large block size also means poor random seek times and poor performance with small reads/writes. Additionally, given the large file sizes, there is no mechanism for local caching (it is faster to re-read the data from HDFS). Given these constraints, HDFS should not be used as a general-purpose distributed filesystem for a wide application base.
15 HDFS Configuration Configuration files are in the main hadoop config directory. The file core-site.xml or hdfs-site.xml can be used to change settings. The config directory can either be replicated across a cluster or placed in a shared filesystem (can be an issue if the cluster gets big). Configuration settings are a set of key-value pairs: <property> <name>property-name</name> <value>property-value</value> </property> Adding the line <final>true</final> prevents user override.
16 HDFS Configuration Parameters
fs.default.name - the address which describes the NameNode for the cluster, for example: hdfs://gcn-4-31.ibnet0:54310
dfs.data.dir - the path on the local filesystem in which the DataNode stores data; defaults to the hadoop tmp location
dfs.name.dir - where the NameNode metadata is stored; again defaults to the hadoop tmp location
dfs.replication - default replication for each block of data; default value is 3
dfs.blocksize - default block size; default value is 64 MB
Lots of other options; see the Apache site:
17 HDFS: Listing Files Reminder: local filesystem commands don't work on HDFS. A simple ls will just end up listing your local files; you need to use the HDFS commands. For example: hadoop dfs -ls / Warning: $HADOOP_HOME is deprecated. Found 1 items drwxr-xr-x - mahidhar supergroup :00 /scratch
18 Copying files into HDFS First make some directories in HDFS: hadoop dfs -mkdir /user hadoop dfs -mkdir /user/mahidhar Do a local listing (in this case we have a large 50+ GB file): ls -lt total -rw-r--r-- 1 mahidhar hpss Nov 6 14:48 test.file Copy it into HDFS: hadoop dfs -put test.file /user/mahidhar/ hadoop dfs -ls /user/mahidhar/ Warning: $HADOOP_HOME is deprecated. Found 1 items -rw-r--r-- 3 mahidhar supergroup :10 /user/mahidhar/test.file
19 DFS Command List
-ls path - lists contents of directory
-lsr path - recursive display of contents
-du path - shows disk usage in bytes
-dus path - summary of disk usage
-mv src dest - move files or directories within HDFS
-cp src dest - copy files or directories within HDFS
-rm path - removes the file or empty directory in HDFS
-rmr path - recursively removes file or directory
-put localsrc dest (also -copyFromLocal) - copy file from local filesystem into HDFS
-get src localdest - copy from HDFS to local filesystem
-getmerge src localdest - copy from HDFS and merge into a single local file
-cat filename - display contents of HDFS file
-tail file - shows the last 1 KB of an HDFS file on stdout
-chmod [-R] - change file permissions in HDFS
-chown [-R] - change ownership in HDFS
-chgrp [-R] - change group in HDFS
-help - returns usage info
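A few of these that are not demonstrated elsewhere in this tutorial, with illustrative paths:

    hadoop dfs -du /user/mahidhar                               # per-entry sizes in bytes
    hadoop dfs -tail /user/mahidhar/test.file                   # last 1 KB of an HDFS file
    hadoop dfs -getmerge /user/mahidhar/job-output merged.txt   # concatenate part files into one local file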
20 HDFS dfsadmin options
hadoop dfsadmin -report - generate status report for HDFS
hadoop dfsadmin -metasave filename - metadata info is saved to a file (cannot be used for restore)
hadoop dfsadmin -safemode options - move HDFS to read-only mode
hadoop dfsadmin -refreshNodes - change HDFS membership; useful if nodes are being removed from the cluster
Upgrade options - status/finalization
hadoop dfsadmin -help - help info
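For example, the report and safemode subcommands can be run directly from a node in the running cluster (safemode toggles the read-only state mentioned above):

    hadoop dfsadmin -report          # capacity, usage, and per-DataNode status
    hadoop dfsadmin -safemode get    # is HDFS currently read-only?
    hadoop dfsadmin -safemode enter  # put HDFS in read-only (safe) mode
    hadoop dfsadmin -safemode leave  # return to normal operation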
21 HDFS fsck hadoop fsck -blocks /user/mahidhar/test.file.4gb Warning: $HADOOP_HOME is deprecated. FSCK started by mahidhar from / for path /user/mahidhar/test.file.4gb at Wed Nov 06 19:32:30 PST 2013. Status: HEALTHY Total size: B Total dirs: 0 Total files: 1 Total blocks (validated): 64 (avg. block size B) Minimally replicated blocks: 64 (100.0 %) Over-replicated blocks: 0 (0.0 %) Under-replicated blocks: 0 (0.0 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor: 3 Average block replication: 3.0 Corrupt blocks: 0 Missing replicas: 0 (0.0 %) Number of data-nodes: 4 Number of racks: 1 FSCK ended at Wed Nov 06 19:32:30 PST 2013 in 36 milliseconds The filesystem under path '/user/mahidhar/test.file.4gb' is HEALTHY
22 HDFS Distributed Copy Useful to migrate large numbers of files. Syntax: hadoop distcp src dest. Sources/destinations can be: S3 URLs, HDFS URLs, or a filesystem path [useful if it's a parallel filesystem like Lustre].
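Two illustrative invocations (the hostnames, ports, and paths here are assumptions, not values from this tutorial): copying a directory tree between two HDFS instances, and staging data out of HDFS onto a locally mounted parallel filesystem through a file:// URL.

    hadoop distcp hdfs://namenode1:54310/user/mahidhar/data hdfs://namenode2:54310/user/mahidhar/data
    hadoop distcp hdfs:///user/mahidhar/results file:///oasis/scratch/mahidhar/results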
23 HDFS Architecture: Summary HDFS optimized for large block I/O. Replication(3) turned on by default. HDFS specific commands to manage files and directories. Local filesystem commands will not work in HDFS. HDFS parameters set in xml configuration files. DFS parameters can also be passed with commands. fsck function available for verification. Block rebalancing function available. Java based HDFS API can be used for access within applications. Library available for functionality in C/C++ applications.
24 Useful Links Yahoo tutorial on HDFS: Apache HDFS guide: Brad Hedlund's article on Hadoop architecture:
25 Hadoop and HPC PROBLEM: domain scientists/researchers aren't using Hadoop. Hadoop is commonly used in the data analytics industry but not so common in domain science academic areas. Java is not always high-performance, and there are hurdles for domain scientists in learning Java and the Hadoop tools. SOLUTION: make Hadoop easier for HPC users: use existing HPC clusters and software, use Perl/Python/C/C++/Fortran instead of Java, and make starting Hadoop as easy as possible.
26 Compute: Traditional vs. Data-Intensive Traditional HPC: CPU-bound problems; solution: OpenMP- and MPI-based parallelism. Data-intensive: IO-bound problems; solution: map/reduce-based parallelism.
27 Architecture for Both Workloads PROs: high-speed interconnect; complementary object storage; fast CPUs, RAM; less faulty. CONs: nodes aren't storage-rich; transferring data between HDFS and object storage* (* unless using Lustre, S3, etc. backends)
28 Add Data Analysis to Existing Compute Infrastructure
29 Add Data Analysis to Existing Compute Infrastructure
30 Add Data Analysis to Existing Compute Infrastructure
31 Add Data Analysis to Existing Compute Infrastructure
32 myhadoop 3-step Install (not reqd. for today's workshop) 1. Download Apache Hadoop 1.x and myhadoop 0.30 $ wget hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz $ wget myhadoop-0.30.tar.gz 2. Unpack both Hadoop and myhadoop $ tar zxvf hadoop-1.2.1-bin.tar.gz $ tar zxvf myhadoop-0.30.tar.gz 3. Apply the myhadoop patch to Hadoop $ cd hadoop-1.2.1/conf $ patch < ../myhadoop-0.30/myhadoop patch
33 myhadoop 3-step Cluster 1. Set a few environment variables # sets HADOOP_HOME, JAVA_HOME, and PATH $ module load hadoop $ export HADOOP_CONF_DIR=$HOME/mycluster.conf 2. Run myhadoop-configure.sh to set up Hadoop $ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID 3. Start the cluster with Hadoop's start-all.sh $ start-all.sh
34 Advanced Features - Usability System-wide default configurations in myhadoop-0.30/conf/myhadoop.conf: MH_SCRATCH_DIR specifies the location of node-local storage for all users; MH_IPOIB_TRANSFORM specifies a regex to transform node hostnames into IP-over-InfiniBand hostnames. Users can remain totally ignorant of scratch disks and InfiniBand: literally define HADOOP_CONF_DIR and run myhadoop-configure.sh with no parameters, and myhadoop figures out everything else.
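A minimal sketch of what those two myhadoop.conf settings might look like on Gordon, using the values that appear in the myhadoop-configure.sh output shown later in this tutorial:

    # myhadoop-0.30/conf/myhadoop.conf
    MH_SCRATCH_DIR=/scratch/$USER/$PBS_JOBID
    MH_IPOIB_TRANSFORM='s/$/.ibnet0/'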
35 Advanced Features - Usability Parallel filesystem support: HDFS on Lustre via myhadoop persistent mode (-p); direct Lustre support (IDH); no performance loss at smaller scales for HDFS on Lustre. Resource managers supported in a unified framework: Torque 2.x and 4.x (tested on SDSC Gordon), SLURM 2.6 (tested on TACC Stampede), Grid Engine. Can support LSF, PBSpro, Condor easily (need testbeds).
36 Hadoop-RDMA Network-Based Computing Lab (Ohio State University). HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE). Based on Apache Hadoop. Can use myhadoop to partially set up the configuration; need to add to some of the configuration files. Location on Gordon: /home/diag/opt/hadoop-rdma/hadoop-rdma More details:
37 Data-Intensive Performance Scaling 8-node Hadoop cluster on Gordon 8 GB VCF (Variant Calling Format) file 9x speedup using 2 mappers/node
38 Data-Intensive Performance Scaling 7.7 GB of chemical evolution data, 270 MB/s processing rate
39 Hadoop Installation & Configuration USING GORDON
40 Logging into Gordon Mac/Linux: ssh to gordon.sdsc.edu from a terminal. Windows: connect to gordon.sdsc.edu with PuTTY.
41 Example Locations Today's examples are located in the training account home directories: ~/XSEDE14 The following subdirectories should be visible: ls ~/XSEDE14 ANAGRAM GUTENBERG mahout Pig2 streaming
42 What is Gordon? 1024 compute nodes, 64 IO nodes; 16 cores of Intel Xeon E5 per node; 64 GB of DDR3 DRAM; 300 GB enterprise SSD; QDR 4X InfiniBand at 40 Gbps (cf. 10 Gb Ethernet), two rails, 3D torus topology; Torque batch system.
43 Request Nodes with Torque [Login banner: Rocks 6.1 (Emerald Boa), profile built Aug 2013, WELCOME TO GORDON ASCII art] gordon-ln1:~$ qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal qsub: waiting for job gordon-fe2.local to start qsub: job gordon-fe2.local ready gcn-3-15:~$
44 Request Nodes with Torque qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal -v Catalina_maxhops=None,QOS=0 -I request interactive access to your nodes; -l nodes=2 request two nodes; ppn=16 request 16 processors per node; native request native nodes (not vSMP VMs); flash request nodes with SSDs; walltime=4:00:00 reserve these nodes for four hours; -q normal not requesting any special queues. Use the idev alias so you don't have to memorize this!
45 Hadoop Installation & Configuration GENERAL INSTALLATION AND CONFIGURATION
46 Installing Hadoop on Linux 1. Download the Hadoop tarball from Apache ( $ wget hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz... Length: (36M) [application/x-gzip] Saving to: hadoop-1.2.1-bin.tar.gz.2 85% [================================> ] 32,631, M/s eta 2s 2. Unpack the tarball $ tar zxvf hadoop-1.2.1-bin.tar.gz hadoop-1.2.1/... hadoop-1.2.1/src/contrib/ec2/bin/list-hadoop-clusters hadoop-1.2.1/src/contrib/ec2/bin/terminate-hadoop-cluster
47 Configure Your First Hadoop Cluster (on Gordon) Only necessary once per qsub gcn- 3-15:~$ myhadoop- configure.sh myhadoop: Keeping HADOOP_HOME=/home/diag/opt/hadoop/hadoop from user environment myhadoop: Setting MH_IPOIB_TRANSFORM='s/$/.ibnet0/' from myhadoop.conf myhadoop: Setting MH_SCRATCH_DIR=/scratch/$USER/$PBS_JOBID from myhadoop.conf myhadoop: Keeping HADOOP_CONF_DIR=/home/glock/mycluster from user environment myhadoop: Using HADOOP_HOME=/home/diag/opt/hadoop/hadoop myhadoop: Using MH_SCRATCH_DIR=/scratch/glock/ gordon- fe2.local myhadoop: Using JAVA_HOME=/usr/java/latest myhadoop: Generating Hadoop configuration in directory in /home/glock/mycluster... myhadoop: Designating gcn ibnet0 as master node myhadoop: The following nodes will be slaves: gcn ibnet0 gcn ibnet0... /************************************************************ SHUTDOWN_MSG: Shutting down NameNode at gcn sdsc.edu/ ************************************************************/
48 myhadoop Interface between batch system (qsub) and Hadoop myhadoop- configure.sh is not a part of Hadoop gets node names from batch environment configures InfiniBand support generates Hadoop config files formats namenode so HDFS is ready to go
49 Configuring Hadoop $HADOOP_CONF_DIR is the cluster: all of your cluster's global configurations, all of your cluster's nodes, all of your cluster's node configurations** echo $HADOOP_CONF_DIR to see your cluster's configuration location cd $HADOOP_CONF_DIR $ ls capacity-scheduler.xml hadoop-policy.xml myhadoop.conf configuration.xsl hdfs-site.xml slaves core-site.xml log4j.properties ssl-client.xml.example fair-scheduler.xml mapred-queue-acls.xml ssl-server.xml.example hadoop-env.sh mapred-site.xml taskcontroller.cfg hadoop-metrics2.properties masters
50 core-site.xml <property> <name>hadoop.tmp.dir</name> <value>/scratch/glock/ gordon- fe2.local/tmp</value> <description>a base for other temporary directories.</description> </property> <property> <name>fs.default.name</name> <value>hdfs://gcn ibnet0:54310</value> <description>the name of the default file system. </description> </property>
51 core-site.xml <property> <name>io.file.buffer.size</name> <value>131072</value> <description>the size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations. </description> </property> More parameters and their default values:
52 hdfs-site.xml <property> <name>dfs.name.dir</name> <value>/scratch/glock/ gordon- fe2.local/namenode_data </value> <final>true</final> </property> <property> <name>dfs.data.dir</name> <value>/scratch/glock/ gordon- fe2.local/hdfs_data</value> <final>true</final> </property>
53 hdfs-site.xml <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.block.size</name> <value> </value> </property> More parameters and their default values:
54 mapred-site.xml <property> <name>mapred.job.tracker</name> <value>gcn ibnet0:54311</value> <description>the host and port that the MapReduce job tracker runs at. If "local", then jobs are run in- process as a single map and reduce task. </description> </property> <property> <name>mapred.local.dir</name> <value>/scratch/glock/ gordon- fe2.local/mapred_scratch </value> <final>true</final> </property>
55 mapred-site.xml <property> <name>io.sort.mb</name> <value>650</value> <description>higher memory- limit while sorting data.</description> </property> <property> <name>mapred.map.tasks</name> <value>4</value> <description>the default number of map tasks per job.</description> </property> <property> <name>mapred.reduce.tasks</name> <value>4</value> <description>the default number of reduce tasks per job. </description> </property>
56 mapred-site.xml <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> <description>the maximum number of map tasks that will be run simultaneously by a task tracker.</description> </property> <property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>8</value> <description>the maximum number of reduce tasks that will be run simultaneously by a task tracker.</description> </property> More parameters and their default values:
57 hadoop-env.sh Establishes environment variables for all Hadoop components Essentials: HADOOP_LOG_DIR location of Hadoop logs HADOOP_PID_DIR location of Hadoop PID files JAVA_HOME location of Java that Hadoop should use Other common additions LD_LIBRARY_PATH - for mappers/reducers HADOOP_CLASSPATH - for mappers/reducers _JAVA_OPTIONS - to hack global Java options
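A minimal sketch of the corresponding lines in hadoop-env.sh: the Java path matches the one reported by myhadoop-configure.sh earlier in this tutorial, while the log and PID locations under node-local scratch are assumptions for illustration.

    export JAVA_HOME=/usr/java/latest
    export HADOOP_LOG_DIR=/scratch/$USER/$PBS_JOBID/logs
    export HADOOP_PID_DIR=/scratch/$USER/$PBS_JOBID/pids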
58 Hadoop Installation & Configuration HANDS-ON CONFIGURATION
59 masters / slaves Used by the start-all.sh and stop-all.sh Hadoop control scripts. $ cat masters gcn ibnet0 $ cat slaves gcn ibnet0 gcn ibnet0 masters: secondary namenodes; slaves: datanodes + tasktrackers; the namenode and jobtracker are launched on localhost.
60 Launch Your First Hadoop Cluster (on Gordon) $ start- all.sh starting namenode, logging to /scratch/glock/ gordon- fe2.local/logs/hadoop- gloc gcn ibnet0: starting datanode, logging to /scratch/glock/ gordon- fe2.loca gcn ibnet0: starting datanode, logging to /scratch/glock/ gordon- fe2.loca gcn ibnet0: starting secondarynamenode, logging to /scratch/glock/ gordon starting jobtracker, logging to /scratch/glock/ gordon- fe2.local/logs/hadoop- gl gcn ibnet0: starting tasktracker, logging to /scratch/glock/ gordon- fe2.l gcn ibnet0: starting tasktracker, logging to /scratch/glock/ gordon- fe2.l Verify that namenode and datanodes came up: $ hadoop dfsadmin - report Configured Capacity: ( GB) Present Capacity: ( GB) DFS Remaining: ( GB) DFS Used: (16.03 KB) DFS Used%: 0% Under replicated blocks: 1 Blocks with corrupt replicas: 0...
61 Try it Yourself: Change Replication 1. Shut down cluster to change configuration $ stop- all.sh gcn ibnet0: stopping tasktracker gcn ibnet0: stopping tasktracker stopping namenode gcn ibnet0: stopping datanode gcn ibnet0: stopping datanode gcn ibnet0: stopping secondarynamenode 2. Change dfs.replication setting in hdfs- site.xml by adding the following: <property> <name>dfs.replication</name> <value>2</value> </property> 3. Restart (start- all.sh) and re-check
62 Try it Yourself: Change Replication Are the under-replicated blocks gone? What does this mean for actual runtime behavior?
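One way to check, re-using commands from earlier slides (run them after the cluster has restarted):

    hadoop dfsadmin -report | grep -i "under replicated"
    hadoop fsck / | grep -i replication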
63 Run Some Simple Commands Get a small set of data (a bunch of classic texts) and copy it into HDFS: $ ls -lh gutenberg.txt -rw-r--r-- 1 glock sdsc 407M Mar 16 14:09 gutenberg.txt $ hadoop dfs -mkdir data $ hadoop dfs -put gutenberg.txt data/ $ hadoop dfs -ls data Found 1 items -rw-r--r-- 3 glock supergroup :21 /user/glock/data/gutenberg.txt
64 Run Your First Map/Reduce Job $ hadoop jar $HADOOP_HOME/hadoop- examples jar wordcount data wordcount- output 14/03/16 14:24:18 INFO input.fileinputformat: Total input paths to process : 1 14/03/16 14:24:18 INFO util.nativecodeloader: Loaded the native- hadoop library 14/03/16 14:24:18 WARN snappy.loadsnappy: Snappy native library not loaded 14/03/16 14:24:18 INFO mapred.jobclient: Running job: job_ _ /03/16 14:24:19 INFO mapred.jobclient: map 0% reduce 0% 14/03/16 14:24:35 INFO mapred.jobclient: map 15% reduce 0%... 14/03/16 14:25:23 INFO mapred.jobclient: map 100% reduce 28% 14/03/16 14:25:32 INFO mapred.jobclient: map 100% reduce 100% 14/03/16 14:25:37 INFO mapred.jobclient: Job complete: job_ _ /03/16 14:25:37 INFO mapred.jobclient: Counters: 29 14/03/16 14:25:37 INFO mapred.jobclient: Job Counters 14/03/16 14:25:37 INFO mapred.jobclient: Launched reduce tasks= /03/16 14:25:37 INFO mapred.jobclient: Launched map tasks=7 14/03/16 14:25:37 INFO mapred.jobclient: Data- local map tasks=7
65 Running Hadoop Jobs hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount data wordcount-output hadoop jar - launch a map/reduce job; $HADOOP_HOME/hadoop-examples-1.2.1.jar - the map/reduce application jar; wordcount - command-line argument for the jarfile; data - location of input data (file or directory); wordcount-output - location to dump job output
66 Check Job Output $ hadoop dfs -ls wordcount-output Found 3 items... glock supergroup... /user/glock/wordcount-output/_SUCCESS... glock supergroup... /user/glock/wordcount-output/_logs... glock supergroup... /user/glock/wordcount-output/part-r... glock supergroup... /user/glock/wordcount-output/part-r _SUCCESS - if it exists, the job finished successfully; _logs - directory containing job history; part-r-00000 - output of reducer #0; part-r-00001 - output of reducer #1; part-r-0000n - output of the nth reducer
67 Check Job Output $ hadoop job -history wordcount-output Hadoop job: 0010_ _glock =====================================... Submitted At: 16-Mar :24:18 Launched At: 16-Mar :24:18 (0sec) Finished At: 16-Mar :25:36 (1mins, 18sec)... Task Summary ============================ Kind Total Successful Failed Killed StartTime FinishTime Setup Mar :24: Mar :24:23 (4sec) Map Mar :24: Mar :25:14 (49sec) Reduce Mar :24: Mar :25:29 (40sec) Cleanup Mar :25: Mar :25:35 (4sec) ============================
68 Retrieve Job Output $ hadoop dfs - get wordcount- output/part*. $ ls - lrtph... - rw- r- - r- - 1 glock sdsc 24M Mar 16 14:31 part- r $ hadoop dfs - rmr wordcount- output Deleted hdfs://gcn ibnet0:54310/user/glock/wordcount- output $ hadoop dfs - ls Found 1 items... glock supergroup... /user/glock/data
69 Silly First Example 1. How many mappers were used? 2. How many reducers were used? 3. If Map and Reduce phases are real calculations, and Setup and Cleanup are "job overhead," what fraction of time was spent on overhead?
70 User-Level Configuration Changes Bigger HDFS block size, more reducers: $ hadoop dfs -Ddfs.block.size=134217728 -put ./gutenberg.txt data/gutenberg-128M.txt $ hadoop fsck -blocks /user/glock/data/gutenberg-128M.txt... Total blocks (validated): 4 (avg. block size B)... $ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount -Dmapred.reduce.tasks=4 data/gutenberg-128M.txt output-128M
71 User-Level Configuration Changes 1. How many mappers were used before and after specifying the user-level tweaks? What caused this change? 2. How much speedup did these tweaks give? Any ideas why? 3. What happens if you use mapred.reduce.tasks=8 or =16? Any ideas why? 4. hadoop dfs - ls output- 128M How many output files are there now? Why?
72 TestDFSIO Test to verify HDFS read/write performance Generates its own input via - write test Make 8 files, each 1024 MB large: $ hadoop jar $HADOOP_HOME/hadoop- test jar TestDFSIO - write - nrfiles 8 - filesize /03/16 16:17:20 INFO fs.testdfsio: nrfiles = 8 14/03/16 16:17:20 INFO fs.testdfsio: filesize (MB) = /03/16 16:17:20 INFO fs.testdfsio: buffersize = /03/16 16:17:20 INFO fs.testdfsio: creating control file: 1024 mega bytes, 8 files 14/03/16 16:17:20 INFO fs.testdfsio: created control files for: 8 files... 14/03/16 16:19:04 INFO fs.testdfsio: TestDFSIO : write 14/03/16 16:19:04 INFO fs.testdfsio: Date & time: Sun Mar 16 16:19:04 PDT /03/16 16:19:04 INFO fs.testdfsio: Number of files: 8 14/03/16 16:19:04 INFO fs.testdfsio: Total MBytes processed: /03/16 16:19:04 INFO fs.testdfsio: Throughput mb/sec: /03/16 16:19:04 INFO fs.testdfsio: Average IO rate mb/sec: /03/16 16:19:04 INFO fs.testdfsio: IO rate std deviation: /03/16 16:19:04 INFO fs.testdfsio: Test exec time sec:
73 TestDFSIO Test the read performance use - read instead of - write: Why such a big performance difference? $ hadoop jar $HADOOP_HOME/hadoop- test jar TestDFSIO - read - nrfiles 8 - filesize /03/16 16:19:59 INFO fs.testdfsio: nrfiles = 8 14/03/16 16:19:59 INFO fs.testdfsio: filesize (MB) = /03/16 16:19:59 INFO fs.testdfsio: buffersize = /03/16 16:19:59 INFO fs.testdfsio: creating control file: 1024 mega bytes, 8 files 14/03/16 16:20:00 INFO fs.testdfsio: created control files for: 8 files... 14/03/16 16:20:40 INFO fs.testdfsio: TestDFSIO : read 14/03/16 16:20:40 INFO fs.testdfsio: Date & time: Sun Mar 16 16:20:40 PDT /03/16 16:20:40 INFO fs.testdfsio: Number of files: 8 14/03/16 16:20:40 INFO fs.testdfsio: Total MBytes processed: /03/16 16:20:40 INFO fs.testdfsio: Throughput mb/sec: /03/16 16:20:40 INFO fs.testdfsio: Average IO rate mb/sec: /03/16 16:20:40 INFO fs.testdfsio: IO rate std deviation: /03/16 16:20:40 INFO fs.testdfsio: Test exec time sec:
74 HDFS Read/Write Performance 1. What part of the HDFS architecture might explain the difference in how long it takes to read 8x1 GB files vs. write 8x1 GB files? 2. What configuration have we covered that might alter this performance gap? 3. Adjust this parameter, restart the cluster, and re-try. How close can you get the read/write performance gap?
75 TeraSort Heavy-Duty Map/Reduce Job Standard benchmark to assess performance of Hadoop's MapReduce and HDFS components teragen Generate a lot of random input data terasort Sort teragen's output teravalidate Verify that terasort's output is sorted
76 Running TeraGen hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 100000000 terasort-input teragen - launch the teragen application; 100000000 - how many 100-byte records to generate (100,000,000 ~ 9.3 GB); terasort-input - location to store the output data to be fed into terasort
77 Running TeraSort $ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 100000000 terasort-input Generating using 2 maps with step of $ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar terasort terasort-input terasort-output 14/03/16 16:43:00 INFO terasort.TeraSort: starting 14/03/16 16:43:00 INFO mapred.FileInputFormat: Total input paths to process : 2...
78 TeraSort Default Behavior Default settings are suboptimal: teragen uses 2 maps, so the TeraSort input is split between 2 files; terasort uses 1 reducer, so the reduction is serial. Try optimizing TeraSort: double dfs.block.size to 134217728 (128 MB); increase mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to increase concurrency; launch with more reducers (-Dmapred.reduce.tasks=XX).
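A sketch of what a tuned run might look like, assuming the Hadoop 1.2.1 examples jar used throughout this tutorial. The -D values are illustrative, 134217728 bytes = 128 MB, and mapred.map.tasks is an assumed way of asking teragen for more map tasks (it is not one of the parameters listed above):

    hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen \
        -Ddfs.block.size=134217728 -Dmapred.map.tasks=8 100000000 terasort-input
    hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar terasort \
        -Dmapred.reduce.tasks=8 terasort-input terasort-output
    hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teravalidate terasort-output terasort-report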
79 Summary of Commands
### Request two nodes for four hours on Gordon
$ qsub -I -l nodes=2:ppn=16:native:flash,walltime=4:00:00 -q normal
### Configure $HADOOP_CONF_DIR on Gordon
$ myhadoop-configure.sh
### Hadoop control script to start all nodes
$ start-all.sh
### Verify HDFS is online
$ hadoop dfsadmin -report
### Copy file to HDFS
$ hadoop dfs -put somelocalfile hdfsdir/
### View file information on HDFS
$ hadoop fsck -blocks hdfsdir/somelocalfile
### Run a map/reduce job
$ hadoop jar somejarfile.jar -option1 -option2
### View job info after it completes
$ hadoop job -history hdfsdir/outputdir
### Shut down all Hadoop nodes
$ stop-all.sh
### Copy logfiles back from nodes
$ myhadoop-cleanup.sh
80 Anagram Example Source: Uses Map-Reduce approach to process a file with a list of words, and identify all the anagrams in the file Code is written in Java. Example has already been compiled and the resulting jar file is in the example directory.
81 Anagram Map Class (Detail) String word = value.toString(); Convert the word to a string. char[] wordChars = word.toCharArray(); Assign to a character array. Arrays.sort(wordChars); Sort the array of characters. String sortedWord = new String(wordChars); Create a new string with the sorted characters. sortedText.set(sortedWord); originalText.set(word); outputCollector.collect(sortedText, originalText); Prepare and output the sorted text string (which serves as the key) and the original text string.
82 Anagram Map Class (Detail) Consider file with list of words : alpha, hills, shill, truck Alphabetically sorted words : aahlp, hills, hills, ckrtu Hence after the Map Step is done, the following key pairs would be generated: (aahlp,alpha) (hills,hills) (hills,shill) (ckrtu,truck)
83 Anagram Example Reducer (Detail) while(anagramValues.hasNext()) { Text anagram = anagramValues.next(); output = output + anagram.toString() + "~"; } Iterate over all the values for a key; hasNext() is a Java method that allows you to do this. We are also creating an output string with all the words separated by "~". StringTokenizer outputTokenizer = new StringTokenizer(output, "~"); The StringTokenizer class lets you store this delimited string and has functions to count the number of tokens. if(outputTokenizer.countTokens() >= 2) { output = output.replace("~", ","); outputKey.set(anagramKey.toString()); outputValue.set(output); results.collect(outputKey, outputValue); } We output the anagram key and the word list if the number of tokens is >= 2 (i.e. we have an anagram pair).
84 Anagram Reducer Class (Detail) For our example set, the input to the Reducers is: (aahlp,alpha) (hills,hills) (hills,shill) (ckrtu,truck) The only key with #tokens >=2 is <hills>. Hence, the Reducer output will be: hills hills, shill,
85 Anagram Example Submit Script
#!/bin/bash
#PBS -N Anagram
#PBS -q normal
#PBS -l nodes=2:ppn=16:native
#PBS -l walltime=01:00:00
#PBS -m e
#PBS -M [email protected]
#PBS -V
#PBS -v Catalina_maxhops=None
cd $PBS_O_WORKDIR
myhadoop-configure.sh
start-all.sh
hadoop dfs -mkdir input
hadoop dfs -copyFromLocal $PBS_O_WORKDIR/SINGLE.TXT input/
hadoop jar $PBS_O_WORKDIR/AnagramJob.jar input/SINGLE.TXT output
hadoop dfs -copyToLocal output/part* $PBS_O_WORKDIR
stop-all.sh
myhadoop-cleanup.sh
86 Anagram Example Sample Output cat part aabcdelmnu manducable,ambulanced, aabcdeorrsst broadcasters,rebroadcasts, aabcdeorrst rebroadcast,broadcaster, aabcdkrsw drawbacks,backwards, aabcdkrw drawback,backward, aabceeehlnsst teachableness,cheatableness, aabceeelnnrsstu uncreatableness,untraceableness, aabceeelrrt recreatable,retraceable, aabceehlt cheatable,teachable, aabceellr lacerable,clearable, aabceelnrtu uncreatable,untraceable, aabceelorrrstv vertebrosacral,sacrovertebral,
87 Hadoop Streaming Programming Hadoop without Java
88 Hadoop and Python Hadoop streaming w/ Python mappers/reducers: portable; most difficult (or least difficult) to use; you are the glue between Python and Hadoop. mrjob (or others: hadoopy, dumbo, etc.): comprehensive integration; a Python interface to Hadoop streaming; can interface directly with Amazon. Analogous interface libraries exist in R and Perl.
89 Wordcount Example
90 Hadoop Streaming with Python "Simplest" (most portable) method. Uses raw Python and Hadoop; you are the glue: cat input.txt | mapper.py | sort | reducer.py > output.txt You provide these two scripts; Hadoop does the rest. Generalizable to any language you want (Perl, R, etc.)
91 Spin Up a Cluster $ idev qsub: waiting for job gordon- fe2.local to start qsub: job gordon- fe2.local ready $ myhadoop- configure.sh myhadoop: Keeping HADOOP_HOME=/home/diag/opt/hadoop/hadoop from user environment myhadoop: Setting MH_IPOIB_TRANSFORM='s/$/.ibnet0/' from myhadoop.conf... $ start- all.sh starting namenode, logging to /scratch/train18/ gordon- fe2.local/logs/hadoop- train18- namenode- gcn sdsc.edu.out gcn ibnet0: starting datanode, logging to /scratch/train18/ gordon- fe2.local/logs/ hadoop- train18- datanode- gcn sdsc.edu.out $ cd PYTHON/streaming $ ls README mapper.py pg2701.txt pg996.txt reducer.py
92 What One Mapper Does line = "Call me Ishmael. Some years ago--never mind how long" keys = the line split on whitespace; for each key, emit (key, value) with value 1: <Call, 1> <me, 1> <Ishmael., 1> <Some, 1> <years, 1> <ago--never, 1> <mind, 1> <how, 1> <long, 1> ... sent to the reducers
93 Wordcount: Hadoop streaming mapper
#!/usr/bin/env python
import sys
for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( '%s\t%d' % (key, value) )
94 Reducer Loop Logic For each key/value pair... If this key is the same as the previous key, add this key's value to our running total. Otherwise: print out the previous key's name and the running total, reset our running total to 0, add this key's value to the running total, and "this key" is now considered the "previous key".
95 Wordcount: Streaming Reducer (1/2)
#!/usr/bin/env python
import sys
last_key = None
running_total = 0
for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)
(to be continued...)
96 Wordcount: Streaming Reducer (2/2)
    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )
97 Testing Mappers/Reducers Debugging Hadoop is not fun. $ head -n100 pg2701.txt | ./mapper.py | sort | ./reducer.py... with 5 word, 1 world you 3 The 1
98 Launching Hadoop Streaming hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar -D mapred.reduce.tasks=2 -mapper "$PWD/mapper.py" -reducer "$PWD/reducer.py" -input mobydick.txt -output output .../hadoop-streaming-1.2.1.jar - the Hadoop Streaming jarfile; -mapper $PWD/mapper.py - mapper executable; -reducer $PWD/reducer.py - reducer executable; -input mobydick.txt - location (on HDFS) of job input; -output output - location (on HDFS) of output dir
99 Tools/Applications w/ Hadoop Several open source projects utilize the Hadoop infrastructure. Examples: Hive - a data warehouse infrastructure providing data summarization and querying; HBase - scalable distributed database; Pig - high-level data-flow and execution framework; ElephantDB - distributed database specializing in exporting key/value data from Hadoop. Some of these projects have been tried on Gordon by users.
100 Apache Mahout Apache project to implement scalable machine learning algorithms. Algorithms implemented: Collaborative Filtering User and Item based recommenders K-Means, Fuzzy K-Means clustering Mean Shift clustering Dirichlet process clustering Latent Dirichlet Allocation Singular value decomposition Parallel Frequent Pattern mining Complementary Naive Bayes classifier Random forest decision tree based classifier
101 Mahout Example Recommendation Mining Collaborative Filtering algorithms aim to solve the prediction problem where the task is to estimate the preference of a user towards an item which he/she has not yet seen. Item-based Collaborative Filtering estimates a user's preference towards an item by looking at his/her preferences towards similar items. Mahout has two Map/Reduce jobs to support item-based Collaborative Filtering: ItemSimilarityJob computes similarity values based on input preference data; RecommenderJob uses input preference data to determine recommended item IDs and scores (it can compute, for example, the top N recommendations for a user). Collaborative Filtering is very popular (for example, used by Google and Amazon in their recommendations).
102 Mahout Recommendation Mining Once the preference data is collected, a user-item matrix is created. The task is to predict the missing entries based on the known data: estimate a user's preference for a particular item based on their preferences for similar items. Example: a user-item ratings matrix over Item 1, Item 2, and Item 3 for three users; User 2's row is 5, 2, ?, and the '?' (User 2's preference for Item 3) is the entry to predict. More details and links to informative presentations at:
103 Mahout Hands on Example Recommendation Job example: Mahout.cmd. qsub Mahout.cmd to submit the job. The script also illustrates the use of a custom hadoop configuration file: customized info in the mapred-site.xml.stub file is copied into the configuration used.
104 Mahout Hands On Example The input file is a comma separated list of user IDs, item IDs, and preference scores (test.data). As mentioned in an earlier slide, all users need not score all items. In this case there are 5 users and 4 items. The users.txt file provides the list of users for whom we need recommendations. The output from the Mahout recommender job is a file with user IDs and recommendations (with scores). The show_reco.py file takes that output and combines it with user and item info from another input file (items.txt) to produce detailed recommendation info. This segment runs outside of the hadoop framework.
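For reference, a typical command-line invocation of Mahout's item-based RecommenderJob looks roughly like the following. This is only a sketch: the exact arguments inside the workshop's Mahout.cmd may differ, the similarity metric and output path are assumptions, and it presumes the mahout driver script is on your PATH with test.data and users.txt already in HDFS.

    mahout recommenditembased \
        --input test.data \
        --output reco-out \
        --usersFile users.txt \
        --similarityClassname SIMILARITY_COOCCURRENCE \
        --numRecommendations 2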
105 Mahout Sample Output User ID : 2 Rated Items Item 2, rating=2 Item 3, rating=5 Item 4, rating= Recommended Items Item 1, score=
106 PIG PIG - high level programming on top of Hadoop map/reduce. SQL-like data flow operations. Essentially, key/values are abstracted to fields and more complex data structures.
107 Hot off the press on Gordon! Hadoop2 (v2.2.0) installed and available via a module. Apache Spark installed and available via a module. Configured to run with Hadoop2. myhadoop extended to handle Hadoop2/YARN, and Spark.
108 Summary Hadoop framework extensively used for scalable distributed processing of large datasets. Several tools available that leverage the framework. HDFS provides scalable filesystem tailored for accelerating MapReduce performance. Hadoop Streaming can be used to write mapper/reduce functions in any language. Data warehouse (HIVE), scalable distributed database (HBASE), and execution framework (PIG) available. Mahout w/ Hadoop provides scalable machine learning tools.
109 Exercises: Log in to gordon PIG (Hands On) > qsub -I -q normal -l nodes=2:ppn=1:native,walltime=2:00:00 -A ###### > cd Pig2 > ps axu | grep "p4rod" > source Setup_pigtwt.cmd > source Setup_Data4pig.cmd > ps axu | grep "p4rod" > pig grunt>
110 PIG Hands On - Output > ls Cleanup_pigtwt.cmd Run_pigtwtsrc.qsub hadoophosts.txt pig_wds_grp_cnt.txt Doscript_pigtwt.cmd SetupData4pig.cmd pig_ log rawtw_dtmsg.csv GetDataFrompig.cmd Setup_pigtwt.cmd pig_out_mon_day_cnt.txt tw_script1.pig Setup_4twt gcn ibnet0: starting tasktracker, logging to /scratch/p4rodrig/ gordon-fe2.local/hadoop-p4rodrig/log/hadoop-p4rodrigtasktracker-gcn sdsc.edu.out Warning: $HADOOP_HOME is deprecated Set up Data :11:54,217 [main] INFO org.apache.pig.backend.hadoop.executionengine.hexecutionengine - Connecting to hadoop file system at: hdfs://gcn ibnet0: :11:54,727 [main] INFO org.apache.pig.backend.hadoop.executionengine.hexecutionengine - Connecting to map-reduce job tracker at: gcn ibnet0:54311
111 PIG (Hands On)
# watch out for '=': keep spacing around the equal sign
raw = LOAD 'pigdata2load.txt' USING PigStorage(',') AS (mon_day,year,hour,msg,place,retwtd,uid,flers_cnt,fling_cnt);
tmp = LIMIT raw 5;
DUMP tmp;
describe raw
rawstr = LOAD 'pigdata2load.txt' USING PigStorage(',') AS (mon_day:chararray,year:int,hour:chararray,msg:chararray,place:chararray,retwted:chararray,uid:int,flers_cnt:int,fling_cnt:int);
describe rawstr
112 PIG Hands On (output) 1 dump tmp INFO org.apache.pig.backend.hadoop.executionengine.util.mapredutil - Total input paths to process : 1 (Apr 03,2012,09:11:56,havent you signe the petition yet hrd alkhawaja in his 55 day of hungerstrike well send it to obama httpstco8ww2sfzx bahrain,null,false, ,185,null ) (Apr 08,2012,07:02:08,rt azmoderate uc chuckgrassley pres obama has clearly demonstrated his intelligence you on the other hand look sound like you shou,null,false, ,904,null ) (Apr 12,2012,13:02:44,rt dcgretch 2 describe raw raw: {mon_day: bytearray,year: bytearray,hour: bytearray,msg: bytearray,place: bytearray,retwtd: bytearray,uid: bytearray,flers_cnt: bytearray,fling_cnt: bytearray} 3 describe rawstr rawstr: {mon_day: chararray,year: int,hour: chararray,msg: chararray,place: chararray,retwted: chararray,uid: int,flers_cnt: int,fling_cnt: int}
113 PIG (Hands On)
rawbymon_day = GROUP rawstr BY mon_day;
-- mon_day_cnt = FOREACH rawbymon_day GENERATE group,COUNT(*); -- returns an error about cast
mon_day_cnt = FOREACH rawbymon_day GENERATE group,COUNT(rawstr);
tmpc = LIMIT mon_day_cnt 5;
DUMP tmpc;
STORE mon_day_cnt INTO 'pig_out_mon_day_cnt.txt';
114 PIG Hands On - output INFO org.apache.pig.tools.pigstats.simplepigstats - Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features p4rodrig :19: :20:35 GROUP_BY Success! Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs job_ _ mon_day_cnt,rawbymon_day,rawstr GROUP_BY,COMBINER hdfs://gcn ibnet0:54310/user/ p4rodrig/pig_out_mon_day_cnt.txt, Input(s): Successfully read 109 records (13996 bytes) from: "hdfs://gcn ibnet0:54310/user/p4rodrig/ pigdata2load.txt"
115 PIG (Hands On)
raw_msg = FOREACH rawstr {mwds = TOKENIZE(msg); GENERATE $0,$1,$2,mwds;};
tmpd = LIMIT raw_msg 5;
DUMP tmpd;
raw_msgf = FOREACH rawstr {mwds = TOKENIZE(msg); GENERATE $0,$1,$2,FLATTEN(mwds);};
tmpd = LIMIT raw_msg 5;
DUMP tmpd;
116 PIG Hands On - output First dump (Apr 03,2012,09:11:56,{(havent),(you),(signe),(the),(petition),(yet),(hrd),(alkhawaja),(in),(his),(55),(day),(of), (hungerstrike),(well),(send),(it),(to),(obama),(httpstco8ww2sfzx),(bahrain)}) (Apr 08,2012,07:02:08,{(rt),(azmoderate),(uc),(chuckgrassley),(pres),(obama),(has),(clearly),(demonstrated), (his),(intelligence),(you),(on),(the),(other),(hand),(look),(sound),(like),(you),(shou)}) Second dump (Apr 03,2012,09:11:56,{(havent),(you),(signe),(the),(petition),(yet),(hrd),(alkhawaja),(in),(his),(55),(day), (of),(hungerstrike),(well),(send),(it),(to),(obama),(httpstco8ww2sfzx),(bahrain)}) (Apr 08,2012,07:02:08,{(rt),(azmoderate),(uc),(chuckgrassley),(pres),(obama),(has),(clearly), (demonstrated),(his),(intelligence),(you),(on),(the),(other),(hand),(look),(sound),(like),(you),(shou)})
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
Extreme computing lab exercises Session one
Extreme computing lab exercises Session one Miles Osborne (original: Sasa Petrovic) October 23, 2012 1 Getting started First you need to access the machine where you will be doing all the work. Do this
CSE-E5430 Scalable Cloud Computing. Lecture 4
Lecture 4 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 5.10-2015 1/23 Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
A. Aiken & K. Olukotun PA3
Programming Assignment #3 Hadoop N-Gram Due Tue, Feb 18, 11:59PM In this programming assignment you will use Hadoop s implementation of MapReduce to search Wikipedia. This is not a course in search, so
GraySort and MinuteSort at Yahoo on Hadoop 0.23
GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE
INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
The Hadoop Distributed File System
The Hadoop Distributed File System The Hadoop Distributed File System, Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, Yahoo, 2010 Agenda Topic 1: Introduction Topic 2: Architecture
Sriram Krishnan, Ph.D. [email protected]
Sriram Krishnan, Ph.D. [email protected] (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce
The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications.
Lab 9: Hadoop Development The objective of this lab is to learn how to set up an environment for running distributed Hadoop applications. Introduction Hadoop can be run in one of three modes: Standalone
Single Node Hadoop Cluster Setup
Single Node Hadoop Cluster Setup This document describes how to create Hadoop Single Node cluster in just 30 Minutes on Amazon EC2 cloud. You will learn following topics. Click Here to watch these steps
HADOOP MOCK TEST HADOOP MOCK TEST I
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,[email protected]
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2
USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email [email protected] if you have questions or need more clarifications. Nilay
Hadoop Distributed File System. Dhruba Borthakur June, 2007
Hadoop Distributed File System Dhruba Borthakur June, 2007 Goals of HDFS Very Large Distributed File System 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware Files are replicated to handle
Chase Wu New Jersey Ins0tute of Technology
CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
The Hadoop Distributed File System
The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu HDFS
Can High-Performance Interconnects Benefit Memcached and Hadoop?
Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,
Hadoop Lab - Setting a 3 node Cluster. http://hadoop.apache.org/releases.html. Java - http://wiki.apache.org/hadoop/hadoopjavaversions
Hadoop Lab - Setting a 3 node Cluster Packages Hadoop Packages can be downloaded from: http://hadoop.apache.org/releases.html Java - http://wiki.apache.org/hadoop/hadoopjavaversions Note: I have tested
Data Science Analytics & Research Centre
Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview
Intro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
