Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster"

Similar documents

Hadoop Deployment and Performance on Gordon Data Intensive Supercomputer!

Hadoop on the Gordon Data Intensive Cluster

Introduction to SDSC systems and data analytics software packages "

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Chapter 7. Using Hadoop Cluster and MapReduce

Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

MapReduce Evaluator: User Guide

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

How To Use Hadoop

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Getting Started with Hadoop with Amazon s Elastic MapReduce

Apache Hadoop new way for the company to store and analyze big data

Hadoop Architecture. Part 1

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

CSE-E5430 Scalable Cloud Computing Lecture 2

Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster. Dan Şerban

map/reduce connected components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

RDMA for Apache Hadoop User Guide

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Enabling High performance Big Data platform with RDMA

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

HiBench Introduction. Carson Wang Software & Services Group

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Hadoop (pseudo-distributed) installation and configuration

MapReduce. Tushar B. Kute,

Introduction to Running Hadoop on the High Performance Clusters at the Center for Computational Research

Jeffrey D. Ullman slides. MapReduce for data intensive computing

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - -

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster

Introduction to HDFS. Prasanth Kothuri, CERN

Hadoop. Bioinformatics Big Data

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Tutorial for Assignment 2.0

Sriram Krishnan, Ph.D.

Big Data With Hadoop

!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets

Hadoop/MapReduce Workshop. Dan Mazur, McGill HPC July 10, 2014

Parallel Options for R

How To Install Hadoop From Apa Hadoop To (Hadoop)

Hadoop/MapReduce Workshop

Open source Google-style large scale data analysis with Hadoop

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Apache Hadoop. Alexandru Costan

! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering

Understanding Hadoop Performance on Lustre

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

HADOOP MOCK TEST HADOOP MOCK TEST II

Deploying Hadoop with Manager

Hadoop Distributed File System. T Seminar On Multimedia Eero Kurkela

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

MapReduce, Hadoop and Amazon AWS

Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Applied Storage Performance For Big Analytics. PRESENTATION TITLE GOES HERE Hubbert Smith LSI

Accelerating and Simplifying Apache

Tutorial for Assignment 2.0

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

A very short Intro to Hadoop

Large scale processing using Hadoop. Ján Vaňo

Introduction to HDFS. Prasanth Kothuri, CERN

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Performance and Energy Efficiency of. Hadoop deployment models

Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.

THE HADOOP DISTRIBUTED FILE SYSTEM

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Distributed Filesystems

HadoopRDF : A Scalable RDF Data Analysis System

t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from

TP1: Getting Started with Hadoop

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Map Reduce & Hadoop Recommended Text:

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah

Extreme computing lab exercises Session one

HDFS. Hadoop Distributed File System

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

COURSE CONTENT Big Data and Hadoop Training

MapReduce with Apache Hadoop Analysing Big Data

Benchmarking Hadoop & HBase on Violin

Transcription:

Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster" Mahidhar Tatineni SDSC Summer Institute August 06, 2013

Overview "" Hadoop framework extensively used for scalable distributed processing of large datasets. Typical MapReduce jobs make use of locality of data, processing data on or near the storage assets to minimize data movement. Hadoop Distributed Filesystem (HDFS) also has built in redundancy and fault tolerance. Map Step: The master process divides problem into smaller subproblems, and distributes them to slave nodes as map tasks. Reduce Step: Processed data from slave node, i.e. output from the map tasks, serves as input to the reduce tasks to form the ultimate output.

Hadoop: Application Areas "" Hadoop is widely used in data intensive analysis. Some application areas include: Log aggregation and processing" Video and Image analysis" Data mining, Machine learning" Indexing" Recommendation systems" Data intensive scientific applications can make use of the Hadoop MapReduce framework. Application areas include: Bioinformatics and computational biology" Astronomical image processing" Natural Language Processing" Geospatial data processing" " Extensive list online at: http://wiki.apache.org/hadoop/poweredby"

Hadoop Architecture" Master Node Jobtracker" Namenode" Network/ Switching Compute/ Datanode" Compute/ Datanode" Compute/ Datanode" Compute/ Datanode" Tasktracker" Tasktracker" Tasktracker" Tasktracker" Map/Reduce Framework HDFS Distributed Filesystem Software to enable distributed computation. Jobtracker schedules and manages map/ reduce tasks. Tasktracker does the execution of tasks on the nodes. Metadata handled by the Namenode. Files are split up and stored on datanodes (typically local disk). Scalable and fault tolerance. Replication is done asynchronously.

Simple Example Apache Site*" Simple wordcount example. Code details: " Functions defined" "- Wordcount map class : reads file and isolates each word " "- Reduce class : counts words and sums up" " Call sequence" "- Mapper class" "- Combiner class (Reduce locally)" "- Reduce class" "- Output" *http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html

Simple Example Apache Site*" Simple wordcount example. Two input files. File 1 contains: Hello World Bye World File 2 contains: Hello Hadoop Goodbye Hadoop Assuming we use two map tasks (one for each file). Step 1: Map read/parse tasks are complete. Result: Task 1 <Hello, 1> <World, 1> <Bye, 1> <World, 1> *http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html Task 2 < Hello, 1> < Hadoop, 1> < Goodbye, 1> <Hadoop, 1>

Simple Example (Contd)" Step 2 : Combine on each node, sorted: Task 1 <Bye, 1> <Hello, 1> <World, 2> Task 2 < Goodbye, 1> < Hadoop, 2> <Hello, 1> Step 3 : Global reduce: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>

Benefits of Gordon Architecture for Hadoop" Gordon architecture presents potential performance benefits with high performance storage (SSDs) and high speed network (QDR IB). Storage options viable for Hadoop on Gordon SSD via iscsi Extensions for RDMA (iser)" Lustre filesystem (Data Oasis), persistent storage" Approach to running Hadoop within the scheduler infrastructure myhadoop (developed at SDSC).

Overview (Contd.)" Network options 1 GigE low performance" IPoIB Can make use of high speed infiniband network." Sample Results in current talk DFS Benchmark runs done on SSD based storage." TeraSort Scaling study with sizes 100GB-1.6TB, 4-128 nodes." Future/Ongoing Work for improving performance Hadoop-RDMA Network-Based Computation Lab, OSU" UDA - Unstructured Data Accelerator from Mellanox" http://www.mellanox.com/related-docs/applications/sb_hadoop.pdf" Lustre for Hadoop Can provide another persistent data option." "

Gordon Architecture"

Subrack Level Architecture" 16 compute nodes per subrack"

3D Torus of Switches" Linearly expandable Simple wiring pattern Short Cables- Fiber Optic cables generally not required Lower Cost :40% as many switches, 25% to 50% fewer cables Works well for localized communication Fault Tolerant within the mesh with 2QoS Alternate Routing Fault Tolerant with Dual-Rails for all routing algorithms 3 rd dimension wrap-around not shown for clarity"

Primary Storage Option for Hadoop: One SSD per Compute Node (only 4 of 16 compute nodes shown)" Lustre" Compute Node" Compute Node" Compute Node" Compute Node" SSD" SSD" SSD" SSD" The SSD drives on each I/O node are exported to the compute nodes via the iscsi Extensions for RDMA (iser). One 300 GB flash drive exported to each compute node appears as a local file system Lustre parallel file system is mounted identically on all nodes. This exported SSD storage can serve as the local storage used by HDFS. " Logical View File system appears as: /scratch/$user/$pbs_jobid

Hadoop Configuration Details"

myhadoop Integration with Gordon Scheduler" myhadoop* was developed at SDSC to help run Hadoop within the scope of the normal scheduler. Scripts available to use the node list from the scheduler and dynamically change Hadoop configuration files. mapred-site.xml" masters" slaves" core-site.xml" Users can make additional hadoop cluster changes if needed Nodes are exclusive to the job [does not affect the rest of the cluster]. *myhadoop link: http://sourceforge.net/projects/myhadoop/

Hands On Examples"" Examples are in the hadoop directory within the SI2013 directory: cd $HOME/SI2013/hadoop" Start with simple benchmark examples: TestDFS_2.cmd TestDFS example to benchmark HDFS performance." TeraSort_2.cmd Sorting performance benchmark."

TestDFS Example" PBS variables part: #/bin/bash #PBS -q normal #PBS -N hadoop_job #PBS -l nodes=2:ppn=1 #PBS -o hadoop_dfstest_2.out #PBS -e hadoop_dfstest_2.err #PBS -V

TestDFS example" Set up Hadoop environment variables: # Set this to location of myhadoop on gordon export MY_HADOOP_HOME="/opt/hadoop/contrib/myHadoop" # Set this to the location of Hadoop on gordon export HADOOP_HOME="/opt/hadoop" #### Set this to the directory where Hadoop configs should be generated # Don't change the name of this variable (HADOOP_CONF_DIR) as it is # required by Hadoop - all config files will be picked up from here # # Make sure that this is accessible to all nodes export HADOOP_CONF_DIR="/home/$USER/config"

TestDFS Example" #### Set up the configuration # Make sure number of nodes is the same as what you have requested from PBS # usage: $MY_HADOOP_HOME/bin/configure.sh -h echo "Set up the configurations for myhadoop" ### Create a hadoop hosts file, change to ibnet0 interfaces - DO NOT REMOVE - sed 's/$/.ibnet0/' $PBS_NODEFILE > $PBS_O_WORKDIR/hadoophosts.txt export PBS_NODEFILEZ=$PBS_O_WORKDIR/hadoophosts.txt ### Copy over configuration files $MY_HADOOP_HOME/bin/configure.sh -n 2 -c $HADOOP_CONF_DIR ### Point hadoop temporary files to local scratch - DO NOT REMOVE - sed -i 's@haddtemp@'$pbs_jobid'@g' $HADOOP_CONF_DIR/hadoopenv.sh

TestDFS Example" #### Format HDFS, if this is the first time or not a persistent instance echo "Format HDFS" $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR namenode -format echo sleep 1m #### Start the Hadoop cluster echo "Start all Hadoop daemons" $HADOOP_HOME/bin/start-all.sh #$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave echo

TestDFS Example" #### Run your jobs here echo "Run some test Hadoop jobs" $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadooptest-1.0.3.jar TestDFSIO -write -nrfiles 8 -filesize 1024 -buffersize 1048576 sleep 30s $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadooptest-1.0.3.jar TestDFSIO -read - nrfiles 8 -filesize 1024 -buffersize 1048576 echo #### Stop the Hadoop cluster echo "Stop all Hadoop daemons" $HADOOP_HOME/bin/stop-all.sh echo

Running the TestDFS example" Submit the job: qsub TestDFS_2.cmd " " Check the job is running (qstat) Once the job is running the hadoophosts.txt file is created. For example on a sample run: $ more hadoophosts.txt " gcn-13-11.ibnet0" gcn-13-12.ibnet0" "

Configuration Files During the Run" Lets check the configuration details (files in ~/ config) setup dynamically: $ more masters " gcn-13-11.ibnet0" $ more slaves " gcn-13-11.ibnet0" gcn-13-12.ibnet0" $ grep scratch hadoop-env.sh " export HADOOP_PID_DIR=/scratch/$USER/538309.gordon-fe2.local" export TMPDIR=/scratch/$USER/538309.gordon-fe2.local" export HADOOP_LOG_DIR=/scratch/mahidhar/538309.gordon-fe2.local/hadoopmahidhar/log" $ grep hdfs core-site.xml " <value>hdfs://gcn-13-11.ibnet0:54310</value>" "

Test Output" Check the PBS output and error files for the results. Sample snippets: 13/01/30 18:56:00 INFO fs.testdfsio: Number of files: 8" 13/01/30 18:56:00 INFO fs.testdfsio: Total MBytes processed: 8192" 13/01/30 18:56:00 INFO fs.testdfsio: Throughput mb/sec: 41.651837012782316" 13/01/30 18:56:00 INFO fs.testdfsio: Average IO rate mb/sec: 93.37788391113281" 13/01/30 18:56:00 INFO fs.testdfsio: IO rate std deviation: 58.587153756958564" 13/01/30 18:56:00 INFO fs.testdfsio: Test exec time sec: 99.554" 13/01/30 18:56:00 INFO fs.testdfsio: " "." 13/01/30 18:57:16 INFO fs.testdfsio: Number of files: 8" 13/01/30 18:57:16 INFO fs.testdfsio: Total MBytes processed: 8192" 13/01/30 18:57:16 INFO fs.testdfsio: Throughput mb/sec: 207.49746707193515" 13/01/30 18:57:16 INFO fs.testdfsio: Average IO rate mb/sec: 207.5077362060547" 13/01/30 18:57:16 INFO fs.testdfsio: IO rate std deviation: 1.4632078247537383" 13/01/30 18:57:16 INFO fs.testdfsio: Test exec time sec: 41.365" "

Hadoop on Gordon TestDFSIO Results" 180" 160" 140" Average I/O (MB/s) 120" 100" 80" 60" 40" 20" 0" 0" 1" 2" 3" 4" 5" 6" 7" 8" 9" File Size (GB) Ethernet - 1GigE" IPoIB" UDA" Results from the TestDFSIO benchmark. Runs were conducted using 4 datanodes and 2 map tasks. Average write I/O rates are presented.

Hadoop on Gordon TeraSort Results" (a) Results from the TeraSort benchmark. Data generated using teragen and is then sorted. Two sets or runs were performed using IPoIB, SSD storage (a) Fixed dataset size of 100GB, number of nodes varied from 4 to 32, (b) Increasing dataset size with number of nodes. (b)

Hadoop Parameters" Several parameters can be tuned at the system level, in general configuration (namenode and jobtracker server threads), HDFS configuration (blocksize), MapReduce configuration (number of tasks, input streams, buffer sizes etc). Good summary in: http://software.intel.com/sites/default/files/m/f/ 4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

Using Hadoop Framework" Hadoop framework developed in Java, with Java API available to developers. Map/Reduce applications can use Java or have several other options: C++ API using Hadoop Pipes" Python Using Hadoop Streaming, Dumbo, Pydoop" Hadoop Streaming Allows users to create and run map/reduce jobs with custom executables and scripts as map and reduce tasks."

Hadoop Streaming Write mapper and reducer functions in any language Hadoop passes records and key/values via stdin/ stdout pipes: cat input.txt mapper.py sort reducer.py > output.txt provide these two scripts; Hadoop does the rest Don t have to know Java--can adapt existing scripts, libraries

Hadoop Streaming Word Count

Word Count mapper.py #/usr/bin/env python import sys for line in sys.stdin: line = line.strip() keys = line.split() for key in keys: value = 1 print( '%s\t%d' % (key, value) )

Word Count - Reducer All key/value pairs are sorted before being presented to the reducers All key/value pairs sharing the same key are sent to the same reducer call 1 call 1 me 1 me 1 me 1 Ishmael 1 If this key is the same as the previous key, add this key's value to our running total. Otherwise, print out the previous key's name and the running total, reset our running total to 0, add this key's value to the running total, and "this key" is now considered the "previous key"

Word Count reducer.py (1/2) #/usr/bin/env python import sys last_key = None running_total = 0 for input_line in sys.stdin: input_line = input_line.strip() this_key, value = input_line.split("\t", 1) value = int(value)...

Word Count reducer.py (2/2) for input_line in sys.stdin:... if last_key == this_key: running_total += value else: if last_key: print( "%s\t%d" % (last_key, running_total) ) running_total = value last_key = this_key if last_key == this_key: print( "%s\t%d" % (last_key, running_total) ) /home/glock/tutorials/hpchadoop/wordcount.py/reducer.py

Word Count Input Data $ wget http://www.gutenberg.org/cache/epub/2701/ pg2701.txt (or just cp /home/glock/tutorials/hpchadoop/wordcount.py/ pg2701.txt) Test your mappers/reducers $ head n1000./mapper.py sort./reducer.py... young 4 your 16 yourself 3 zephyr 1

Word Count Job Launch # Move input data into HDFS $ hadoop dfs -mkdir wordcount $ hadoop dfs -copyfromlocal./pg2701.txt wordcount/ mobydick.txt # Job launch $ hadoop \ jar /opt/hadoop/contrib/streaming/hadoopstreaming-1.0.3.jar \ -mapper "python $PWD/mapper.py" \ -reducer "python $PWD/reducer.py" \ -input "wordcount/mobydick.txt" \ -output "wordcount/output"

Tools/Applications w/ Hadoop" Several open source projects utilizing Hadoop infrastructure. Examples: HIVE A data warehouse infrastructure providing data summarization and querying." Hbase Scalable distributed database" PIG High-level data-flow and execution framework" Elephantdb Distributed database specializing in exporting key/value data from Hadoop." Some these projects have been tried on Gordon by users.

Apache Mahout" Apache project to implement scalable machine learning algorithms. Algorithms implemented: Collaborative Filtering" User and Item based recommenders" K-Means, Fuzzy K-Means clustering" Mean Shift clustering" Dirichlet process clustering" Latent Dirichlet Allocation" Singular value decomposition" Parallel Frequent Pattern mining" Complementary Naive Bayes classifier" Random forest decision tree based classifier"

Mahout Example Recommendation Mining" Collaborative Filtering algorithms aim to solve the prediction problem where the task is to estimate the preference of a user towards an item which he/she has not yet seen." Itembased Collaborative Filtering estimates a user's preference towards an item by looking at his/her preferences towards similar items." Mahout has two Map/Reduce jobs to support Itembased Collaborative Filtering" ItemSimiliarityJob Computes similarity values based on input preference data" RecommenderJob - Uses input preference data to determine recommended itemids and scores. Can compute Top N recommendations for user for example." Collaborative Filtering is very popular (for example used by Google, Amazon in their recommendations) "

Mahout Recommendation Mining" Once the preference data is collected, a user-item-matrix is created. " Task is to predict missing entries based on the known data. Estimate a userʼs preference for a particular item based on their preferences to similar items." Example:" Item 1 Item 2 Item 3 User 1" 4" 3" 6" User 2" 5" 2"?" User 3" 3" 2" 1" More details and links to informative presentations at:" https://cwiki.apache.org/confluence/display/mahout/itembased+collaborative+filtering"

Mahout Hands on Example" Recommendation Job example : Mahout.cmd" qsub Mahout.cmd to submit the job. " Script also illustrates use of a custom hadoop configuration file. Customized info in mapredsite.xml.stub file which is copied into the configuration used."

Mahout Hands On Example" The input file is a comma separated list of userids, itemids, and preference scores (test.data). As mentioned in earlier slide, all users need not score all items. In this case there are 5 users and 4 items. The users.txt file provides the list of users for whom we need recommendations." The output from the Mahout recommender job is a file with userids and recommendations (with scores)." The show_reco.py file uses the output and combines with user, item info from another input file (items.txt) to produce details recommendation info. This segment runs outside of the hadoop framework."

Mahout Sample Output" User ID : 2 Rated Items ------------------------ Item 2, rating=2 Item 3, rating=5 Item 4, rating=3 ------------------------ Recommended Items ------------------------ Item 1, score=3.142857 ------------------------

Summary" Hadoop framework extensively used for scalable distributed processing of large datasets. Several tools available that leverage the framework. Hadoop can be dynamically configured and run on Gordon using myhadoop framework. High speed Infiniband network allows for significant gains in performance using SSDs. TeraSort benchmark shows good performance and scaling on Gordon. Hadoop Streaming can be used to write mapper/reduce functions in any language. Mahout w/ Hadoop provides scalable machine learning tools.