Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster"
|
|
|
- Asher Kerry Scott
- 10 years ago
- Views:
Transcription
1 Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster" Mahidhar Tatineni SDSC Summer Institute August 06, 2013
2 Overview "" Hadoop framework extensively used for scalable distributed processing of large datasets. Typical MapReduce jobs make use of locality of data, processing data on or near the storage assets to minimize data movement. Hadoop Distributed Filesystem (HDFS) also has built in redundancy and fault tolerance. Map Step: The master process divides problem into smaller subproblems, and distributes them to slave nodes as map tasks. Reduce Step: Processed data from slave node, i.e. output from the map tasks, serves as input to the reduce tasks to form the ultimate output.
3 Hadoop: Application Areas "" Hadoop is widely used in data intensive analysis. Some application areas include: Log aggregation and processing" Video and Image analysis" Data mining, Machine learning" Indexing" Recommendation systems" Data intensive scientific applications can make use of the Hadoop MapReduce framework. Application areas include: Bioinformatics and computational biology" Astronomical image processing" Natural Language Processing" Geospatial data processing" " Extensive list online at:
4 Hadoop Architecture" Master Node Jobtracker" Namenode" Network/ Switching Compute/ Datanode" Compute/ Datanode" Compute/ Datanode" Compute/ Datanode" Tasktracker" Tasktracker" Tasktracker" Tasktracker" Map/Reduce Framework HDFS Distributed Filesystem Software to enable distributed computation. Jobtracker schedules and manages map/ reduce tasks. Tasktracker does the execution of tasks on the nodes. Metadata handled by the Namenode. Files are split up and stored on datanodes (typically local disk). Scalable and fault tolerance. Replication is done asynchronously.
5 Simple Example Apache Site*" Simple wordcount example. Code details: " Functions defined" "- Wordcount map class : reads file and isolates each word " "- Reduce class : counts words and sums up" " Call sequence" "- Mapper class" "- Combiner class (Reduce locally)" "- Reduce class" "- Output" *
6 Simple Example Apache Site*" Simple wordcount example. Two input files. File 1 contains: Hello World Bye World File 2 contains: Hello Hadoop Goodbye Hadoop Assuming we use two map tasks (one for each file). Step 1: Map read/parse tasks are complete. Result: Task 1 <Hello, 1> <World, 1> <Bye, 1> <World, 1> * Task 2 < Hello, 1> < Hadoop, 1> < Goodbye, 1> <Hadoop, 1>
7 Simple Example (Contd)" Step 2 : Combine on each node, sorted: Task 1 <Bye, 1> <Hello, 1> <World, 2> Task 2 < Goodbye, 1> < Hadoop, 2> <Hello, 1> Step 3 : Global reduce: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
8 Benefits of Gordon Architecture for Hadoop" Gordon architecture presents potential performance benefits with high performance storage (SSDs) and high speed network (QDR IB). Storage options viable for Hadoop on Gordon SSD via iscsi Extensions for RDMA (iser)" Lustre filesystem (Data Oasis), persistent storage" Approach to running Hadoop within the scheduler infrastructure myhadoop (developed at SDSC).
9 Overview (Contd.)" Network options 1 GigE low performance" IPoIB Can make use of high speed infiniband network." Sample Results in current talk DFS Benchmark runs done on SSD based storage." TeraSort Scaling study with sizes 100GB-1.6TB, nodes." Future/Ongoing Work for improving performance Hadoop-RDMA Network-Based Computation Lab, OSU" UDA - Unstructured Data Accelerator from Mellanox" Lustre for Hadoop Can provide another persistent data option." "
10 Gordon Architecture"
11 Subrack Level Architecture" 16 compute nodes per subrack"
12 3D Torus of Switches" Linearly expandable Simple wiring pattern Short Cables- Fiber Optic cables generally not required Lower Cost :40% as many switches, 25% to 50% fewer cables Works well for localized communication Fault Tolerant within the mesh with 2QoS Alternate Routing Fault Tolerant with Dual-Rails for all routing algorithms 3 rd dimension wrap-around not shown for clarity"
13 Primary Storage Option for Hadoop: One SSD per Compute Node (only 4 of 16 compute nodes shown)" Lustre" Compute Node" Compute Node" Compute Node" Compute Node" SSD" SSD" SSD" SSD" The SSD drives on each I/O node are exported to the compute nodes via the iscsi Extensions for RDMA (iser). One 300 GB flash drive exported to each compute node appears as a local file system Lustre parallel file system is mounted identically on all nodes. This exported SSD storage can serve as the local storage used by HDFS. " Logical View File system appears as: /scratch/$user/$pbs_jobid
14 Hadoop Configuration Details"
15 myhadoop Integration with Gordon Scheduler" myhadoop* was developed at SDSC to help run Hadoop within the scope of the normal scheduler. Scripts available to use the node list from the scheduler and dynamically change Hadoop configuration files. mapred-site.xml" masters" slaves" core-site.xml" Users can make additional hadoop cluster changes if needed Nodes are exclusive to the job [does not affect the rest of the cluster]. *myhadoop link:
16 Hands On Examples"" Examples are in the hadoop directory within the SI2013 directory: cd $HOME/SI2013/hadoop" Start with simple benchmark examples: TestDFS_2.cmd TestDFS example to benchmark HDFS performance." TeraSort_2.cmd Sorting performance benchmark."
17 TestDFS Example" PBS variables part: #/bin/bash #PBS -q normal #PBS -N hadoop_job #PBS -l nodes=2:ppn=1 #PBS -o hadoop_dfstest_2.out #PBS -e hadoop_dfstest_2.err #PBS -V
18 TestDFS example" Set up Hadoop environment variables: # Set this to location of myhadoop on gordon export MY_HADOOP_HOME="/opt/hadoop/contrib/myHadoop" # Set this to the location of Hadoop on gordon export HADOOP_HOME="/opt/hadoop" #### Set this to the directory where Hadoop configs should be generated # Don't change the name of this variable (HADOOP_CONF_DIR) as it is # required by Hadoop - all config files will be picked up from here # # Make sure that this is accessible to all nodes export HADOOP_CONF_DIR="/home/$USER/config"
19 TestDFS Example" #### Set up the configuration # Make sure number of nodes is the same as what you have requested from PBS # usage: $MY_HADOOP_HOME/bin/configure.sh -h echo "Set up the configurations for myhadoop" ### Create a hadoop hosts file, change to ibnet0 interfaces - DO NOT REMOVE - sed 's/$/.ibnet0/' $PBS_NODEFILE > $PBS_O_WORKDIR/hadoophosts.txt export PBS_NODEFILEZ=$PBS_O_WORKDIR/hadoophosts.txt ### Copy over configuration files $MY_HADOOP_HOME/bin/configure.sh -n 2 -c $HADOOP_CONF_DIR ### Point hadoop temporary files to local scratch - DO NOT REMOVE - sed -i 's@haddtemp@'$pbs_jobid'@g' $HADOOP_CONF_DIR/hadoopenv.sh
20 TestDFS Example" #### Format HDFS, if this is the first time or not a persistent instance echo "Format HDFS" $HADOOP_HOME/bin/hadoop --config $HADOOP_CONF_DIR namenode -format echo sleep 1m #### Start the Hadoop cluster echo "Start all Hadoop daemons" $HADOOP_HOME/bin/start-all.sh #$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave echo
21 TestDFS Example" #### Run your jobs here echo "Run some test Hadoop jobs" $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadooptest jar TestDFSIO -write -nrfiles 8 -filesize buffersize sleep 30s $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadooptest jar TestDFSIO -read - nrfiles 8 -filesize buffersize echo #### Stop the Hadoop cluster echo "Stop all Hadoop daemons" $HADOOP_HOME/bin/stop-all.sh echo
22 Running the TestDFS example" Submit the job: qsub TestDFS_2.cmd " " Check the job is running (qstat) Once the job is running the hadoophosts.txt file is created. For example on a sample run: $ more hadoophosts.txt " gcn ibnet0" gcn ibnet0" "
23 Configuration Files During the Run" Lets check the configuration details (files in ~/ config) setup dynamically: $ more masters " gcn ibnet0" $ more slaves " gcn ibnet0" gcn ibnet0" $ grep scratch hadoop-env.sh " export HADOOP_PID_DIR=/scratch/$USER/ gordon-fe2.local" export TMPDIR=/scratch/$USER/ gordon-fe2.local" export HADOOP_LOG_DIR=/scratch/mahidhar/ gordon-fe2.local/hadoopmahidhar/log" $ grep hdfs core-site.xml " <value>hdfs://gcn ibnet0:54310</value>" "
24 Test Output" Check the PBS output and error files for the results. Sample snippets: 13/01/30 18:56:00 INFO fs.testdfsio: Number of files: 8" 13/01/30 18:56:00 INFO fs.testdfsio: Total MBytes processed: 8192" 13/01/30 18:56:00 INFO fs.testdfsio: Throughput mb/sec: " 13/01/30 18:56:00 INFO fs.testdfsio: Average IO rate mb/sec: " 13/01/30 18:56:00 INFO fs.testdfsio: IO rate std deviation: " 13/01/30 18:56:00 INFO fs.testdfsio: Test exec time sec: " 13/01/30 18:56:00 INFO fs.testdfsio: " "." 13/01/30 18:57:16 INFO fs.testdfsio: Number of files: 8" 13/01/30 18:57:16 INFO fs.testdfsio: Total MBytes processed: 8192" 13/01/30 18:57:16 INFO fs.testdfsio: Throughput mb/sec: " 13/01/30 18:57:16 INFO fs.testdfsio: Average IO rate mb/sec: " 13/01/30 18:57:16 INFO fs.testdfsio: IO rate std deviation: " 13/01/30 18:57:16 INFO fs.testdfsio: Test exec time sec: " "
25 Hadoop on Gordon TestDFSIO Results" 180" 160" 140" Average I/O (MB/s) 120" 100" 80" 60" 40" 20" 0" 0" 1" 2" 3" 4" 5" 6" 7" 8" 9" File Size (GB) Ethernet - 1GigE" IPoIB" UDA" Results from the TestDFSIO benchmark. Runs were conducted using 4 datanodes and 2 map tasks. Average write I/O rates are presented.
26 Hadoop on Gordon TeraSort Results" (a) Results from the TeraSort benchmark. Data generated using teragen and is then sorted. Two sets or runs were performed using IPoIB, SSD storage (a) Fixed dataset size of 100GB, number of nodes varied from 4 to 32, (b) Increasing dataset size with number of nodes. (b)
27 Hadoop Parameters" Several parameters can be tuned at the system level, in general configuration (namenode and jobtracker server threads), HDFS configuration (blocksize), MapReduce configuration (number of tasks, input streams, buffer sizes etc). Good summary in: 4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf
28 Using Hadoop Framework" Hadoop framework developed in Java, with Java API available to developers. Map/Reduce applications can use Java or have several other options: C++ API using Hadoop Pipes" Python Using Hadoop Streaming, Dumbo, Pydoop" Hadoop Streaming Allows users to create and run map/reduce jobs with custom executables and scripts as map and reduce tasks."
29 Hadoop Streaming Write mapper and reducer functions in any language Hadoop passes records and key/values via stdin/ stdout pipes: cat input.txt mapper.py sort reducer.py > output.txt provide these two scripts; Hadoop does the rest Don t have to know Java--can adapt existing scripts, libraries
30 Hadoop Streaming Word Count
31 Word Count mapper.py #/usr/bin/env python import sys for line in sys.stdin: line = line.strip() keys = line.split() for key in keys: value = 1 print( '%s\t%d' % (key, value) )
32 Word Count - Reducer All key/value pairs are sorted before being presented to the reducers All key/value pairs sharing the same key are sent to the same reducer call 1 call 1 me 1 me 1 me 1 Ishmael 1 If this key is the same as the previous key, add this key's value to our running total. Otherwise, print out the previous key's name and the running total, reset our running total to 0, add this key's value to the running total, and "this key" is now considered the "previous key"
33 Word Count reducer.py (1/2) #/usr/bin/env python import sys last_key = None running_total = 0 for input_line in sys.stdin: input_line = input_line.strip() this_key, value = input_line.split("\t", 1) value = int(value)...
34 Word Count reducer.py (2/2) for input_line in sys.stdin:... if last_key == this_key: running_total += value else: if last_key: print( "%s\t%d" % (last_key, running_total) ) running_total = value last_key = this_key if last_key == this_key: print( "%s\t%d" % (last_key, running_total) ) /home/glock/tutorials/hpchadoop/wordcount.py/reducer.py
35 Word Count Input Data $ wget pg2701.txt (or just cp /home/glock/tutorials/hpchadoop/wordcount.py/ pg2701.txt) Test your mappers/reducers $ head n1000./mapper.py sort./reducer.py... young 4 your 16 yourself 3 zephyr 1
36 Word Count Job Launch # Move input data into HDFS $ hadoop dfs -mkdir wordcount $ hadoop dfs -copyfromlocal./pg2701.txt wordcount/ mobydick.txt # Job launch $ hadoop \ jar /opt/hadoop/contrib/streaming/hadoopstreaming jar \ -mapper "python $PWD/mapper.py" \ -reducer "python $PWD/reducer.py" \ -input "wordcount/mobydick.txt" \ -output "wordcount/output"
37 Tools/Applications w/ Hadoop" Several open source projects utilizing Hadoop infrastructure. Examples: HIVE A data warehouse infrastructure providing data summarization and querying." Hbase Scalable distributed database" PIG High-level data-flow and execution framework" Elephantdb Distributed database specializing in exporting key/value data from Hadoop." Some these projects have been tried on Gordon by users.
38 Apache Mahout" Apache project to implement scalable machine learning algorithms. Algorithms implemented: Collaborative Filtering" User and Item based recommenders" K-Means, Fuzzy K-Means clustering" Mean Shift clustering" Dirichlet process clustering" Latent Dirichlet Allocation" Singular value decomposition" Parallel Frequent Pattern mining" Complementary Naive Bayes classifier" Random forest decision tree based classifier"
39 Mahout Example Recommendation Mining" Collaborative Filtering algorithms aim to solve the prediction problem where the task is to estimate the preference of a user towards an item which he/she has not yet seen." Itembased Collaborative Filtering estimates a user's preference towards an item by looking at his/her preferences towards similar items." Mahout has two Map/Reduce jobs to support Itembased Collaborative Filtering" ItemSimiliarityJob Computes similarity values based on input preference data" RecommenderJob - Uses input preference data to determine recommended itemids and scores. Can compute Top N recommendations for user for example." Collaborative Filtering is very popular (for example used by Google, Amazon in their recommendations) "
40 Mahout Recommendation Mining" Once the preference data is collected, a user-item-matrix is created. " Task is to predict missing entries based on the known data. Estimate a userʼs preference for a particular item based on their preferences to similar items." Example:" Item 1 Item 2 Item 3 User 1" 4" 3" 6" User 2" 5" 2"?" User 3" 3" 2" 1" More details and links to informative presentations at:"
41 Mahout Hands on Example" Recommendation Job example : Mahout.cmd" qsub Mahout.cmd to submit the job. " Script also illustrates use of a custom hadoop configuration file. Customized info in mapredsite.xml.stub file which is copied into the configuration used."
42 Mahout Hands On Example" The input file is a comma separated list of userids, itemids, and preference scores (test.data). As mentioned in earlier slide, all users need not score all items. In this case there are 5 users and 4 items. The users.txt file provides the list of users for whom we need recommendations." The output from the Mahout recommender job is a file with userids and recommendations (with scores)." The show_reco.py file uses the output and combines with user, item info from another input file (items.txt) to produce details recommendation info. This segment runs outside of the hadoop framework."
43 Mahout Sample Output" User ID : 2 Rated Items Item 2, rating=2 Item 3, rating=5 Item 4, rating= Recommended Items Item 1, score=
44 Summary" Hadoop framework extensively used for scalable distributed processing of large datasets. Several tools available that leverage the framework. Hadoop can be dynamically configured and run on Gordon using myhadoop framework. High speed Infiniband network allows for significant gains in performance using SSDs. TeraSort benchmark shows good performance and scaling on Gordon. Hadoop Streaming can be used to write mapper/reduce functions in any language. Mahout w/ Hadoop provides scalable machine learning tools.
Hadoop Deployment and Performance on Gordon Data Intensive Supercomputer!
Hadoop Deployment and Performance on Gordon Data Intensive Supercomputer! Mahidhar Tatineni, Rick Wagner, Eva Hocks, Christopher Irving, and Jerry Greenberg! SDSC! XSEDE13, July 22-25, 2013! Overview!!
Hadoop on the Gordon Data Intensive Cluster
Hadoop on the Gordon Data Intensive Cluster Amit Majumdar, Scientific Computing Applications Mahidhar Tatineni, HPC User Services San Diego Supercomputer Center University of California San Diego Dec 18,
Introduction to SDSC systems and data analytics software packages "
Introduction to SDSC systems and data analytics software packages " Mahidhar Tatineni ([email protected]) SDSC Summer Institute August 05, 2013 Getting Started" System Access Logging in Linux/Mac Use available
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer
Hadoop Based Data Analysis Tools on SDSC Gordon Supercomputer Glenn Lockwood and Mahidhar Tatineni User Services Group San Diego Supercomputer Center XSEDE14, Atlanta July 14, 2014 Overview Hadoop framework
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
MapReduce Evaluator: User Guide
University of A Coruña Computer Architecture Group MapReduce Evaluator: User Guide Authors: Jorge Veiga, Roberto R. Expósito, Guillermo L. Taboada and Juan Touriño December 9, 2014 Contents 1 Overview
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
How To Use Hadoop
Hadoop in Action Justin Quan March 15, 2011 Poll What s to come Overview of Hadoop for the uninitiated How does Hadoop work? How do I use Hadoop? How do I get started? Final Thoughts Key Take Aways Hadoop
Can High-Performance Interconnects Benefit Memcached and Hadoop?
Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,
Getting Started with Hadoop with Amazon s Elastic MapReduce
Getting Started with Hadoop with Amazon s Elastic MapReduce Scott Hendrickson [email protected] http://drskippy.net/projects/emr-hadoopmeetup.pdf Boulder/Denver Hadoop Meetup 8 July 2010 Scott Hendrickson
Apache Hadoop new way for the company to store and analyze big data
Apache Hadoop new way for the company to store and analyze big data Reyna Ulaque Software Engineer Agenda What is Big Data? What is Hadoop? Who uses Hadoop? Hadoop Architecture Hadoop Distributed File
Hadoop Architecture. Part 1
Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster. Dan Şerban
Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster Dan Şerban Agenda :: Introduction - Real-world uses of MapReduce - The origins of Hadoop - Hadoop facts and architecture :: Part
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea
What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea Overview Riding Google App Engine Taming Hadoop Summary Riding
RDMA for Apache Hadoop 0.9.9 User Guide
0.9.9 User Guide HIGH-PERFORMANCE BIG DATA TEAM http://hibd.cse.ohio-state.edu NETWORK-BASED COMPUTING LABORATORY DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING THE OHIO STATE UNIVERSITY Copyright (c)
Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013
Hadoop 101 Lars George NoSQL- Ma4ers, Cologne April 26, 2013 1 What s Ahead? Overview of Apache Hadoop (and related tools) What it is Why it s relevant How it works No prior experience needed Feel free
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012
Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1 Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume
Enabling High performance Big Data platform with RDMA
Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Introduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
HiBench Introduction. Carson Wang ([email protected]) Software & Services Group
HiBench Introduction Carson Wang ([email protected]) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
Hadoop (pseudo-distributed) installation and configuration
Hadoop (pseudo-distributed) installation and configuration 1. Operating systems. Linux-based systems are preferred, e.g., Ubuntu or Mac OS X. 2. Install Java. For Linux, you should download JDK 8 under
MapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
Introduction to Running Hadoop on the High Performance Clusters at the Center for Computational Research
Introduction to Running Hadoop on the High Performance Clusters at the Center for Computational Research Cynthia Cornelius Center for Computational Research University at Buffalo, SUNY 701 Ellicott St
Jeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - -
The Maui High Performance Computing Center Department of Defense Supercomputing Resource Center (MHPCC DSRC) Hadoop Implementation on Riptide - - Hadoop Implementation on Riptide 2 Table of Contents Executive
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?
Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster
Application and Micro-benchmark Performance using MVAPICH2-X on SDSC Gordon Cluster Mahidhar Tatineni ([email protected]) MVAPICH User Group Meeting August 27, 2014 NSF grants: OCI #0910847 Gordon: A Data
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. Hadoop
Hadoop. Bioinformatics Big Data
Hadoop Bioinformatics Big Data Paolo D Onorio De Meo Mattia D Antonio [email protected] [email protected] Big Data Too much information! Big Data Explosive data growth proliferation of data capture
Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.
Hadoop Source Alessandro Rezzani, Big Data - Architettura, tecnologie e metodi per l utilizzo di grandi basi di dati, Apogeo Education, ottobre 2013 wikipedia Hadoop Apache Hadoop is an open-source software
Tutorial for Assignment 2.0
Tutorial for Assignment 2.0 Florian Klien & Christian Körner IMPORTANT The presented information has been tested on the following operating systems Mac OS X 10.6 Ubuntu Linux The installation on Windows
Sriram Krishnan, Ph.D. [email protected]
Sriram Krishnan, Ph.D. [email protected] (Re-)Introduction to cloud computing Introduction to the MapReduce and Hadoop Distributed File System Programming model Examples of MapReduce Where/how to run MapReduce
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
Hadoop/MapReduce Workshop. Dan Mazur, McGill HPC [email protected] [email protected] July 10, 2014
Hadoop/MapReduce Workshop Dan Mazur, McGill HPC [email protected] [email protected] July 10, 2014 1 Outline Hadoop introduction and motivation Python review HDFS - The Hadoop Filesystem MapReduce
Parallel Options for R
Parallel Options for R Glenn K. Lockwood SDSC User Services [email protected] Motivation "I just ran an intensive R script [on the supercomputer]. It's not much faster than my own machine." Motivation "I
How To Install Hadoop 1.2.1.1 From Apa Hadoop 1.3.2 To 1.4.2 (Hadoop)
Contents Download and install Java JDK... 1 Download the Hadoop tar ball... 1 Update $HOME/.bashrc... 3 Configuration of Hadoop in Pseudo Distributed Mode... 4 Format the newly created cluster to create
Hadoop/MapReduce Workshop
Hadoop/MapReduce Workshop Dan Mazur [email protected] Simon Nderitu [email protected] [email protected] August 14, 2015 1 Outline Hadoop introduction and motivation Python review HDFS
Open source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics
Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)
Apache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering
E6893 Big Data Analytics: Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering Aonan Zhang Dept. of Electrical Engineering 1 October 9th, 2014 Mahout Brief Review The Apache
Understanding Hadoop Performance on Lustre
Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected]
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] [email protected] Hadoop, Why? Need to process huge datasets on large clusters of computers
HADOOP MOCK TEST HADOOP MOCK TEST II
http://www.tutorialspoint.com HADOOP MOCK TEST Copyright tutorialspoint.com This section presents you various set of Mock Tests related to Hadoop Framework. You can download these sample mock tests at
Deploying Hadoop with Manager
Deploying Hadoop with Manager SUSE Big Data Made Easier Peter Linnell / Sales Engineer [email protected] Alejandro Bonilla / Sales Engineer [email protected] 2 Hadoop Core Components 3 Typical Hadoop Distribution
Hadoop Distributed File System. T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela
Hadoop Distributed File System T-111.5550 Seminar On Multimedia 2009-11-11 Eero Kurkela Agenda Introduction Flesh and bones of HDFS Architecture Accessing data Data replication strategy Fault tolerance
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
MapReduce, Hadoop and Amazon AWS
MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables
Research Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze
Research Laboratory Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze 1. Java Web Crawler Description Java Code 2. MapReduce Overview Example of mapreduce
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Applied Storage Performance For Big Analytics. PRESENTATION TITLE GOES HERE Hubbert Smith LSI
Applied Storage Performance For Big Analytics PRESENTATION TITLE GOES HERE Hubbert Smith LSI It s NOT THIS SIMPLE!!! 2 Theoretical vs Real World Theoretical & Lab Storage Workloads I/O I/O I/O I/O I/O
Accelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
Tutorial for Assignment 2.0
Tutorial for Assignment 2.0 Web Science and Web Technology Summer 2012 Slides based on last years tutorials by Chris Körner, Philipp Singer 1 Review and Motivation Agenda Assignment Information Introduction
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee [email protected] June 3 rd, 2008 Who Am I? Hadoop Developer Core contributor since Hadoop s infancy Focussed
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
Large scale processing using Hadoop. Ján Vaňo
Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine
Introduction to HDFS. Prasanth Kothuri, CERN
Prasanth Kothuri, CERN 2 What s HDFS HDFS is a distributed file system that is fault tolerant, scalable and extremely easy to expand. HDFS is the primary distributed storage for Hadoop applications. HDFS
Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook
Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms
Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,
Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.
EDUREKA Apache Hadoop 2.0 Installation and Single Node Cluster Configuration on Ubuntu A guide to install and setup Single-Node Apache Hadoop 2.0 Cluster edureka! 11/12/2013 A guide to Install and Configure
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment
CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment James Devine December 15, 2008 Abstract Mapreduce has been a very successful computational technique that has
Performance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
Data Analytics. CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL. All rights reserved.
Data Analytics CloudSuite1.0 Benchmark Suite Copyright (c) 2011, Parallel Systems Architecture Lab, EPFL All rights reserved. The data analytics benchmark relies on using the Hadoop MapReduce framework
THE HADOOP DISTRIBUTED FILE SYSTEM
THE HADOOP DISTRIBUTED FILE SYSTEM Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Presented by Alexander Pokluda October 7, 2013 Outline Motivation and Overview of Hadoop Architecture,
From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian
From Relational to Hadoop Part 1: Introduction to Hadoop Gwen Shapira, Cloudera and Danil Zburivsky, Pythian Tutorial Logistics 2 Got VM? 3 Grab a USB USB contains: Cloudera QuickStart VM Slides Exercises
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
Distributed Filesystems
Distributed Filesystems Amir H. Payberah Swedish Institute of Computer Science [email protected] April 8, 2014 Amir H. Payberah (SICS) Distributed Filesystems April 8, 2014 1 / 32 What is Filesystem? Controls
HadoopRDF : A Scalable RDF Data Analysis System
HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn
t] open source Hadoop Beginner's Guide ij$ data avalanche Garry Turkington Learn how to crunch big data to extract meaning from
Hadoop Beginner's Guide Learn how to crunch big data to extract meaning from data avalanche Garry Turkington [ PUBLISHING t] open source I I community experience distilled ftu\ ij$ BIRMINGHAMMUMBAI ')
TP1: Getting Started with Hadoop
TP1: Getting Started with Hadoop Alexandru Costan MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development of web
Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses
Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and
Map Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon [email protected] [email protected] XLDB
Pro Apache Hadoop. Second Edition. Sameer Wadkar. Madhu Siddalingaiah
Pro Apache Hadoop Second Edition Sameer Wadkar Madhu Siddalingaiah Contents J About the Authors About the Technical Reviewer Acknowledgments Introduction xix xxi xxiii xxv Chapter 1: Motivation for Big
Extreme computing lab exercises Session one
Extreme computing lab exercises Session one Michail Basios ([email protected]) Stratis Viglas ([email protected]) 1 Getting started First you need to access the machine where you will be doing all
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. [email protected] (CRS4) Luca Pireddu March 9, 2012 1 / 18
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 [email protected] (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
MapReduce with Apache Hadoop Analysing Big Data
MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside [email protected] About Journey Dynamics Founded in 2006 to develop software technology to address the issues
Benchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
