Programming Hadoop Map-Reduce Programming, Tuning & Debugging. Arun C Murthy Yahoo! CCDI [email protected] ApacheCon US 2008
|
|
|
- Kenneth Barrett
- 10 years ago
- Views:
Transcription
1 Programming Hadoop Map-Reduce Programming, Tuning & Debugging Arun C Murthy Yahoo! CCDI [email protected] ApacheCon US 2008
2 Existential angst: Who am I? Yahoo! Grid Team (CCDI) Apache Hadoop Developer since April 2006 Core Committer (Map-Reduce) Member of the Hadoop PMC
3 Hadoop - Overview Hadoop includes: Distributed File System - distributes data Map/Reduce - distributes application Open source from Apache Written in Java Runs on Linux, Mac OS/X, Windows, and Solaris Commodity hardware
4 Distributed File System Designed to store large files Stores files as large blocks (64 to 128 MB) Each block stored on multiple servers Data is automatically re-replicated on need Accessed from command line, Java API, or C API bin/hadoop fs -put my-file hdfs://node1:50070/foo/bar Path p = new Path( hdfs://node1:50070/foo/bar ); FileSystem fs = p.getfilesystem(conf); DataOutputStream file = fs.create(p); file.writeutf( hello\n ); file.close();
5 Map-Reduce Map-Reduce is a programming model for efficient distributed computing It works like a Unix pipeline: cat input grep sort unique -c cat > output Input Map Shuffle & Sort Reduce Output Efficiency from Streaming through data, reducing seeks Pipelining A good fit for a lot of applications Log processing Web index building
6 Map/Reduce features Fine grained Map and Reduce tasks Improved load balancing Faster recovery from failed tasks Automatic re-execution on failure In a large cluster, some nodes are always slow or flaky Introduces long tails or failures in computation Framework re-executes failed tasks Locality optimizations With big data, bandwidth to data is a problem Map-Reduce + HDFS is a very effective solution Map-Reduce queries HDFS for locations of input data Map tasks are scheduled local to the inputs when possible
7 Mappers and Reducers Every Map/Reduce program must specify a Mapper and typically a Reducer The Mapper has a map method that transforms input (key, value) pairs into any number of intermediate (key, value ) pairs The Reducer has a reduce method that transforms intermediate (key, value *) aggregates into any number of output (key, value ) pairs
8 Map/Reduce Dataflow
9 Example 45% of all Hadoop tutorials count words. 25% count sentences. 20% are about paragraphs. 10% are log parsers. The remainder are helpful
10 Example: Wordcount Mapper public static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); } public void map(longwritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.tostring(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasmoretokens()) { word.set(itr.nexttoken()); output.collect(word, one); } }
11 Example: Wordcount Reducer public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { } public void reduce(text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasnext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); }
12 Input and Output Formats A Map/Reduce may specify how it s input is to be read by specifying an InputFormat to be used InputSplit RecordReader A Map/Reduce may specify how it s output is to be written by specifying an OutputFormat to be used These default to TextInputFormat and TextOutputFormat, which process line-based text data SequenceFile: SequenceFileInputFormat and SequenceFileOutputFormat These are file-based, but they are not required to be
13 Configuring a Job Jobs are controlled by configuring JobConf JobConfs are maps from attribute names to string value The framework defines attributes to control how the job is executed. conf.set( mapred.job.name, MyApp ); Applications can add arbitrary values to the JobConf conf.set( my.string, foo ); conf.setinteger( my.integer, 12); JobConf is available to all of the tasks
14 Putting it all together Create a launching program for your application The launching program configures: The Mapper and Reducer to use The output key and value types (input types are inferred from the InputFormat) The locations for your input and output Optionally the InputFormat and OutputFormat to use The launching program then submits the job and typically waits for it to complete
15 Putting it all together public class WordCount { public static void main(string[] args) throws IOException { JobConf conf = new JobConf(WordCount.class); conf.setjobname("wordcount"); // the keys are words (strings) conf.setoutputkeyclass(text.class); // the values are counts (ints) conf.setoutputvalueclass(intwritable.class);.. conf.setmapperclass(mapclass.class); conf.setreducerclass(reduce.class); conf.setinputpath(new Path(args[0]); conf.setoutputpath(new Path(args[1]); JobClient.runJob(conf);
16 Non-Java Interfaces Streaming Pipes (C++) Pig Hive Jaql Cascading
17 Streaming What about Unix hacks? Can define Mapper and Reduce using Unix text filters Typically use grep, sed, python, or perl scripts Format for input and output is: key \t value \n Allows for easy debugging and experimentation Slower than Java programs bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingmapper.sh -reducer streamingreducer.sh Mapper: /bin/sed -e 's \n g' /bin/grep. Reducer: /usr/bin/uniq -c /bin/awk '{print $2 "\t" $1}'
18 Pipes (C++) C++ API and library to link application with C++ application is launched as a sub-process of the Java task Keys and values are std::string with binary data Word count map looks like: class WordCountMap: public HadoopPipes::Mapper { public: WordCountMap(HadoopPipes::TaskContext& context){} void map(hadooppipes::mapcontext& context) { std::vector<std::string> words = HadoopUtils::splitString(context.getInputValue(), " "); for(unsigned int i=0; i < words.size(); ++i) { context.emit(words[i], "1"); }}};
19 Pipes (C++) The reducer looks like: class WordCountReduce: public HadoopPipes::Reducer { public: WordCountReduce(HadoopPipes::TaskContext& context){} void reduce(hadooppipes::reducecontext& context) { } }; int sum = 0; while (context.nextvalue()) { } sum += HadoopUtils::toInt(context.getInputValue()); context.emit(context.getinputkey(), HadoopUtils::toString(sum));
20 Pipes (C++) And define a main function to invoke the tasks: int main(int argc, char *argv[]) { return HadoopPipes::runTask( HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce, void, WordCountReduce>()); }
21 Pig Hadoop Sub-project Scripting language that generates Map/Reduce jobs User uses higher level operations Group by Foreach Word Count: input = LOAD in-dir' USING TextLoader(); words = FOREACH input GENERATE FLATTEN(TOKENIZE(*)); grouped = GROUP words BY $0; counts = FOREACH grouped GENERATE group, COUNT(words); STORE counts INTO out-dir ;
22 Hive Hadoop Sub-project SQL-like interface for querying tables stored as flat-files on HDFS, complete with a meta-data repository Developed at Facebook In the process of moving from Hadoop contrib to a stand-alone Hadoop sub-project
23 How many Maps and Reduces Maps Usually as many as the number of HDFS blocks being processed, this is the default Else the number of maps can be specified as a hint The number of maps can also be controlled by specifying the minimum split size The actual sizes of the map inputs are computed by: Reduces max(min(block_size, data/#maps), min_split_size) Unless the amount of data being processed is small 0.95*num_nodes*mapred.tasktracker.reduce.tasks.maximum
24 Performance Example Bob wants to count lines in text files totaling several terabytes He uses Identity Mapper (input: text, output: same text) A single Reducer that counts the lines and outputs the total What is he doing wrong? This happened, really! I am not kidding!
25 Some handy tools Partitioners Combiners Compression Counters Speculation Zero reduces Distributed File Cache Tool
26 Partitioners Partitioners are application code that define how keys are assigned to reduces Default partitioning spreads keys evenly, but randomly Uses key.hashcode() % num_reduces Custom partitioning is often required, for example, to produce a total order in the output Should implement Partitioner interface Set by calling conf.setpartitionerclass(mypart.class) To get a total order, sample the map output keys and pick values to divide the keys into roughly equal buckets and use that in your partitioner
27 Combiners When maps produce many repeated keys It is often useful to do a local aggregation following the map Done by specifying a Combiner Goal is to decrease size of the transient data Combiners have the same interface as Reduces, and often are the same class. Combiners must not have side effects, because they run an indeterminate number of times. In WordCount, conf.setcombinerclass(reduce.class);
28 Compression Compressing the outputs and intermediate data will often yield huge performance gains Can be specified via a configuration file or set programatically Set mapred.output.compress to true to compress job output Set mapred.compress.map.output to true to compress map outputs Compression Types (mapred.output.compression.type) for SequenceFiles block - Group of keys and values are compressed together record - Each value is compressed individually Block compression is almost always best Compression Codecs (mapred(.map)?.output.compression.codec) Default (zlib) - slower, but more compression LZO - faster, but less compression
29 Counters Often Map/Reduce applications have countable events For example, framework counts records in to and out of Mapper and Reducer To define user counters: static enum Counter {EVENT1, EVENT2}; reporter.incrcounter(counter.event1, 1); Define nice names in a MyClass_Counter.properties file CounterGroupName=My Counters EVENT1.name=Event 1 EVENT2.name=Event 2
30 Speculative execution The framework can run multiple instances of slow tasks Output from instance that finishes first is used Controlled by the configuration variable mapred.speculative.execution Can dramatically bring in long tails on jobs
31 Zero Reduces Frequently, we only need to run a filter on the input data No sorting or shuffling required by the job Set the number of reduces to 0 Output from maps will go directly to OutputFormat and disk
32 Distributed File Cache Sometimes need read-only copies of data on the local computer. Downloading 1GB of data for each Mapper is expensive Define list of files you need to download in JobConf Files are downloaded once per a computer Add to launching program: DistributedCache.addCacheFile(new URI( hdfs://nn:8020/foo ), conf); Add to task: Path[] files = DistributedCache.getLocalCacheFiles(conf);
33 Tool Handle standard Hadoop command line options: -conf file - load a configuration file named file -D prop=value - define a single configuration property prop Class looks like: public class MyApp extends Configured implements Tool { public static void main(string[] args) throws Exception { System.exit(ToolRunner.run(new Configuration(), new MyApp(), args)); } public int run(string[] args) throws Exception {. getconf() } }
34 Debugging & Diagnosis Run job with the Local Runner Set mapred.job.tracker to local Runs application in a single process and thread Run job on a small data set on a 1 node cluster Can be done on your local dev box Set keep.failed.task.files to true This will keep files from failed tasks that can be used for debugging Use the IsolationRunner to run just the failed task Java Debugging hints Send a kill -QUIT to the Java process to get the call stack, locks held, deadlocks
35 Profiling Set mapred.task.profile to true Use mapred.task.profile.{maps reduces} hprof support is built-in Use mapred.task.profile.params to set options for the debugger Possibly use DistributedCache for the profiler s agent
36 Jobtracker front page
37 Job counters
38 Task status
39 Drilling down
40 Drilling down -- logs
41 Performance Is your input splittable? Gzipped files are NOT splittable Use compressed SequenceFiles Are partitioners uniform? Buffering sizes (especially io.sort.mb) Can you avoid Reduce step? Only use singleton reduces for very small data Use Partitioners and cat to get a total order Memory usage Please do not load all of your inputs into memory!
42 Q&A For more information: Website: Mailing lists: IRC: #hadoop on irc.freenode.org
Programming in Hadoop Programming, Tuning & Debugging
Programming in Hadoop Programming, Tuning & Debugging Venkatesh. S. Cloud Computing and Data Infrastructure Yahoo! Bangalore (India) Agenda Hadoop MapReduce Programming Distributed File System HoD Provisioning
HowTo Hadoop. Devaraj Das
HowTo Hadoop Devaraj Das Hadoop http://hadoop.apache.org/core/ Hadoop Distributed File System Fault tolerant, scalable, distributed storage system Designed to reliably store very large files across machines
Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology
Hadoop Dawid Weiss Institute of Computing Science Poznań University of Technology 2008 Hadoop Programming Summary About Config 1 Open Source Map-Reduce: Hadoop About Cluster Configuration 2 Programming
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
CS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
Hadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Introduction To Hadoop
Introduction To Hadoop Kenneth Heafield Google Inc January 14, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise noted, the
Data management in the cloud using Hadoop
UT DALLAS Erik Jonsson School of Engineering & Computer Science Data management in the cloud using Hadoop Murat Kantarcioglu Outline Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics
Word count example Abdalrahman Alsaedi
Word count example Abdalrahman Alsaedi To run word count in AWS you have two different ways; either use the already exist WordCount program, or to write your own file. First: Using AWS word count program
Data Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
Data-Intensive Computing with Hadoop. Thanks to: Milind Bhandarkar <[email protected]> Yahoo! Inc.
Data-Intensive Computing with Hadoop Thanks to: Milind Bhandarkar Yahoo! Inc. Agenda Hadoop Overview HDFS Programming Hadoop Architecture Examples Hadoop Streaming Performance Tuning
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
MR-(Mapreduce Programming Language)
MR-(Mapreduce Programming Language) Siyang Dai Zhi Zhang Shuai Yuan Zeyang Yu Jinxiong Tan sd2694 zz2219 sy2420 zy2156 jt2649 Objective of MR MapReduce is a software framework introduced by Google, aiming
Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN
Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal
Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
How To Write A Map Reduce In Hadoop Hadooper 2.5.2.2 (Ahemos)
Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1 Content Map Function Reduce Function Why Hadoop HDFS Map Reduce Hadoop Some Questions 2 What is Map
Big Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. [email protected] http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source
Hadoop: Understanding the Big Data Processing Method
Hadoop: Understanding the Big Data Processing Method Deepak Chandra Upreti 1, Pawan Sharma 2, Dr. Yaduvir Singh 3 1 PG Student, Department of Computer Science & Engineering, Ideal Institute of Technology
How To Write A Mapreduce Program In Java.Io 4.4.4 (Orchestra)
MapReduce framework - Operates exclusively on pairs, - that is, the framework views the input to the job as a set of pairs and produces a set of pairs as the output
School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014. Hadoop for HPC. Instructor: Ekpe Okorafor
School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014 Hadoop for HPC Instructor: Ekpe Okorafor Outline Hadoop Basics Hadoop Infrastructure HDFS MapReduce Hadoop & HPC Hadoop
Introduction to Cloud Computing
Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of
Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc [email protected]
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc [email protected] What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
By Hrudaya nath K Cloud Computing
Processing Big Data with Map Reduce and HDFS By Hrudaya nath K Cloud Computing Some MapReduce Terminology Job A full program - an execution of a Mapper and Reducer across a data set Task An execution of
Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
Step 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
Copy the.jar file into the plugins/ subfolder of your Eclipse installation. (e.g., C:\Program Files\Eclipse\plugins)
Beijing Codelab 1 Introduction to the Hadoop Environment Spinnaker Labs, Inc. Contains materials Copyright 2007 University of Washington, licensed under the Creative Commons Attribution 3.0 License --
How To Write A Map In Java (Java) On A Microsoft Powerbook 2.5 (Ahem) On An Ipa (Aeso) Or Ipa 2.4 (Aseo) On Your Computer Or Your Computer
Lab 0 - Introduction to Hadoop/Eclipse/Map/Reduce CSE 490h - Winter 2007 To Do 1. Eclipse plug in introduction Dennis Quan, IBM 2. Read this hand out. 3. Get Eclipse set up on your machine. 4. Load the
Petabyte-scale Data with Apache HDFS. Matt Foley Hortonworks, Inc. [email protected]
Petabyte-scale Data with Apache HDFS Matt Foley Hortonworks, Inc. [email protected] Matt Foley - Background MTS at Hortonworks Inc. HDFS contributor, part of original ~25 in Yahoo! spin-out of Hortonworks
University of Maryland. Tuesday, February 2, 2010
Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
HPCHadoop: MapReduce on Cray X-series
HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
How To Write A Mapreduce Program On An Ipad Or Ipad (For Free)
Course NDBI040: Big Data Management and NoSQL Databases Practice 01: MapReduce Martin Svoboda Faculty of Mathematics and Physics, Charles University in Prague MapReduce: Overview MapReduce Programming
Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop
Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop Thomas Brenner, 08-928-434 1 Introduction+and+Task+ Temporal databases are databases expanded with a time dimension in order to
Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro
Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro CELEBRATING 10 YEARS OF JAVA.NET Apache Hadoop.NET-based MapReducers Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro
Hadoop Streaming. Table of contents
Table of contents 1 Hadoop Streaming...3 2 How Streaming Works... 3 3 Streaming Command Options...4 3.1 Specifying a Java Class as the Mapper/Reducer... 5 3.2 Packaging Files With Job Submissions... 5
Hadoop Design and k-means Clustering
Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise
Using Hadoop for Webscale Computing. Ajay Anand Yahoo! [email protected] Usenix 2008
Using Hadoop for Webscale Computing Ajay Anand Yahoo! [email protected] Agenda The Problem Solution Approach / Introduction to Hadoop HDFS File System Map Reduce Programming Pig Hadoop implementation
Cloudera Certified Developer for Apache Hadoop
Cloudera CCD-333 Cloudera Certified Developer for Apache Hadoop Version: 5.6 QUESTION NO: 1 Cloudera CCD-333 Exam What is a SequenceFile? A. A SequenceFile contains a binary encoding of an arbitrary number
Hadoop Configuration and First Examples
Hadoop Configuration and First Examples Big Data 2015 Hadoop Configuration In the bash_profile export all needed environment variables Hadoop Configuration Allow remote login Hadoop Configuration Download
Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
Yahoo! Grid Services Where Grid Computing at Yahoo! is Today
Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Marco Nicosia Grid Services Operations [email protected] What is Apache Hadoop? Distributed File System and Map-Reduce programming platform
Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology
Working With Hadoop Now that we covered the basics of MapReduce, let s look at some Hadoop specifics. Mostly based on Tom White s book Hadoop: The Definitive Guide, 3 rd edition Note: We will use the new
COURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
and HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
Hadoop Lab Notes. Nicola Tonellotto November 15, 2010
Hadoop Lab Notes Nicola Tonellotto November 15, 2010 2 Contents 1 Hadoop Setup 4 1.1 Prerequisites........................................... 4 1.2 Installation............................................
Word Count Code using MR2 Classes and API
EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND
Hadoop, Hive & Spark Tutorial
Hadoop, Hive & Spark Tutorial 1 Introduction This tutorial will cover the basic principles of Hadoop MapReduce, Apache Hive and Apache Spark for the processing of structured datasets. For more information
The Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
Lecture 3 Hadoop Technical Introduction CSE 490H
Lecture 3 Hadoop Technical Introduction CSE 490H Announcements My office hours: M 2:30 3:30 in CSE 212 Cluster is operational; instructions in assignment 1 heavily rewritten Eclipse plugin is deprecated
MapReduce Programming with Apache Hadoop Viraj Bhat [email protected]
MapReduce Programming with Apache Hadoop Viraj Bhat [email protected] - 1 - Agenda Session I and II (8-12pm) Introduction Hadoop Distributed File System (HDFS) Hadoop Map-Reduce Programming Hadoop Architecture
Hadoop & Pig. Dr. Karina Hauser Senior Lecturer Management & Entrepreneurship
Hadoop & Pig Dr. Karina Hauser Senior Lecturer Management & Entrepreneurship Outline Introduction (Setup) Hadoop, HDFS and MapReduce Pig Introduction What is Hadoop and where did it come from? Big Data
Xiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
Introduction to MapReduce and Hadoop
UC Berkeley Introduction to MapReduce and Hadoop Matei Zaharia UC Berkeley RAD Lab [email protected] What is MapReduce? Data-parallel programming model for clusters of commodity machines Pioneered
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
Data Science Analytics & Research Centre
Data Science Analytics & Research Centre Data Science Analytics & Research Centre 1 Big Data Big Data Overview Characteristics Applications & Use Case HDFS Hadoop Distributed File System (HDFS) Overview
HDFS. Hadoop Distributed File System
HDFS Kevin Swingler Hadoop Distributed File System File system designed to store VERY large files Streaming data access Running across clusters of commodity hardware Resilient to node failure 1 Large files
Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang
Case-Based Reasoning Implementation on Hadoop and MapReduce Frameworks Done By: Soufiane Berouel Supervised By: Dr Lily Liang Independent Study Advanced Case-Based Reasoning Department of Computer Science
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Big Data 2012 Hadoop Tutorial
Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:
Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl
Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?
Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software? 可 以 跟 資 料 庫 結 合 嘛? Can Hadoop work with Databases? 開 發 者 們 有 聽 到
Prepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis
Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis Prabin R. Sahoo Tata Consultancy Services Yantra Park, Thane Maharashtra, India ABSTRACT Hadoop Distributed
Complete Java Classes Hadoop Syllabus Contact No: 8888022204
1) Introduction to BigData & Hadoop What is Big Data? Why all industries are talking about Big Data? What are the issues in Big Data? Storage What are the challenges for storing big data? Processing What
Elastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL) [email protected]
Elastic Map Reduce Shadi Khalifa Database Systems Laboratory (DSL) [email protected] The Amazon Web Services Universe Cross Service Features Management Interface Platform Services Infrastructure Services
Tutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
Introduc8on to Apache Spark
Introduc8on to Apache Spark Jordan Volz, Systems Engineer @ Cloudera 1 Analyzing Data on Large Data Sets Python, R, etc. are popular tools among data scien8sts/analysts, sta8s8cians, etc. Why are these
Connecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Data-intensive computing systems
Data-intensive computing systems Hadoop Universtity of Verona Computer Science Department Damiano Carra Acknowledgements! Credits Part of the course material is based on slides provided by the following
Mrs: MapReduce for Scientific Computing in Python
Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing
Hadoop: The Definitive Guide
Hadoop: The Definitive Guide Tom White foreword by Doug Cutting O'REILLY~ Beijing Cambridge Farnham Köln Sebastopol Taipei Tokyo Table of Contents Foreword Preface xiii xv 1. Meet Hadoop 1 Da~! 1 Data
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
MarkLogic Server. MarkLogic Connector for Hadoop Developer s Guide. MarkLogic 8 February, 2015
MarkLogic Connector for Hadoop Developer s Guide 1 MarkLogic 8 February, 2015 Last Revised: 8.0-3, June, 2015 Copyright 2015 MarkLogic Corporation. All rights reserved. Table of Contents Table of Contents
MapReduce Tutorial. Table of contents
Table of contents 1 Purpose... 2 2 Prerequisites...2 3 Overview... 2 4 Inputs and Outputs... 3 5 Example: WordCount v1.0... 3 5.1 Source Code...3 5.2 Usage...6 5.3 Walk-through... 7 6 MapReduce - User
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
INTRODUCTION TO HADOOP
Hadoop INTRODUCTION TO HADOOP Distributed Systems + Middleware: Hadoop 2 Data We live in a digital world that produces data at an impressive speed As of 2012, 2.7 ZB of data exist (1 ZB = 10 21 Bytes)
Big Data and Scripting map/reduce in Hadoop
Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb
ITG Software Engineering
Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.
COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Spring 2014. HDFS Basics
COSC 6397 Big Data Analytics Distributed File Systems (II) Edgar Gabriel Spring 2014 HDFS Basics An open-source implementation of Google File System Assume that node failure rate is high Assumes a small
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
Hadoop MapReduce Tutorial - Reduce Comp variability in Data Stamps
Distributed Recommenders Fall 2010 Distributed Recommenders Distributed Approaches are needed when: Dataset does not fit into memory Need for processing exceeds what can be provided with a sequential algorithm
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Introduction to MapReduce
Introduction to MapReduce Jerome Simeon IBM Watson Research Content obtained from many sources, notably: Jimmy Lin course on MapReduce. Our Plan Today 1. Background: Cloud and distributed computing 2.
Hadoop: The Definitive Guide
FOURTH EDITION Hadoop: The Definitive Guide Tom White Beijing Cambridge Famham Koln Sebastopol Tokyo O'REILLY Table of Contents Foreword Preface xvii xix Part I. Hadoop Fundamentals 1. Meet Hadoop 3 Data!
Data Deluge. Billions of users connected through the Internet. Storage getting cheaper
Hadoop 1 Data Deluge Billions of users connected through the Internet WWW, FB, twitter, cell phones, 80% of the data on FB was produced last year Storage getting cheaper Store more data! Why Hadoop Drivers:
A very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
