Creating .NET-based Mappers and Reducers for Hadoop with JNBridgePro
Summary

The Apache Hadoop framework enables distributed processing of very large data sets. Hadoop is written in Java, and has only limited methods for integrating with mapreducers written in other languages. This lab demonstrates how you can use JNBridgePro to program directly against the Java-based Hadoop API to create .NET-based mapreducers.

Hadoop mapreducers, Java, and .NET

Apache Hadoop (or Hadoop, for short) is an increasingly popular Java-based tool used to perform massively parallel processing and analysis of large data sets. Such large data sets requiring special processing techniques are often called "Big Data." The analysis of very large log files is a typical example of a task suitable for Hadoop. When processed using Hadoop, the log files are broken into many chunks, then farmed out to a large set of processes called mappers, which perform identical operations on each chunk. The results of the mappers are then sent to another set of processes called reducers, which combine the mapper output into a unified result. Hadoop is well suited to running on large clusters of machines, particularly in the cloud. Both Amazon EC2 and Microsoft Windows Azure, among other cloud offerings, either provide or are developing targeted support for Hadoop.

In order to implement the functionality of a Hadoop application, the developer must write the mappers and reducers (sometimes collectively called "mapreducers"), then plug them into the Hadoop framework through a well-defined API. Because the Hadoop framework is written in Java, most mapreducer development is also done in Java. While it is possible to write the mapreducers in languages other than Java, through a mechanism known as Hadoop Streaming, this isn't an ideal solution: the data sent to the mapreducers over standard input needs to be parsed and then converted from text to whatever native form is being processed. Handling the data being passed through standard input and output incurs overhead, as well as additional coding effort.
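To make that overhead concrete, here is a minimal sketch of what a word-count mapper looks like under Hadoop Streaming. (The class name is our own invention for illustration; any executable that reads records from standard input and writes tab-separated key/value pairs to standard output can serve as a streaming mapper.)

// Streaming word-count mapper sketch: all input arrives as text on stdin,
// and all output must be serialized back to text as key<TAB>value lines.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.split("\\s+")) {    // parse the raw text ourselves
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // re-serialize to text on the way out
                }
            }
        }
    }
}

Every record crosses a text boundary twice, which is exactly the parsing and conversion cost described above.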
The alternative that we present in this lab is a way to create .NET-based mapreducers by programming against the Hadoop API using JNBridgePro. In this lab, the .NET-based mapreducers run in the actual Hadoop Java processes (which is possible when the Hadoop cluster is running on Windows machines), but we will also discuss ways to run the .NET sides outside the Java processes. In this example, we show how to host the maximal amount of mapreducer functionality in .NET, although you could use the same approach to host as much or as little of the functionality in .NET as you like, and host the rest in Java. You will come away with an understanding of how to create .NET-based Hadoop mapreducers and deploy them as part of a Hadoop application. The code we provide can be used as a pattern upon which you can create your own .NET-based mapreducers.

You might want or need to write mapreducers in .NET for a number of reasons. For example, you might have an investment in .NET-based libraries with specialized functionality that needs to be used in the Hadoop application. Your organization may have more developers with .NET skills than with Java skills. Or you may be planning to run your Hadoop application on Windows Azure, where, even though the Hadoop implementation is still in Java and there is support for Java, the majority of the tooling is far more friendly to .NET development.

This lab is not a tutorial in Hadoop programming, or in deploying and running Hadoop applications. For the latter, there is a good tutorial here ( - Intro.html). The tutorial refers to some older versions of Eclipse and Hadoop, but it will work with more recent versions. Even if you're familiar with Hadoop, the example builds on some of the setup in the tutorial, so it might be worth working through the tutorial beforehand. We will point out these dependencies when we discuss how the Hadoop job should be set up, so you can decide whether to build on top of the tutorial or make changes to the code in the lab.

Example

The lab is based on the standard word count example that comes with the Hadoop distribution, in which the occurrences of all words in a set of documents are counted. We've chosen this example because it's often used in introductory Hadoop tutorials, and is usually well understood by Hadoop programmers. Consequently, we won't spend much time talking about the actual functionality of the example, that is, how the word counting actually works. What this example does is move all the Java-based functionality of the word count mapreducers into C#. As you will see, we need to leave a small amount of the mapreducer in Java as a thin layer. Understanding the necessity of this thin layer, and how it works, provides a design pattern that can be used in the creation of other .NET-based mapreducers.

Interoperability strategy

At first glance, the obvious approach to interoperability would be to use JNBridgePro to proxy the Java-based Mapper and Reducer interfaces and the MapReduceBase abstract class into .NET, then program in C# against these proxies. Then, still using JNBridgePro, we would proxy the .NET-based mapper and reducer classes and register those proxies with the Hadoop framework. The resulting project would use bidirectional interoperability and would be quite straightforward. Unfortunately, this approach leads to circularities and name clashes: the proxied mapper and reducer will contain proxied parameters with the same names as the actual Java-based Hadoop classes. In other words, there will be proxies of proxies, and the result will not work. While it is possible to edit the jar files and perform some other unusual actions to work around this, the result would be confusing and would still not work in all cases. So we need to take a different approach.
Instead, we will create thin Java-based wrapper classes implementing the Mapper and Reducer interfaces, which will interact with the hosting Hadoop framework, and which will also call the .NET-based functionality through proxies, making this a Java-to-.NET project. In the cases where the .NET functionality needs to access Java-based Hadoop objects, particularly OutputCollectors and Iterators, it will do so indirectly, through callbacks. The resulting code is much simpler and more elegant.

The original WordCount example

Let's start with the original Java-based word count mapper and reducer, from the example that comes with Hadoop. We will not be using this code in our example, and we will not be discussing how it works (it should be fairly straightforward if you're familiar with Hadoop), but it will be useful as a reference when we move to the .NET-based version. Here is the mapper:

/**
 * WordCount mapper class from the Apache Hadoop examples.
 * Counts the words in each line.
 * For each line of input, break the line into words and emit them as
 * (word, 1).
 */
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
And here is the reducer:

/**
 * From the Apache Hadoop examples.
 * A WordCount reducer class that just emits the sum of the input values.
 */
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Migrating the functionality to .NET

What we want to do next is migrate as much of the mapper and reducer functionality as possible to .NET (in this case, C#) code. Note that we can't migrate all of it verbatim; the Java code references Hadoop-specific classes like Text, IntWritable, LongWritable, OutputCollector, and Reporter, as well as other crucial Java classes such as Iterator. Text, IntWritable, and LongWritable values can easily be represented as string, int, and long, which are converted automatically by JNBridgePro. However, while it is possible to convert classes like OutputCollector and Iterator to and from native .NET classes like ArrayList, such conversions are highly inefficient, since they involve copying every element in the collection, perhaps multiple times. Instead, we will continue to use the original OutputCollector and Iterator classes on the Java side, and the .NET code will only use them indirectly, knowing nothing about the actual classes. Callbacks provide a mechanism for doing this.

Here is the C# code implementing the mapper and reducer functionality:

namespace DotNetMapReducer
{
    // used for callbacks for the OutputCollector
    public delegate void collectResult(string theKey, int theValue);

    // used for callbacks to the Iterator
    // returns null if no more values, returns a boxed integer otherwise
    public delegate object getNextValue();

    public class DotNetMapReducer
    {
        public void map(string line, collectResult resultCollector)
        {
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens())
            {
                string nextToken = st.nextToken();
                resultCollector(nextToken, 1);
            }
        }
        public void reduce(string key, getNextValue next, collectResult resultCollector)
        {
            int sum = 0;
            object nextValue = next();   // get the next one, if there
            while (nextValue != null)
            {
                sum += (int)nextValue;
                nextValue = next();
            }
            resultCollector(key, sum);
        }
    }

    public class StringTokenizer
    {
        private static char[] defaultDelimiters = { ' ', '\t', '\n', '\r', '\f' };
        private string[] tokens;
        private int numTokens;
        private int curToken;

        public StringTokenizer(string line, char[] delimiters)
        {
            tokens = line.Split(delimiters);
            numTokens = tokens.Length;
            curToken = 0;
        }

        public StringTokenizer(string line)
            : this(line, defaultDelimiters)
        {
        }

        public bool hasMoreTokens()
        {
            if (curToken < numTokens)
                return true;
            else
                return false;
        }

        public string nextToken()
        {
            if (hasMoreTokens())
                return tokens[curToken++];
            else
                throw new IndexOutOfRangeException();
        }
    }
}
StringTokenizer is just a .NET-based reimplementation of the standard Java StringTokenizer class, and we won't be discussing it further. Note the two delegates, collectResult and getNextValue, that are used by the mapreducer. These are the means by which the .NET code calls back into the Java code for additional functionality, possibly involving classes like OutputCollector and Iterator that the .NET code knows nothing about. Also note that the .NET code uses string and int where the Java code had Text and IntWritable (and LongWritable); the wrapper code will handle the conversions.

Once we have the .NET functionality built and tested, we need to proxy the mapreducer class and supporting classes. We then incorporate the proxy jar file, jnbcore.jar, and bcel-5.1-jnbridge.jar into our Java Hadoop project and can start writing the Java-based mapper and reducer wrappers. Here they are:

public class MapperWrapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static MapReducerHelper mrh = new MapReducerHelper();
    private DotNetMapReducer dnmr = new DotNetMapReducer();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        OutputCollectorHandler och = new OutputCollectorHandler(output);
        dnmr.map(value.toString(), och);
        Callbacks.releaseCallback(och);
    }
}

public class ReducerWrapper extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private static MapReducerHelper mrh = new MapReducerHelper();
    private DotNetMapReducer dnmr = new DotNetMapReducer();

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        IteratorHandler ih = new IteratorHandler(values);
        OutputCollectorHandler och = new OutputCollectorHandler(output);
        dnmr.reduce(key.toString(), ih, och);
        Callbacks.releaseCallback(ih);
        Callbacks.releaseCallback(och);
    }
}
Note that the only purpose of these thin wrappers is to interface with the Hadoop framework, host the .NET-based functionality, and handle the passing of values to and from the .NET components (along with any necessary conversions).

There are two callback objects, IteratorHandler and OutputCollectorHandler, which encapsulate the Iterator and OutputCollector objects. These are passed where the .NET map() and reduce() methods expect delegate parameters, and they hide the actual Hadoop Java types from the .NET code. The mapreducer simply calls the resultCollector() or getNextValue() delegate, and the action is performed, or the value returned, without the .NET side knowing anything about the mechanism behind it.

Since callbacks consume resources (in particular, a dedicated thread for each callback object), and there can be many invocations of map() and reduce(), it is important to release the callback objects (using the Callbacks.releaseCallback() API) to free those threads when they are no longer needed. If you do not make those calls, performance will degrade substantially.
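Since map() and reduce() can also exit abnormally, one defensive variant (our suggestion, not part of the original lab code) is to release the callback object in a finally block, so the dedicated thread is reclaimed on every path out of the method:

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    OutputCollectorHandler och = new OutputCollectorHandler(output);
    try {
        dnmr.map(value.toString(), och);
    }
    finally {
        Callbacks.releaseCallback(och);   // runs even if map() throws
    }
}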
Here is the Java code for the two callback classes:

public class OutputCollectorHandler implements collectResult {

    private OutputCollector<Text, IntWritable> outputCollector = null;

    public OutputCollectorHandler(OutputCollector<Text, IntWritable> theCollector) {
        outputCollector = theCollector;
    }

    public void Invoke(String theKey, int theValue) {
        try {
            outputCollector.collect(new Text(theKey), new IntWritable(theValue));
        }
        catch (IOException e) {
            // not sure why it would throw IOException anyway
        }
    }
}

import System.BoxedInt;
import System.Object;

public class IteratorHandler implements getNextValue {

    private Iterator<IntWritable> theIterator = null;

    public IteratorHandler(Iterator<IntWritable> iterator) {
        theIterator = iterator;
    }

    // returns null if no more values, otherwise returns a boxed integer
    public Object Invoke() {
        if (!theIterator.hasNext()) {
            return null;
        }
        else {
            IntWritable iw = theIterator.next();
            int i = iw.get();
            return new BoxedInt(i);
        }
    }
}

The two callback objects encapsulate the respective Java collections and perform the appropriate conversions when their Invoke() methods are called. The IteratorHandler, rather than providing the typical hasNext()/getNext() interface, has a single Invoke() method (this is how callbacks work in Java-to-.NET projects), so we've written Invoke() to return null if there are no more objects, and to return the integer (boxed, so that it can be passed in place of a System.Object) when there is a new value. There are other ways you could choose to do this, but this method will work for iterators that return primitive objects.

Finally, we need to configure JNBridgePro. For maximum flexibility, we've chosen to configure it programmatically, through the MapReducerHelper class. Since configuration can only happen once in each process, and must happen before any proxy call, MapReducerHelper performs the configuration inside its static initializer, which is executed when the class is loaded. This will happen only once per process and is guaranteed to be done before any proxies are called. Here is the Java-based MapReducerHelper:
public class MapReducerHelper {
    static {
        Properties p = new Properties();
        p.put("dotNetSide.serverType", "sharedmem");
        p.put("dotNetSide.assemblyList.1",
              "C:/DotNetAssemblies/DotNetMapReducer.dll");
        p.put("dotNetSide.javaEntry",
              "C:/Program Files/JNBridge/JNBridgePro v6.0/4.0-targeted/JNBJavaEntry.dll");
        p.put("dotNetSide.appBase",
              "C:/Program Files/JNBridge/JNBridgePro v6.0/4.0-targeted");
        DotNetSide.init(p);
    }
}

The paths in the configuration will likely be different in your deployment, so you will need to adjust them accordingly.
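If you later switch to tcp/binary communications (see "Running the Hadoop job on Linux machines" below), only this configuration class needs to change. The sketch below shows the general shape; the "tcp" serverType value and the host/port property names are our assumptions for illustration, so check the JNBridgePro documentation for the exact keys your version expects:

public class MapReducerHelper {
    static {
        Properties p = new Properties();
        // Assumed property names/values for a tcp/binary connection to a
        // .NET side hosted by JNBDotNetSide.exe on a Windows machine;
        // consult the JNBridgePro documentation for the exact keys.
        p.put("dotNetSide.serverType", "tcp");
        p.put("dotNetSide.host", "dotnetside-machine");   // hypothetical host name
        p.put("dotNetSide.port", "8085");                 // hypothetical port
        DotNetSide.init(p);
    }
}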
Finally, we create the Java-based Hadoop driver in the usual way, specifying the new wrappers as the mapper and reducer classes:

public class WordCountDotNetDriver {

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(WordCountDotNetDriver.class);

        // specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // specify input and output DIRECTORIES (not files)
        FileInputFormat.setInputPaths(conf, new Path("In"));
        FileOutputFormat.setOutputPath(conf, new Path("Out"));

        // specify a mapper and a combiner
        conf.setMapperClass(MapperWrapper.class);
        conf.setCombinerClass(ReducerWrapper.class);

        // specify a reducer
        conf.setReducerClass(ReducerWrapper.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Deploying and running the Hadoop application

At this point we can deploy and run the application. Start by deploying the application to your Hadoop cluster. (It needs to be a cluster of Windows machines, since we're using shared memory communications; we'll talk about Linux clusters later.) If you have a setup like the one described in the aforementioned tutorial, just copy the source files over and build. Make sure that the appropriate version of JNBridgePro (x86 or x64, depending on whether you're running Hadoop on 32-bit or 64-bit Java) is installed on all the machines in the cluster, that an appropriate license (which may be an evaluation license) is installed, and that the dotNetSide.javaEntry and dotNetSide.appBase properties in MapReducerHelper agree with the installed location of JNBridgePro. If not, either install JNBridgePro in the correct place, or edit MapReducerHelper and rebuild.

You will need to put the proxy jar file, as well as jnbcore.jar and bcel-5.1-jnbridge.jar, on the Hadoop classpath. There are a couple of ways to do this: you can add the paths of these files to the HADOOP_CLASSPATH environment variable, or you can copy the jar files into the Hadoop lib folder. Finally, copy the .NET DLL file containing the .NET classes to each machine in the Hadoop cluster, and put it in the location specified by the dotNetSide.assemblyList.1 property in MapReducerHelper.

Once this is all done, start up all your Hadoop nodes. Make sure that in your HDFS service you've created a directory In, and that you've uploaded all the .txt files in the root Hadoop folder: mostly licensing information, release notes, and readme documents. Feel free to load additional documents. If there is an Out directory, delete it along with its contents. (If the Out directory exists when the program is run, an exception will be thrown.)

Now, run the Hadoop application. It will run to completion, and you will find an HDFS folder named Out containing a document with the result.

The Hadoop job that you just ran worked in exactly the same way as any ordinary all-Java Hadoop job, but the mapreducer functionality was written in .NET and ran in the same processes as the rest of the Hadoop framework. No streaming was required to do this, and we were able to program against native Hadoop APIs.
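If you would rather not delete the Out directory by hand before each run, the driver can remove it before submitting the job. Here is a minimal sketch using the standard Hadoop FileSystem API; the helper class and method names are our own, and main() would call it just before JobClient.runJob(conf):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class OutputDirCleaner {
    // Recursively deletes the job's "Out" directory in HDFS if it exists,
    // so that a leftover directory from an earlier run does not make the
    // job throw on startup.
    public static void clearOutputDirectory(JobConf conf) throws IOException {
        Path outPath = new Path("Out");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);   // true = delete contents recursively
        }
    }
}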
Running the Hadoop job on Linux machines

Since we've chosen to run the Hadoop application using JNBridgePro shared memory communications, we need to run our Hadoop cluster on Windows machines: the .NET Framework must be installed on the machines on which the Hadoop processes are running. It is possible to run the application on a cluster of Linux machines, but you will need to change the configuration to use tcp/binary communications, and then run .NET-side processes on one or more Windows machines. The simplest way to run a .NET side is to configure and use the JNBDotNetSide.exe utility that comes with the JNBridgePro installation. Configure each Java side to point to one of the .NET-side machines. You can share a .NET side among multiple Java sides without any problem, although the fewer Java sides talking to each .NET side, the better the performance you will get.

Note that changing from shared memory to tcp/binary does not require any changes to your .NET or Java code. You can use the same binaries as before; you only need to change the configuration.

Conclusion

This lab has shown how you can write .NET-based Hadoop mapreducers without having to use Hadoop Streaming or implement parsing of the stream. The .NET code can include as much or as little of the mapreducer functionality as you desire; the rest can be placed in Java-based wrappers. In the example we've worked through, the .NET code contains all of the mapreducer functionality except for the minimal functionality required for connectivity with the Hadoop framework itself. The .NET code can run in the same processes as the rest of the Hadoop application (when Hadoop is running on a Windows cluster), or on different machines (when Hadoop is running on a Linux cluster). You can use the example code as a generalized pattern for creating the wrappers that connect the Hadoop framework to the .NET-based mapreducers. The code is simple and straightforward, and variants of it will work in most mapreduce scenarios.

You can also enhance the provided code for additional scenarios. For example, if you want to use tcp/binary communications, you can modify the Java-side configuration (in class MapReducerHelper) so that any specific instance of the Java side can choose to connect to one of a set of .NET sides running on a cluster; the assignments do not have to be made by hand. You can also use the provided code to support tight integration of .NET mapreducers in a Hadoop application running on Windows Azure. This approach provides more flexibility and improved performance over the Hadoop Streaming used by the Azure implementation.

We expect that the provided code and design patterns will be useful in many scenarios we haven't even thought of. We'd love to hear your comments, suggestions, and feedback; you can contact us at [email protected]. You can download the source code for the lab at

Copyright 2012 JNBridge, LLC. All rights reserved. JNBridge and JNBridgePro are trademarks of JNBridge, LLC. All other products are the trademarks or registered trademarks of their respective owners.