Integrating NLTK with the Hadoop Map Reduce Framework
Human Language Technology Project


Paul Bone
June 2008

Contents

1 Introduction
2 Method
2.1 Hadoop and Python
2.2 Data and MapReduce
3 Implementation
3.1 Dependencies
3.2 Hadoop Pipes 4 Python
3.3 Distributional Similarity Test Case
4 Results
5 Evaluation
References
Appendices
A Dependencies

1 Introduction

Modern Natural Language Processing (NLP) makes heavy use of prepared data in order to gather statistics. Many NLP analyses require large amounts of data before meaningful statistics can be gathered, and many are also computationally complex, whether over a single sentence or over an entire data set. These two factors can make NLP very slow, and the problem may worsen as more data is collected for use in NLP. Cluster computing makes it easier to process large data sets.

The Natural Language Toolkit (NLTK) is a framework and suite of software for NLP [4], whilst Hadoop is an open source MapReduce framework [1]. MapReduce-style clustering is especially suited to processing large amounts of data [8]. It is also able to scale well as new cluster nodes are added, which may be necessary as a dataset grows. This report shows how NLTK and Hadoop can be combined to allow processing of very large datasets.

2 Method

2.1 Hadoop and Python

NLTK is implemented in the Python programming language, whereas Hadoop is implemented in Java. These languages have separate runtime systems, which makes it difficult to integrate them directly. Jython, an implementation of Python in Java [10], makes it easy to integrate the two languages and to make function calls from one into the other. However, Jython does not yet fully implement Python 2.4, the language version that NLTK is written in. It is possible to back-port NLTK to Python 2.2, but doing so would take a great deal of effort and would introduce maintenance problems for NLTK.

Hadoop provides two other ways of interacting with it. The first is Hadoop Streaming [7], which allows a regular Unix process to communicate with Hadoop over standard input and standard output. However, this will not correctly handle arbitrary data, since it uses newline characters to separate records and tab characters to separate keys from values. The other option is Hadoop Pipes [6], a C++ library that communicates with Hadoop. It is more flexible than Hadoop Streaming, and it is Swig-able: an interface for a scripting language such as Python can be generated by the Swig tool [2], and a Python program can then use the Hadoop Pipes C++ library via the generated interface. The Hadoop Pipes library has other benefits. For example, it allows a programmer to write a Combiner (a reduce-like task that runs inside the map task; see Section 2.2 for information about Hadoop tasks) in C++. It may also be possible to extend Hadoop Pipes to allow Python programs to directly access the Hadoop distributed file-system.
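To illustrate the framing restriction in Hadoop Streaming, the following is a minimal sketch of a streaming mapper in Python. Streaming delivers one record per line on standard input and expects each output record as a key and a value separated by a tab, so any key or value that itself contained a newline or a tab, such as pickled binary data, would be mis-split.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming mapper (word-count style).
    # Records arrive one per line on stdin; output records are written
    # as "key<TAB>value" lines, so only plain line-oriented text is safe.
    import sys

    for line in sys.stdin:
        for word in line.split():
            sys.stdout.write("%s\t%d\n" % (word, 1))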

2.2 Data and MapReduce

MapReduce [8] is a recent (and simple) idea for parallelising a computation across a cluster, provided that the computation is embarrassingly parallel. We define an embarrassingly parallel problem as one that can easily be divided into a large number of parts to be computed in parallel. When a large amount of data is the input to the computation, the data is split up into a number of map tasks. One or more reduce tasks then run to aggregate the results of the map tasks into a single set of results. (The names map and reduce are inspired by functions that perform a similar, often non-parallel, task in functional languages.)

Hadoop must split up input data when creating map tasks. By default this works on simple text files; extra formats can be supported by writing Java code that extends Hadoop. NLTK is distributed with many corpora (collections of texts), stored in different formats. Rather than writing Java code to read and write each individual format, it is easier to create a single new format capable of storing arbitrary data. NLTK can be used to process different types of data: words, sentences, and annotated text, including text in different languages. The format used with Hadoop must therefore preserve all annotations and encodings.

Using the binary encoding provided by Python's Pickle library (Pickle is the Python term for object serialisation), any Python object can be encoded into a binary representation. By reserving a byte (0xFF) as a marker that begins each record, a Hadoop task can seek to any arbitrary position in the file, scan for a marker, and begin processing records from that position. To prevent this marker from occurring within pickled data, we replace it with the two-byte sequence 0xFE 0xFD, and replace any 0xFE byte with the sequence 0xFE 0xFE.

Each record is made up of a key and value pair: the marker byte is followed by the key, then by the value. So that Hadoop can tell where the key ends and the value begins, each is preceded by its size as measured in bytes. The size is encoded using only seven bits per byte, to avoid producing the marker byte. (There are better ways to encode numbers; this method was selected for its simplicity.)

Hadoop sorts and groups records by their key between the map and reduce phases, which ensures that records with the same key are collected together and sent to the same reduce task in sequence. Because keys are stored as pickled data, this sorting may not order keys as expected; the grouping of records is unaffected, but reduce tasks have to take this into consideration.
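As a concrete illustration of this record format, the sketch below shows one plausible implementation of the escaping and of the seven-bit size encoding. It is reconstructed from the description above rather than taken from the HP4P Encode module: the function names are invented, and the fixed four-byte width of encode_size is an assumption, since the report does not say how the end of a size is detected.

    import cPickle as pickle

    MARKER = '\xff'

    def escape(data):
        # 0xFE becomes 0xFE 0xFE and 0xFF becomes 0xFE 0xFD, so the
        # marker byte can never occur inside an escaped record body.
        return data.replace('\xfe', '\xfe\xfe').replace('\xff', '\xfe\xfd')

    def unescape(data):
        # The inverse must scan left to right; a pair of naive string
        # replacements would mis-read sequences such as 0xFE 0xFD that
        # were present in the original data.
        out = []
        i = 0
        while i < len(data):
            if data[i] == '\xfe':
                if data[i + 1] == '\xfd':
                    out.append('\xff')
                else:
                    out.append('\xfe')
                i += 2
            else:
                out.append(data[i])
                i += 1
        return ''.join(out)

    def encode_size(n):
        # Assumed scheme: a fixed four bytes of seven bits each, most
        # significant first. The high bit of every byte stays clear, so
        # neither the marker nor the escape byte can appear in a size.
        return ''.join([chr((n >> shift) & 0x7f) for shift in (21, 14, 7, 0)])

    def write_record(out, key, value):
        # Record layout: marker, key size, key, value size, value.
        # Sizes here measure the escaped bytes, so a reader can consume
        # exactly that many bytes and then unescape them.
        k = escape(pickle.dumps(key, 2))
        v = escape(pickle.dumps(value, 2))
        out.write(MARKER + encode_size(len(k)) + k +
                  encode_size(len(v)) + v)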

3 Implementation

This project is built out of several components. Each component is described below, along with how to build and configure it.

3.1 Dependencies

A complete list of dependencies can be found in Appendix A. Most of these are simple to install and are available from the operating system's package manager. Configuration of Hadoop is non-trivial, so extra information is provided below. At least Hadoop 0.17 is required to use Hadoop Pipes 4 Python; it can be downloaded from the Apache Hadoop web site. See the Hadoop documentation [1] and the NLTK documentation [3] for more information and installation instructions.

deploy.py is provided with Hadoop Pipes 4 Python to aid in the deployment and configuration of a Hadoop cluster. Place this file and the Hadoop .tar.gz file in the same directory, then edit deploy.py to configure its settings; instructions are provided within deploy.py. When ready, execute deploy.py with a Python interpreter, and copy the resulting directory and deploy.py to a consistent location on each of the cluster nodes (consider an NFS-hosted location). Run deploy.py -s to start the cluster.

3.2 Hadoop Pipes 4 Python

Hadoop Pipes 4 Python (HP4P) is the main component of this project. It is implemented in Java, C++ and Python, and is a derived work of Hadoop Pipes [6] with minor modifications and extensions. Original work to handle the data encoding problems described in Section 2.2 has been added to HP4P. The Encode module of HP4P defines functions for encoding data for use on the cluster: Python objects are pickled to strings, and the resulting strings are encoded before being sent to the Java code in HP4P. The Java code adds the encoded sizes and prefixes every key-value pair with a marker byte. The reverse process is used to send data to the map and reduce components of a program. HP4P also contains a Python library that can perform the entire encoding before uploading data to the cluster.

As of the time of writing, HP4P is not yet part of NLTK or Hadoop. It is intended that one of these projects will host and maintain this code, so no URL can currently be provided for this work.

To build and install the Python section of this work, execute the commands shown in Figure 1 in the root directory of Hadoop Pipes 4 Python. This will also compile the C++ code and prepare the Python bindings that use it.

    $ python setup.py build
    $ sudo python setup.py install

Figure 1: Build and install the HP4P code

To build the Java section, change into the Java subdirectory of the project and edit the build.xml file; instructions are provided within this file. When done, run ant to compile the Java code. Copy the resulting file (hdp4p.jar) to the directory from which programs will be run on the cluster.
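In practice the deployment steps above might look like the following. The directory and host names are illustrative only; the -s flag is the cluster start option described above.

    $ python deploy.py            # build the configured Hadoop directory
    $ scp -r hadoop-0.17 deploy.py node01:/opt/hadoop/
                                  # repeat for each node, or use NFS
    $ python deploy.py -s         # start the cluster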

3.3 Distributional Similarity Test Case

To demonstrate this project, a distributional similarity program has been written. A corpus is scanned for all uses of a word, and the contexts in which that word appears are returned. In a second pass over the corpus, the most frequent of those contexts are found and the words appearing in them are returned. The result of this process is the set of words commonly seen in the same contexts as the first word. What is considered a context may vary between implementations; this program uses the word before and the word after. A simpler algorithm using only a single pass can be found in [5]. The algorithm used here requires two passes over the data: the first to find the contexts, and the second to find the other words used in those contexts.

This is performed over a portion of the Gigaword Corpus [9], using six years' worth of New York Times articles, roughly 2.1GB when compressed. This data is made up of stories, each of which has a by-line and multiple paragraphs; it is not otherwise processed, so the input must be tokenised. This is done in both phases, and the tokenised text is discarded after each phase.

The map and reduce tasks for each of the jobs are defined as global strings, and the use of combine phases can also be seen. Within the main() function, calls to Hadoop Pipes 4 Python create the job, describe the data, run the job, and retrieve and process the results. HP4P implements many of these features by executing the hadoop script, just as a user would to control the cluster.

A control has been established by modifying this program so that it does not use the cluster. The cluster-specific parts of the program have been removed; the map and reduce interface has been kept, however, and code has been added that uses this interface to drive the program. This kept the modifications as simple as possible, making them less likely to introduce bugs. The sequential program also uses the HP4P library to read and decode the input data from the distributed file system.
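The HP4P programming interface is not reproduced in this report, but based on the description above (map and reduce tasks defined at the top level, with job setup in main()), the first pass of the distributional similarity program might be sketched roughly as follows. The hp4p module and every name on it are invented for illustration; only the overall shape, a map emitting context keys with combine and reduce phases summing counts, follows the text.

    # Sketch of the first pass: find the contexts in which TARGET occurs.
    # The hp4p API used here is hypothetical.
    import hp4p

    TARGET = 'bank'

    def map_task(key, tokens, emit):
        # tokens is assumed to be the tokenised text of one story.
        for i in range(1, len(tokens) - 1):
            if tokens[i] == TARGET:
                # A context is the word before and the word after.
                emit((tokens[i - 1], tokens[i + 1]), 1)

    def combine_task(key, counts, emit):
        # Runs inside the map task to shrink intermediate data.
        emit(key, sum(counts))

    def reduce_task(key, counts, emit):
        # Total occurrences of TARGET in this context over the corpus.
        emit(key, sum(counts))

    def main():
        job = hp4p.Job(mapper=map_task, combiner=combine_task,
                       reducer=reduce_task)
        job.set_input('/data/gigaword-nyt')     # illustrative paths
        job.set_output('/data/contexts')
        job.run()
        return job.results()

    if __name__ == '__main__':
        main()

A second job of the same shape would then read the most frequent contexts found here and emit the words occurring in them.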

4 Results

The times taken to run the two programs discussed in Section 3.3 are compared. Both run on the same hardware, except that dist_sim.py runs on a 20-node cluster; two of these nodes manage the cluster, while each of the other 18 nodes stores data and runs tasks. Several runs were made of each program and the wall-time recorded for each run. The average times can be seen in Table 1; these times exclude converting the data into a suitable format for use on the cluster.

    Program           Nodes   Average runtime (s)     Std. deviation
    dist_sim_seq.py   1       Killed after 24 hours   N/A
    dist_sim.py       20

Table 1: Average runtime of the distributional similarity programs.

Interestingly, each run of the distributed program recorded a slower time than the previous run, as can be seen in Table 2. This phenomenon is worth investigating; however, it is left for further work.

    Run   Runtime (s)

Table 2: Runtimes of the dist_sim.py program.

Six instances of the sequential program were started on separate cluster nodes so that complete results would be available sooner. All of these failed to complete within 24 hours and were killed.

It is obvious that the cluster version of the program is faster. However, the sequential version should not have been as slow as it was: 18 cluster nodes were used to run the distributed program's tasks, so the sequential program should have been at most about 18 times slower. The entire sequential program was written in Python, whereas the clustered version included Java and C++ components. This suggests that implementing parts of the program in Java and C++ yielded additional performance.

5 Evaluation

Encoding the data was slower than it should be. The data encoding work was created to avoid having an individual node scan through large amounts of data in order to reach the section it is interested in. Given that the encoding process is slow, it may be faster for each process to naively scan through all the data, although this does not allow arbitrary objects to be encoded using Python's Pickle library. Alternatively, the cluster itself can be used to encode the data into the correct format, making the process much faster. This was done to encode the Gigaword data for use in the experiment; the distributed program that encoded the Gigaword data was written solely in Java.

When running MapReduce jobs, a minimal speed increase may be available by moving part of the string encoding work from Python into Java. This cannot be done for all of the code, since Python code will still need to pickle objects.

Unfortunately, the sequential version of the program did not finish, and the clustered version became slower each time it was executed. This reduces the confidence that can be placed in these results, and these problems should be investigated. Different programs should also be tested with different numbers of cluster nodes before the benefits of this work are clear.

Hadoop Pipes 4 Python has some rough edges. Further work may include making it easier to set up and use, as well as implementing new features. For example, it should allow a user to place the hp4p.jar file in the Hadoop home directory, and it should be simpler to use the output of one job as the input to another; these are rather minor changes. It may also be possible to extend HP4P to allow MapReduce jobs to access the distributed file system directly. This would be a large change involving all levels of HP4P.

This report has shown that it is possible to use the Hadoop clustering software with NLTK, and more generally with any Python program. Measurements have shown that this greatly improves the performance of the example program, although more testing is required before this conclusion can be extended to other NLP and Python programs.

References

[1] The Apache Software Foundation, 1901 Munsey Drive, Forest Hill, MD, U.S.A. Hadoop 0.17 Documentation, May 2008. r0.17.0/.

[2] David M. Beazley. SWIG 1.1 Users Manual. Department of Computer Science, University of Utah, Salt Lake City, Utah 84112, 1.1 edition, June.

[3] Steven Bird, Ewan Klein, and Edward Loper. NLTK Documentation. Online; accessed April 2008.

[4] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing in Python. Draft book, available online.

[5] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing in Python, chapter four. Draft available online. doc/en/tag.html.

[6] The Apache Software Foundation. Hadoop Pipes. Online manual, April 2008. apache/hadoop/mapred/pipes/package-summary.html.

[7] The Apache Software Foundation. Hadoop Streaming. Online manual, April 2008. streaming.html.

[8] Google. MapReduce: Simplified Data Processing on Large Clusters, December 2004.

[9] David Graff. English Gigaword. One DVD of structured text, January 2003. ISBN:

[10] Jython Project. Jython FAQ. Online; accessed April 2008. jython.org/project/userfaq.html.

Appendices

A Dependencies

The following basic dependencies are required.

C/C++ Compiler. No particular compiler is required.
Python 2.4.
Java 1.5. Tested with Sun Java.
Apache Ant.
Swig. See [2].
Apache Hadoop 0.17. See [1].

NLTK is recommended for writing NLP programs; see [4].
