Comparison of Different Implementations of Inverted Indexes in Hadoop


Hediyeh Baban, S. Kami Makki, and Stefan Andrei
Department of Computer Science, Lamar University, Beaumont, Texas

ABSTRACT

There is a growing trend of applications that need to handle Big Data, as many corporations and organizations are required to collect more data from their operations. Processing Big Data with the MapReduce model has recently become very popular, because traditional data-warehousing solutions are not feasible for handling such datasets. Hadoop provides an environment for executing MapReduce programs over distributed-memory clusters, and it supports the processing of large datasets in a distributed computing environment. Information retrieval systems facilitate searching the content of books or journals based on metadata or indexing. An inverted index is a data structure that stores a mapping from content, such as words or numbers, to its locations in one or more documents. In this paper we propose three different implementations of inverted indexes (Indexer, IndexerCombiner, and IndexerMap) in the Hadoop environment using the MapReduce programming model, and we compare their performance to evaluate the impact of factors such as data format and output file format in MapReduce.

KEYWORDS

Big Data; Cluster; E-book; Hadoop; Inverted Index.

1 INTRODUCTION

The proliferation of social networking sites, combined with the seamless interconnection of many everyday services such as healthcare, has caused the accumulation of large amounts of data. Dealing with terabytes or even petabytes of data is therefore becoming a necessary daily process for many companies [1, 2], and these companies are building environments for Big Data processing so that they can remain in business. With the development of cloud and IT technologies, processing Big Data calls for more advanced and sophisticated applications, and new programming models such as Google's MapReduce [3] and Hadoop [4] are gaining popularity for their speedy and efficient data-intensive processing abilities [3]. MapReduce is one of the most popular programming models for processing large datasets, and Hadoop is Apache's free and open-source implementation of it.

Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches in which one or more passes are made over a static set of input data; at the completion of the process, standard indexing methods support search queries in a variety of content-based applications [5]. Most modern search engines use some form of inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table that maps the words in a collection of documents to some sort of document identifier [4].

In this paper, we implement three different inverted indexes (Indexer, IndexerCombiner, and IndexerMap) for electronic documents (e.g., E-books) using the MapReduce paradigm in the Hadoop environment. The main goal is to identify and evaluate the factors that affect the execution time of these three implementations.

2 PROBLEM DESCRIPTION

An efficient method for searching the contents of E-books has become necessary with the ever-increasing use of electronic readers. Given a user query, a search method must scan terabytes of textual data, retrieve the relevant pages, and send them back to the user. This strategy is problematic for collections of large documents, because retrieving relevant information in a Big Data environment is a huge computational task: it may take several days to read such data on a stand-alone computer. The problem can be avoided, and Big Data processed more efficiently, by using large clusters of computers. In large clusters, however, machines fail occasionally and data can be lost or corrupted, so programmers must worry about a variety of processing and communication errors. In the Hadoop environment, by contrast, files are divided into uniform blocks and distributed across the nodes of a cluster. The Hadoop Distributed File System (HDFS) places the data blocks on different nodes and replicates the information stored in them, handling hardware failure and improving both performance and fault tolerance. HDFS also keeps checksums of the data for corruption detection and recovery.

3 RELATED WORK

Information retrieval is the process of finding relevant information in a set of information resources. Using an information retrieval system, one can search the content of books, journals, or other documents based on metadata or on full-text indexing [4]. Information retrieval is therefore organized into two main processes: indexing and retrieval. The indexing process preprocesses a collection of documents and stores a description of it as an index; the retrieval process issues a query against the index to find the documents relevant to the query. To handle queries, search engines need fast access to all documents containing the set of search terms [5].

3.1 Inverted Index

An inverted index is an index structure that stores a mapping from content to its location in a file [6]. Inverted indexes are used in a variety of applications, such as document retrieval systems (e.g., search engines) [5, 6], and they improve the management and retrieval time of huge amounts of information [6]. There are two main variants of inverted indexes [7]: 1) the inverted file index, which contains a list of references to documents for each word; and 2) the inverted list index, which contains the positions of each word within a document. An inverted index can also record the position of each word or term within a document [5].

[Figure 1. Simple design of an inverted index: a column of words (word1, word2, word3, ...) and a column of postings lists, each posting a (document id, payload) pair.]

This structure contains one column for the words or terms in the document (or series of documents) and another column for the postings lists. A postings list, as shown in Figure 1, is comprised of individual postings, each of which consists of a document id and information about the occurrences of the word or term in the document (the payload). In a simple inverted index, no information other than the document id is needed in the postings column; the existence of the posting itself indicates the occurrence of the word or term in the document. The most common payload, however, is the number of times a word or term occurs in the document (i.e., the term frequency). More complex payloads also include the position of every occurrence of the word in the document. Properties of the word (such as whether it occurred on a specific page) allow document ranking based on specific characteristics or notions of importance. In this paper the payload holds the word frequency and the location of the word in the document [3].
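To make the structure of Figure 1 concrete, the following is a minimal in-memory sketch in Java, assuming the payload is the term frequency; the names Posting, InvertedIndex, add, and postingsFor are our own illustration, not taken from the paper's implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** One posting: a document id plus a payload (here, the term frequency). */
    class Posting {
        final String docId;
        int termFrequency = 1;
        Posting(String docId) { this.docId = docId; }
    }

    /** A simple in-memory inverted index: word -> postings list. */
    class InvertedIndex {
        private final Map<String, List<Posting>> index = new HashMap<>();

        /** Record one occurrence of a word in a document. */
        void add(String word, String docId) {
            List<Posting> postings = index.computeIfAbsent(word, w -> new ArrayList<>());
            for (Posting p : postings) {
                if (p.docId.equals(docId)) { p.termFrequency++; return; }
            }
            postings.add(new Posting(docId));
        }

        /** The postings list for a word (empty if the word never occurred). */
        List<Posting> postingsFor(String word) {
            return index.getOrDefault(word, new ArrayList<>());
        }
    }

A production index would replace the linear scan of the postings list with a per-document map and would keep word positions as well as counts, but the word-to-postings mapping is the essential shape.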

3.2 Big Data

Traditionally, large computational problems are best tackled through a divide-and-conquer approach: partition the problem into smaller sub-problems, which can then be distributed to multiple machines and processed in parallel [3, 4]. The same strategy can be applied to Big Data. Although it improves performance, the failure of one machine in such an environment can jeopardize the result. In a distributed environment, data may not arrive at a particular point in time because of unexpected network congestion; individual compute nodes may overheat, crash, suffer hard-drive failures, or run out of memory or disk space; nodes may fail to synchronize, so locks on files may not be released on time; nodes involved in distributed atomic transactions may lose their network connections; and so on. In each of these cases, proper mechanisms must be provided so that the distributed system can recover from such failures or transient error conditions and continue to make progress.

3.3 MapReduce

In recent years, MapReduce programming has become progressively more popular for processing Big Data; Google uses MapReduce in hundreds of applications on thousands of machines with terabytes of data [3]. MapReduce is a programming model for distributed computation on huge datasets and an execution framework for large-scale data processing on clusters of computers [8, 9]. It was originally developed by Google based on the principles of parallel and distributed processing [8, 9], and it became popular through its open-source implementation, Hadoop, which was developed at Yahoo and is now an Apache project [4].

One of the most important ideas in MapReduce is to separate the distributed processing, referred to as a job, from other execution activities. The programmer submits the job to the submission node of a cluster (in Hadoop, the job tracker [4]); the execution framework (runtime) then takes care of everything else, handling all other aspects of distributed code execution and scaling from a single node to a few hundred nodes. Each job is divided into smaller parts called tasks. For example, a map task may be responsible for processing a certain block of input key-value pairs (called an input split in Hadoop), while a reduce task may handle a portion of the intermediate key space.

MapReduce also has useful fault-tolerance properties. Since tasks are executed independently and do not depend on one another for intermediate results, any map tasks that were running on a failed worker node can simply be re-executed. This gives MapReduce the flexibility to handle large-scale worker failures [3]. Another significant advantage is that MapReduce hides many system-level details from the programmer, which reduces the complexity of processing large datasets. Its synchronization is another notable feature: in general, synchronization refers to the mechanisms by which multiple concurrently running processes join together. MapReduce achieves flexible scalability by organizing blocks of data across the nodes of a cluster: the runtime automatically divides the input dataset into (as nearly as possible) equal-sized data blocks and dynamically sends each block to an available compute node for execution.

A MapReduce program consists of two components: one implementing the mapper function and one implementing the reducer function. The first phase of a MapReduce program is called mapping. A list of data elements is provided, one at a time, to the mapper function, which transforms each element of the input into an output element; that is, the mapper takes an input pair and produces a set of intermediate key-value pairs. To process a dataset with MapReduce, the programmer defines the mapper and reducer functions as follows [9]:

Map: (k1, v1) → [(k2, v2)]
Reduce: (k2, [v2]) → [(k3, v3)]

Every value v has a key k associated with it; each key identifies a value, and these key-value pairs form the basic data structure in MapReduce. The keys and values may be primitives such as integers, floating-point values, and strings, or they may be complex structures. Programmers usually need to define their own data types, although libraries such as Protocol Buffers, Thrift, and Avro can simplify the task.
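In Hadoop's Java API, these signatures correspond to the generic Mapper<K1, V1, K2, V2> and Reducer<K2, V2, K3, V3> base classes. The word-count skeleton below is a minimal, hypothetical illustration of this contract (CountMapper and CountReducer are our names): k1 is a byte offset, v1 a line of text, k2 a word, and v2/v3 counts.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: (k1 = byte offset, v1 = line of text) -> [(k2 = word, v2 = 1)]
    class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable k1, Text v1, Context context)
                throws IOException, InterruptedException {
            for (String token : v1.toString().split("\\s+")) {
                if (!token.isEmpty()) context.write(new Text(token), ONE);
            }
        }
    }

    // Reduce: (k2 = word, [v2] = list of counts) -> [(k3 = word, v3 = total)]
    class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text k2, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(k2, new IntWritable(sum));
        }
    }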

3.4 Hadoop Limitations

Each node in a Hadoop cluster typically has a few gigabytes of memory. If the input dataset is several terabytes, then holding the data in RAM would require a thousand or more computers, and no single machine would be able to process all of it. Hard drives are much larger, and a single machine can nowadays hold multiple terabytes of information on its drives; but the intermediate results generated by a large-scale computation can quickly fill more space than the original input required, and during processing some of the drives employed by the system may become full, forcing the system to route data to other nodes that can store the overflow. Finally, bandwidth is limited even on an internal network. While a set of nodes may be directly connected by high-capacity links, if all of the nodes try to transmit multiple gigabytes of data at once they can easily saturate the network's capacity, and remote procedure calls and other data-transfer requests sharing a channel may be delayed or simply dropped. To remain robust, a large-scale distributed system must manage its resources efficiently, allocating some of them to maintaining the system as a whole while devoting as much time as possible to the actual computation.

4 IMPLEMENTATION

This section presents the implementation of the E-book content-indexing system, i.e., the inverted index implemented with MapReduce in the Hadoop environment. We show three different implementations of the inverted index in MapReduce: Indexer, IndexerCombiner, and IndexerMap.

4.1 Indexer

The Indexer is the simplest implementation of the inverted index. Like any other MapReduce program, it consists of two classes (Map and Reduce). For the input of the mapper class, Hadoop splits the data according to the input format specified in the job configuration. In this project the input format is TextInputFormat, so each file in the dataset yields at least one split; a split must be smaller than 64 MB (the default block size), so if a file exceeds the block size, Hadoop divides it into more than one split. These splits are the inputs to the mapper class. In the map class, the mapper function tokenizes the words within a file as plain text and obtains the name of the file. For each word it creates a <word, file name> output pair for counting the occurrences of the word, and it forwards this output to the reducers. The output of the mapper is also called the intermediate key [10].

4.2 IndexerCombiner

In the IndexerCombiner implementation, the mapper class has the extra responsibility of partially reducing its outputs during the map phase; this reduction is called a combiner. Instead of emitting a <word, file name> pair for every word it finds, the mapper counts the occurrences of each word in its split and emits <word, <file name: count>> pairs to the reducer. The number of intermediate keys is therefore smaller than in Indexer. The reducer class of IndexerCombiner is very similar to that of Indexer; however, it handles the list of values differently.
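As a rough sketch of the Indexer's map side (our illustration, not the authors' code), the mapper can obtain the name of the file its split belongs to from the input split, a standard Hadoop idiom, and emit one <word, file name> pair per token:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    /** Emits one <word, file name> pair per token, as in the Indexer described above. */
    class IndexerMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text word = new Text();
        private final Text fileName = new Text();

        @Override
        protected void setup(Context context) {
            // Name of the file this split was taken from.
            fileName.set(((FileSplit) context.getInputSplit()).getPath().getName());
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, fileName); // intermediate <word, file name> pair
            }
        }
    }

An IndexerCombiner-style mapper would instead accumulate per-word counts in a local map (flushed in cleanup()) and emit <word, "file name:count"> pairs, shrinking the intermediate data sent to the single reducer.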

4.3 IndexerMap

Apache Hadoop's SequenceFile provides a data structure for storing key-value pairs. This structure is append-only: a specific element cannot be edited or removed from the file. A MapFile is a directory that contains two SequenceFiles: the data file ("/data") and the index file ("/index"). The data file contains all the key-value pairs in sorted order. The index file contains keys together with a LongWritable (a Hadoop-defined type) that records the starting byte position of the corresponding record; to save memory, it stores only a fraction of the keys rather than all of them.

[Figure 2. The layout of SequenceFile and MapFile files: a SequenceFile is a flat sequence of key-value records, while a MapFile pairs a sorted data file of key-value records with an index of keys.]

Similar to Indexer's reducer, the reducer of IndexerMap creates a HashMap and counts the occurrences of each file name in the list of values. Hadoop provides several OutputFormat instances for writing to files. The basic (default) instance is TextOutputFormat, which writes the key-value pairs on individual lines of a text file; the SequenceFile and MapFile layouts are shown in Figure 2.
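The choice between a plain-text index and a MapFile index is made in the job driver through the output-format class. A hypothetical driver along these lines (IndexerMapDriver is our name; the mapper and reducer classes are omitted) might look like:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;

    public class IndexerMapDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "IndexerMap");
            job.setJarByClass(IndexerMapDriver.class);
            // job.setMapperClass(...) and job.setReducerClass(...) as in the sketches above.
            job.setInputFormatClass(TextInputFormat.class);   // at least one split per file
            // Write a MapFile instead of plain text; the shuffle already delivers
            // keys to the reducer in sorted order, as the MapFile format requires.
            job.setOutputFormatClass(MapFileOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Swapping MapFileOutputFormat for the default TextOutputFormat is the only change needed to produce the text index that Indexer and IndexerCombiner write.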

5 TESTING AND RESULTS

In this project we developed the implementations against two different datasets and ran the experiments on two nodes. The reason for using two different datasets is to demonstrate the effect of the data itself on computation time and memory allocation. We tested the implementations on 1 Gigabyte of each dataset; using only a fraction of the data was necessary because of the limited HDFS capacity of our nodes, which were not dedicated to Hadoop (they were dual-boot systems with limited disk space). At the end we compare the results.

5.1 Datasets

The first dataset was collected from the Gutenberg project [11], which offers over 42,000 free E-books in several formats (text, epub, and html) and annually publishes DVD or CD images containing a large number of E-books. We downloaded the April 2010 dual-layer DVD image, which includes the project's E-books as text files; no single file was larger than 500 KB, and the overall size of the dataset was 7 Gigabytes.

The second dataset was collected from WestburyLab [12], a project developed to serve researchers in natural language processing. The data we collected from it is the Wikipedia corpus, created from a snapshot of all the articles in the English-language Wikipedia taken in April 2010. This dataset is a single large text file containing all of the articles; its size was 6 Gigabytes.

5.2 Metrics and Measured Items

To compare the different implementations we consider the following measurements. Execution time: the CPU time needed to complete a MapReduce job, and the CPU time needed to complete each phase (mapping/reducing) individually. Output file size: the size of the resulting index files for each implementation.

5.3 Web Interfaces

Hadoop lets the user track the NameNode status, running and completed jobs, and running tasks through web-page user interfaces; the user can view the status of each component by connecting to the corresponding page. We used this facility to obtain the results for this project. It also allows the user to browse the HDFS namespace and view the files inside.

6 DISCUSSION

We first ran the tests on 1 Gigabyte of the Gutenberg dataset, which consists of 3,500 small files. For each implementation, Hadoop launched 3,500 map tasks (one task per file) and one reduce task. Figure 3 shows the execution time of each implementation for this test.

[Figure 3. Comparison chart of execution time using the WestburyLab dataset.]

According to the test results, IndexerCombiner has the lowest execution time, with IndexerMap in second place. Because IndexerCombiner partially reduces the output of the mapper class, the reducer has fewer records to process; and since there is only one reducer, having fewer records affects the reducer's computation significantly. IndexerMap performs its mapping similarly to Indexer; however, writing and creating a MapFile takes more time than creating a text file as output. Even so, the total execution time is highest for Indexer.

[Figure 4. Execution time comparison by phase, using the Gutenberg dataset.]

Figure 4 shows the execution time of each phase individually. Because the reducers start while the mappers are still running, summing the execution times of the phases does not give the total execution time. Indexer and IndexerMap spent almost the same amount of time in mapping, with IndexerMap slightly slower owing to reduced available computation power (reducing takes more memory in IndexerMap). As mentioned, the longest reducing time belongs to IndexerMap, because creating the MapFile is expensive. Although IndexerCombiner performs more operations in its mapping phase, that phase still takes less time than in Indexer and IndexerMap: fewer intermediate records and a lighter reducing load free up computation power for the mapping phase and finish the job sooner.

7 CONCLUSIONS

This paper presented and analyzed three implementations of an E-book and online-document content-indexing system that builds an inverted index with MapReduce programming in the Hadoop environment. Specifically, we compared three implementations of the inverted index in MapReduce: Indexer, IndexerCombiner, and IndexerMap. The Hadoop environment has a huge impact on the overall performance of the inverted index; on a single machine without any parallel processing, the same task load on datasets of this size would take far longer. Hadoop provides several output-format instances for writing to files. The default, TextOutputFormat, writes (key, value) pairs on individual lines of a text file; another, the SequenceFile, stores the (key, value) pairs in binary form, as shown in Figure 2. A special type of SequenceFile is the MapFile, which adds an index structure to the output for faster access to the generated data; we found that a MapFile takes much more space than an ordinary text file. Overall, reducing the number of messages between map tasks and reduce tasks had the greatest effect on the execution time of the indexers. In addition, the type of output file affects the size of the output: a MapFile output is much larger than text output, and generating a MapFile takes more execution time than generating a text file.

8 REFERENCES

[1] A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in Proceedings of the International Conference on Data Engineering (ICDE '10), 2010.
[2] N. Dzugan, L. Fannin, and S. K. Makki, "A recommendation scheme utilizing collaborative filtering," in Proceedings of the International Conference for Internet Technology and Secured Transactions (ICITST).
[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, 2008.
[4] Yahoo, "Apache Hadoop." [Online]. [Accessed June 2013].
[5] J. Zobel, A. Moffat, and K. Ramamohanarao, "Inverted files versus signature files for text indexing," ACM Transactions on Database Systems (TODS).
[6] X. Liu, "Efficient maintenance scheme of inverted index for large-scale full-text retrieval," in 2nd International Conference on Future Computer and Communication (ICFCC), Wuhan, 2010.
[7] X. Liu, "An efficient random access inverted index for information retrieval," in Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA: ACM, 2010.
[8] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), San Francisco, California, 2004.
[9] J. Lin, "Exploring large data issues in the curriculum: a case study with MapReduce," in Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL '08) at ACL 2008, Columbus, Ohio, 2008.
[10] D. Jiang, B. C. Ooi, L. Shi, and S. Wu, "The performance of MapReduce: an in-depth study," in Proceedings of the VLDB Endowment, 2010.
[11] Project Gutenberg. [Online]. Available: http://www.gutenberg.org. [Accessed 3 June 2013].
[12] WestburyLab. [Online]. [Accessed 3 June 2013].
