Managed N-gram Language Model Based on Hadoop Framework and HBase Tables


Tahani Mahmoud Allam, Assistant Lecturer, Computer and Automatic Control Dept., Faculty of Engineering, Tanta University, Tanta, Egypt, tahany@f-eng.tanta.edu.eg
Hatem M. Abdullkader, Information Systems Dept., Faculty of Computers and Information, Menofia University, Egypt, Hatem683@yahoo.com
Alsayed Abdelhameed Sallam, Computer and Automatic Control Dept., Faculty of Engineering, Tanta University, Tanta, Egypt, Sallam@f-eng.tanta.edu.eg

Abstract: N-grams are a building block in natural language processing and information retrieval. An N-gram is a sequence of contiguous words or other tokens in text documents. In this work, we study how N-grams can be computed efficiently using MapReduce for distributed data processing and the distributed database Hbase. This technique is applied to construct the training and testing processes using the Hadoop MapReduce framework and Hbase. We focus on the time cost and storage size of the model and explore different structures of Hbase tables. By constructing and comparing different table structures while training 1 million words for unigram, bigram and trigram models, we suggest that a table based on the half n-gram structure is the more suitable choice for a distributed language model. The results of this work can be applied in cloud computing and other large scale distributed language processing areas.

Keywords: Natural Language Processing, N-gram model, MapReduce, Hadoop framework, Hbase tables.

I. INTRODUCTION

The N-gram language model is widely used in natural language processing (NLP) fields such as machine translation and speech recognition. An N-gram model is trained on large sets of unlabelled text to estimate a probability model; generally, the more training data, the more robust the model. Although huge training texts can be obtained from corpora or the web, the storage space and computational power of a single computer are both limited. In distributed NLP, most previous work concentrated on methods for model training and testing, with little discussion of the storage structure. The efficient construction and storage of the model in a distributed cluster environment is the main objective of this work. A further interest is to examine the capability of combining different database table structures with language models.

Organization. Section II reviews the architecture and concepts of the N-gram model. Section III introduces the Hadoop MapReduce framework and the Hbase distributed database. Section IV gives the details of our methods, illustrating all the steps in the MapReduce training and testing tasks, as well as the different table structures we propose. Section V describes the evaluation method and the application we use. Section VI shows the experiments and results. Finally, Section VII concludes.

II. N-GRAM

The N-gram model is the leading method for statistical language processing [1]. The model tries to predict the next word w_n from the given n-1 words of context w_1, w_2, ..., w_{n-1} by estimating the probability function:

    P(w_n | w_1, ..., w_{n-1}).    (1)

A Markov assumption is applied: only the prior local context affects the next word [2]. Without smoothing, the probability function can be expressed through the frequency of word occurrences in a corpus using the Maximum Likelihood Estimation:

    P(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1}).    (2)

where f(w_1, ..., w_n) is the count of how many times the sequence w_1, ..., w_n is seen in the corpus.
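To make Equation (2) concrete, here is a minimal Java sketch of the maximum likelihood estimate; the counts are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the maximum likelihood estimate in Equation (2):
 *  P(w_n | w_1..w_{n-1}) = f(w_1..w_n) / f(w_1..w_{n-1}).
 *  The counts below are invented for illustration. */
public final class MleExample {
    public static void main(String[] args) {
        Map<String, Long> f = new HashMap<>();
        f.put("there is", 76L);      // f(there, is)
        f.put("there is a", 33L);    // f(there, is, a)

        double p = (double) f.get("there is a") / f.get("there is");
        System.out.printf("P(a | there is) = %.3f%n", p);  // prints 0.434
    }
}
```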
To adjust the empirical counts collected from the training texts to the expected counts for N-grams, one important aspect is count smoothing. If an N-gram does not appear in the training set but is seen in the testing text, its probability would be zero according to the maximum likelihood estimation, so we need a better estimate of the expected counts. Good-Turing smoothing is based on the count of N-gram counts and is used when the probability of a frequency is zero. The expected counts are adjusted with the formula:

    c* = (c + 1) * N_{c+1} / N_c.    (3)

where c is the actual count and N_c is the number of N-grams that occur exactly c times. For words that are seen frequently, c can be large and N_{c+1} is likely to be zero; in this case we look up N_{c+2}, ..., N_{c+5}, find a nonzero count and use that value instead.
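A minimal sketch of the adjusted-count computation in Equation (3), assuming the counts of counts N_c have already been collected into a map; the fallback probing of N_{c+2}, ..., N_{c+5} follows the description above.

```java
import java.util.Map;

/** Sketch of the Good-Turing adjusted count of Equation (3),
 *  c* = (c + 1) * N_{c+1} / N_c, where countOfCounts maps a raw
 *  count c to N_c, the number of n-grams seen exactly c times. */
public final class GoodTuring {

    public static double adjustedCount(long c, Map<Long, Long> countOfCounts) {
        Long nc = countOfCounts.get(c);
        if (nc == null || nc == 0) {
            return c;                 // N_c unknown: keep the raw count
        }
        // Look up N_{c+1}; if it is zero, probe N_{c+2} .. N_{c+5}
        // for a nonzero value, as described above.
        long ncNext = 0;
        for (long k = c + 1; k <= c + 5; k++) {
            Long candidate = countOfCounts.get(k);
            if (candidate != null && candidate > 0) {
                ncNext = candidate;
                break;
            }
        }
        if (ncNext == 0) {
            return c;                 // no nonzero neighbour found
        }
        return (c + 1) * (double) ncNext / nc;
    }
}
```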

If an N-gram does not appear in the training set, we estimate the probability from the (n-1)-gram instead:

    P(w_n | w_1, ..., w_{n-1}) = a(w_1, ..., w_{n-1}) * P(w_n | w_2, ..., w_{n-1}).    (4)

where a(w_1, ..., w_{n-1}) is the back-off weight. Modern back-off smoothing techniques like Kneser-Ney smoothing [2] use more parameters to estimate each N-gram probability instead of a simple Maximum Likelihood Estimation. To judge whether a language model is good, we evaluate the cross entropy and perplexity on testing N-grams; a good language model should assign a higher probability to the N-grams of the test text. For a test text of N words, the cross entropy is:

    H = -(1/N) * sum_{i=1..N} log_2 P(w_i | w_1, ..., w_{i-1}).    (5)

Perplexity is defined based on cross entropy:

    PP = 2^H.    (6)
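A minimal sketch of Equations (5) and (6) over a list of test N-grams; the probability lookup is a placeholder for whatever store holds the trained model.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Sketch of Equations (5) and (6): cross entropy
 *  H = -(1/N) * sum_i log2 P(ngram_i) over N test n-grams,
 *  and perplexity PP = 2^H. The probability function is a
 *  placeholder for whatever store holds the trained model. */
public final class Perplexity {

    public static double crossEntropy(List<String> testNgrams,
                                      ToDoubleFunction<String> probability) {
        double sumLog2 = 0.0;
        for (String ngram : testNgrams) {
            sumLog2 += Math.log(probability.applyAsDouble(ngram)) / Math.log(2.0);
        }
        return -sumLog2 / testNgrams.size();
    }

    public static double perplexity(List<String> testNgrams,
                                    ToDoubleFunction<String> probability) {
        return Math.pow(2.0, crossEntropy(testNgrams, probability));
    }
}
```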
III. HADOOP MAPREDUCE FRAMEWORK AND HBASE

Hadoop is an open source framework for coding and running distributed applications that process massive amounts of data [3]. MapReduce is the programming model that underlies the framework [4]. When a file is loaded, it is divided into multiple data blocks of the same size, typically 64 MB, named FileSplits, and to guarantee fault tolerance each block is replicated three times [5]. A general MapReduce architecture is illustrated in Figure 1.

Fig. 1 MapReduce Architecture.

For input text files, each line is parsed as one string, which is the value. For output files, the format is one key/value pair per record, so if we want to reprocess the output files, the task works at the record level. For language model training using MapReduce, the original inputs are text files, and what we finally get from Hadoop are ngram/probability pairs [1].

Hbase is the Hadoop database, designed to provide random, realtime read/write access to very large tables: billions of rows and millions of columns. Hbase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable [6]. To store the N-gram probabilities in a database table, we can make use of the Hbase table structure instead of parsing flat files: the data is indexed and compressed, reducing the storage size, and tables can be easily created, modified, updated or deleted. Hbase stores data in labeled tables. A column name is a string of the form <family>:<label>, where <family> is a column family name assigned to a group of columns, and <label> can be any string. The concept of column families is that only administrative operations can modify family names, but a user can create arbitrary labels on demand [1]. The relationship between Hadoop, Hbase and HDFS is illustrated in Figure 2.

Fig. 2 The relationship between Hadoop, Hbase and HDFS.
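To illustrate the <family>:<label> naming, the following sketch writes one probability cell with the modern HBase client API (1.x style); the table name, cell value and schema are illustrative, not the exact schema used in this work.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch: store one probability under column "gt:prob" for the
 *  row key "there is a". Table name and cell value are illustrative. */
public final class StoreProbability {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("ngram"))) {
            Put put = new Put(Bytes.toBytes("there is a"));  // row key: the n-gram
            put.addColumn(Bytes.toBytes("gt"),               // column family
                          Bytes.toBytes("prob"),             // label
                          Bytes.toBytes("0.33"));            // probability as a string
            table.put(put);                                  // write the cell
        }
    }
}
```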

IV. EVALUATION OF THE N-GRAM MODEL BASED ON MAPREDUCE

The distributed training process explained in Google's work [7] is split into three phases: convert words to ids, emit ngrams per sentence, and compute the probabilities of the ngrams. We propose an additional step that calculates the count of ngram counts for the Good-Turing smoothing estimation, so there are four steps in the training process, shown in Figure 3. The testing process also uses MapReduce, acting like a distributed decoder, so we can process multiple testing texts together. The target is to compare time and space across different table structures, so we can specify which type of table structure is more robust than the others. Because we use different table structures, we partition the training process into two phases: the first generates word counts and collects the Good-Turing smoothing parameters, and the second generates the Hbase table. The first phase is the same for all the Hbase tables, so we focus only on the second phase for the comparison.

Fig. 3 The flow chart of the training process.

A. Part one:

1) Generate word count

In this step we find the total count of each n-gram in the training data. The input of the map function is one text line; the key is the doc-id and the value is the text. Each line is split into all its unigrams, bigrams, trigrams and so on up to n-grams [1]. A combiner function then sums up all the values for the same key within a Map task, and a reduce function, identical to the combiner, collects the combiner output and sums the values for the same key. The final key is the same as the map function's output key, the ngram, and the value is the raw count of this ngram throughout the training data set. Figure 4 shows the map-combine-reduce process given some sample text; a code sketch of this job appears at the end of Part one.

Fig. 4 Generate word count.

2) Generate count of n-gram counts

We need to collect the counts of counts for unigrams, bigrams, trigrams up to ngrams to be able to calculate the Good-Turing smoothing counts. The map function emits one count-of-word-count record along with the ngram order. The combine and reduce functions merge all the counts of counts with the same key, and are similar to those of the previous step. A single file is enough to store all the counts of counts because the final output is small. The MapReduce job is explained in Figure 5.

Fig. 5 Generate count of n-gram counts.

3) Generate Good-Turing Smoothing count

In this step we calculate the smoothed count, only for the words that were never seen before and for the words with frequency one. We use Equation 3 to compute the smoothed count. The inputs of the map function are a word count and a count of n-gram counts; the count of counts is stored in a hash file. Figure 6 shows the flow chart of this step.

Fig. 6 Generate Good-Turing Smoothing Count.

4) Generate n-gram probability

To estimate the probability of an n-gram P(w_n | w_1, ..., w_{n-1}), we need the counts of (w_1, ..., w_n) and (w_1, ..., w_{n-1}). That is, to know the probability of a word we must know the count of the previous words that appear with it. We only need this history for the higher n-gram orders, from bigram up to n-gram; the zero-frequency words and the unigrams only need the Good-Turing smoothing to evaluate the probability. The reduce function receives both the counts of the context and the counts with the current word, and computes the conditional probability for the ngram based on Equation 2. After Good-Turing smoothing some counts might be quite small, so the probability might exceed 1.0; in this case we need to adjust it down to 1.0 by using a back-off model [8]. In this step we get all the probabilities for the ngrams, so the next important step is to store these probabilities in the Hbase table.
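The following is a minimal Hadoop sketch of the step-1 job described above (the job driver is omitted); with TextInputFormat the map key is the byte offset of the line, standing in for the doc-id, and the maximum order of 3 is an illustrative choice.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch of step 1: the mapper emits every unigram, bigram and trigram
 *  of each input line with count 1; the combiner/reducer sums the counts
 *  per n-gram. MAX_ORDER = 3 is an illustrative choice. */
public class NgramCount {
    private static final int MAX_ORDER = 3;

    public static class NgramMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // With TextInputFormat, key is the byte offset of the line
            // (standing in for the doc-id); value is one line of text.
            String[] words = value.toString().trim().split("\\s+");
            for (int i = 0; i < words.length; i++) {
                StringBuilder sb = new StringBuilder();
                for (int n = 0; n < MAX_ORDER && i + n < words.length; n++) {
                    if (n > 0) sb.append(' ');
                    sb.append(words[i + n]);
                    ngram.set(sb.toString());
                    context.write(ngram, ONE);  // one record per n-gram occurrence
                }
            }
        }
    }

    /** Registered as both combiner and reducer: sums counts per n-gram. */
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```

The same SumReducer class can serve as both the combiner and the reducer when configuring the job, which is exactly the "reduce function same as the combiner" arrangement described in step 1.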

B. Part two: Generate Hbase tables

Hbase can be used as the data input/output sink in Hadoop MapReduce jobs [9]. Some essential modifications are needed in the Hbase tables: because writing an Hbase table is row based, each time we need to generate a key with some context as the column [1]. There are several structure choices when building the tables. One structure is a simple scheme of one ngram per row; there are also structures whose rows are based on the current word or on the context. Two major points are considered: the write/query speed and the table storage size [11].

A) N-gram based structure

Each n-gram is stored in a separate row, so the table has a flat structure with one single column. The key is the n-gram itself, and the column stores its probability. Table 1 is an example of this structure. This structure is easy to implement and maintain, but the table cannot represent repeated probabilities in an efficient way.

TABLE 1 N-GRAM BASED TABLE STRUCTURE

    Key         | Column family (gt:prob)
    There       | .43
    There is    | .76
    There is a  | .33
    car         | .11

B) Current word based structure

All the ngrams that share the same current word can be stored in one row keyed by that word. In this table we reduce the number of rows and expand all the contexts into separate columns. From the view of the distributed database the data is stored sparsely, but as a data structure it is still uncompressed. We also collect the probability of the unigram for each current word and store it in a separate column <gt:unigram>. Table 2 shows this structure.

TABLE 2 CURRENT WORD BASED STRUCTURE

    key   | Column family : label
    there | gt:unigram = .13
    is    | gt:there = ...
    a     | gt:there is = ...
    car   | gt:there is a = ..., gt:unigram = ...

C) Context word based structure

Similar to the current word based structure, we can also use the context as the key for each row and store all the possible following words in separate columns with the format <column family:word>. There are more rows in this structure than in the previous one. To avoid redundancy, unigram keys are stored only with their own probabilities in <gt:unigram>. There may be many columns that appear only once and share the same value; in higher order ngrams these columns could be compressed together, reducing the column splits. Table 3 shows an example of this structure.

TABLE 3 CONTEXT WORD BASED STRUCTURE

    key      | Column family : label
    there    | gt:unigram = ..., gt:is = ...
    There is | gt:a = ...
    a        | gt:car = .65
    a car    | ...

D) Half n-gram based structure

We can aggregate the current word based and context based structures together, balancing the number of rows and columns. Our method is to partition each n-gram into two halves of n/2 words, making the first n/2 words the row key and the remaining n/2 words the column label (a key-splitting sketch follows the tables). This structure removes many rows and inserts them as columns. Table 4 shows this structure.

TABLE 4 HALF N-GRAM BASED STRUCTURE

    key      | Column family : label
    there    | gt:unigram = ..., gt:is = ...
    There is | gt:a = ..., gt:white car = ..., gt:red car = ...
    a        | gt:car = ...
    a car    | ...
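A minimal sketch of how an n-gram could be split into row key and column label for the half n-gram based structure; taking the first ceil(n/2) words as the key is our assumption about the split convention.

```java
import java.util.Arrays;

/** Sketch: split an n-gram into the row key (first half of the words)
 *  and the column label (the rest) for the half n-gram based structure.
 *  Taking the first ceil(n/2) words as the key is an assumption. */
public final class HalfNgramKey {

    public static String[] split(String ngram) {
        String[] words = ngram.trim().split("\\s+");
        int half = (words.length + 1) / 2;  // first half becomes the row key
        String rowKey = String.join(" ", Arrays.copyOfRange(words, 0, half));
        String label  = String.join(" ", Arrays.copyOfRange(words, half, words.length));
        return new String[] { rowKey, label };  // label is empty for a unigram
    }

    public static void main(String[] args) {
        String[] parts = split("there is a car");
        // prints: row key = "there is", column label = "a car"
        System.out.println("row key = \"" + parts[0]
                + "\", column label = \"" + parts[1] + "\"");
    }
}
```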
V. EVALUATION METHOD OF APPLICATION

Our major interest includes two parts: the first is how to carry out the evaluation of the training process, and the second is how to create the Hbase tables. We use the Hadoop framework to process 1 million words from the British National Corpus. The first part of the training process generates the word counts and the counts of n-gram counts and collects the Good-Turing smoothing parameters; the results of this part are discussed in the next section. The second part generates the different Hbase tables that store the n-gram probabilities and compares the different tables. The comparison of time cost is based on the average program running time, taken from multiple runs to avoid deviations.

The next section explains the experiment results. The corpus data is constructed as one sentence per line. For completeness we include all the punctuation in the sentence as part of the n-grams, e.g. <the house's door> is parsed as <the> <house> <'s> <door>. Each table structure is compared at the same n-gram order; different n-gram orders for one table are also compared, to see the relationship between the order and the time cost in each method.

VI. EXPERIMENT AND RESULTS

The experiments are done in a single node Hadoop cluster environment; the working nodes run Hadoop, HDFS and Hbase. We repeat each experiment three times and take the average value as the result. The results are shown as figures and tables for the different ngram orders of all the table structures, covering both time and space cost. The data is taken from the British National Corpus, around 15 million words; the training file size is around 75 MB, containing both the words and the punctuation. We choose to train up to a 4-gram model. The number of unique ngrams for each order of the training data set is given in Table 5.

TABLE 5 NUMBERS OF NGRAMS FOR DIFFERENT ORDERS

Figure 5 plots the unique ngram numbers for the training data. The curve gets sharper as the order of the n-gram increases: higher orders bring more distinct tokens, which means more variety.

Fig. 5 N-gram unique numbers.

When the token data are relatively small the time cost is nearly the same, but when the tokens become numerous the difference between the unigram, bigram and trigram models increases greatly. Another observed effect is that when the ngram order increases, the input records also become larger, requiring more processing time. Figure 6 shows the time cost of word counting and the parameter estimation of Good-Turing smoothing for the different n-gram orders.

Fig. 6 Time cost of word count and parameter estimation.

We now evaluate the first part, word count and Good-Turing smoothing parameters, for each table structure. Table 6 shows the time and space costs of part one; all the word counting outputs are stored in HDFS as compressed binary files.

TABLE 6 TIME COST IN WORD COUNT AND PARAMETER ESTIMATION

Table 6 also shows that the size of the word counts file for each ngram order increases rapidly, implying that the size of the table will also increase sharply. Figure 7 shows the space size for the 15 million word tokens.

Fig. 7 Space size of word count and parameter estimation.

In the second part we create Hbase tables in the four different structures and compare the time and space cost at different n-gram orders. We generate the tables with unigram, bigram, trigram and 4-gram models for each of the four table structures. Tables 7 and 8 show the time and space cost of the training process for each order when generating the tables.

TABLE 7 TIME COST IN DIFFERENT TABLE STRUCTURES

TABLE 8 SPACE COST IN DIFFERENT TABLE STRUCTURES

We calculate the size of each table structure from the size of its directory in HDFS. Because of errors caused by network latency, transmission failures and I/O delays, the time cost may vary within an error range of minutes. Figures 8 and 9 compare the time and space cost of the four table types; all the tables are constructed with the same compression option in Hbase. In Figure 8 the unigram models of all four tables behave similarly, because the structures are identical. For the bigram models the training time stays almost the same, but for the trigram and 4-gram models the time cost increases because the data increases.

Fig. 8 Time cost in different tables.

In Figure 9 the space cost keeps decreasing through type 3, while type 2 increases it slightly. For 4-gram the space cost decreases for type 3 and is almost the same for the other types. Types 2 and 4 both reduce the time cost; type 2 slightly increases the space cost, while type 3 reduces it.

Fig. 9 Space cost in different tables.

VII. CONCLUSION

This study focuses on the management of natural language processing based on the Hadoop framework. From our results, a good choice for a distributed language model using Hadoop and Hbase tables is the half ngram based structure. The plain n-gram based structure is too simple to take advantage of distributed features when considering the time and space cost, and as the n-gram order increases its time and space cost increase as well. We find that the context based structure is better than the current word based structure: rows are less expensive than columns, and fewer columns mean fewer column splits, which reduces the I/O requests. The half ngram based structure is the best choice compared with the context and current word structures. If we expand the cluster to more machines, the differences in time cost will become smaller, and a good model can be built on a table structure that holds more data. As future work the experiments can be expanded to a cluster with more machines, which would increase the computational power and reduce the time cost.

REFERENCES

[1] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, "Large language models in machine translation," in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
[2] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in Acoustics, Speech, and Signal Processing (ICASSP-95), 1995 International Conference on, vol. 1, 9-12 May 1995.
[3] C. Lam, Hadoop in Action, Manning Publications Co., 2011.
[4] K. Lee, H. Choi, and B. Moon, "Parallel data processing with MapReduce: a survey," SIGMOD Record, vol. 40, no. 4, December 2011.
[5] T. M. Allam, H. M. Abdullkader, and A. A. Sallam, "Cloud data management based on Hadoop framework," in The International Conference on Computing Technology and Information Management (SDIWC), Dubai, April 2014.
[6] Apache HBase project: https://hbase.apache.org/
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: a distributed storage system for structured data," in OSDI '06, 2006.
[8] A. Emami, K. Papineni, and J. Sorensen, "Large-scale distributed language modeling," in Acoustics, Speech and Signal Processing (ICASSP 2007),
IEEE International Conference on, vol. 4, pp. IV-37 to IV-40, April 2007.
[9] Y. Zhang, A. S. Hildebrand, and S. Vogel, "Distributed language modeling for n-best list re-ranking," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, July 2006.
[10] T. Harter, D. Borthakur, S. Dong, A. Aiyer, and L. Tang, "Analysis of HDFS under HBase: a Facebook Messages case study," University of Wisconsin, Madison, 2012.
[11] X. Yu, "Estimating language models using Hadoop and Hbase," MSc in Artificial Intelligence, University of Edinburgh, 2008.
