Estimating Language Models Using Hadoop and Hbase


Estimating Language Models Using Hadoop and Hbase

Xiaoyang Yu

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2008

Abstract

This thesis presents the work of building a large scale distributed ngram language model using a MapReduce platform named Hadoop and a distributed database called Hbase. We propose a method focusing on the time cost and storage size of the model, exploring different Hbase table structures and compression approaches. The method is applied to build the training and testing processes using the Hadoop MapReduce framework and Hbase. The experiments evaluate and compare the different table structures when training unigram, bigram and trigram models on 100 million words, and the results suggest that a table based on the half ngram structure is a good choice for a distributed language model. The results of this work can be applied and further developed in machine translation and other large scale distributed language processing areas.

Acknowledgements

Many thanks to my supervisor Miles Osborne for his numerous pieces of advice, his great support, and for inspiring new ideas about this project during our meetings. I would also like to thank my parents for their trust in me and their encouragement. Thanks a lot to Zhao Rui for her great suggestions for the thesis.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Xiaoyang Yu)

Table of Contents

1 Introduction
2 Background
  2.1 Ngram Language Model
  2.2 Distributed Language Modeling
  2.3 Hadoop MapReduce Framework
  2.4 Hbase Database
3 Estimating Language Model using Map Reduce
  3.1 Generate Word Counts
  3.2 Generate Count of Counts
  3.3 Generate Good-Turing Smoothing Counts
  3.4 Generate Ngram Probability
  3.5 Generate Hbase Table
    3.5.1 n-gram Based Structure
    3.5.2 Current Word Based Structure
    3.5.3 Context Based Structure
    3.5.4 Half Ngram Based Structure
    3.5.5 Integer Based Structure
  3.6 Direct Query
  3.7 Caching Query
4 Evaluation Methods
  4.1 Time and Space for Building LM
  4.2 LM Perplexity Comparison
5 Experiments
  5.1 Data
  5.2 Ngram Order
  5.3 Table Structures
  5.4 Discussion
6 Conclusion
  6.1 Future Works
Bibliography
A Source Code
  A.1 NgramCount.java
  A.2 TableGenerator.java
  A.3 NgramModel.java

Chapter 1
Introduction

In statistical natural language processing, the ngram language model is widely used in areas such as machine translation and speech recognition. An ngram model is trained on large sets of unlabelled text to estimate a probability model. Generally speaking, more data gives a better model. Although we can easily obtain huge amounts of training text from corpora or the web, the computational power and storage space of a single computer are limited. Distributed language models have been proposed to meet the need of processing vast amounts of data and to address these problems. In the field of distributed language modeling, most of the relevant work concentrates on establishing methods for model training and testing, with little detailed discussion of the storage structure. The objective of our work is to efficiently construct and store the model in a distributed cluster environment. Moreover, our interest is to explore how well different database table structures cope with language models. The novel aspects of our project are as follows. We show that a distributed database can be well integrated with distributed language modeling, providing the input/output data sink for both the distributed training and testing processes. Using a distributed database helps to reduce the computational and storage cost. Meanwhile, the choice of database table structure strongly affects the efficiency. We find that the half ngram based table structure is a good choice for distributed language models in terms of time and space cost, compared with the other structures in our experiments. We use a distributed computing platform called Hadoop, which is an open source implementation of the MapReduce framework proposed by Google [3]. A distributed database named Hbase is used on top of Hadoop, providing a model similar to Google's Bigtable storage structure [2]. The training methods are based on Google's work on large language models in machine translation [1], and we propose a different method using Good-Turing smoothing with a back-off model and pruning.

We store the ngram model in the distributed database with 5 different table structures, and a back-off estimation is performed in a testing process using MapReduce. We propose different table structures based on the full ngram, the current word, the context, half ngrams and converted integers. Using Hadoop and Hbase, we build unigram, bigram and trigram models with 100 million words from the British National Corpus. All the table structures are evaluated and compared for each ngram order with a testing set of 35,000 words. The training and testing processes are split into several steps, and for each step the time and space cost are compared in the experiments. The perplexity of the testing set is also compared with the traditional language modeling package SRILM [6]. Based on the results, the choice of a proper table structure for efficiency is discussed. The rest of this thesis is organised as follows: Chapter 2 introduces the ngram language model, related work on distributed language modeling, the Hadoop MapReduce framework and the Hbase distributed database; Chapter 3 gives the details of our methods, illustrating all the steps in the MapReduce training and testing tasks, as well as all the different table structures we propose; Chapter 4 describes the evaluation methods we use; Chapter 5 presents all the experiments and results; Chapter 6 discusses the choice of table structure for Hadoop/Hbase and possible future work.

Chapter 2
Background

In statistical language processing, it is essential to have a language model which measures how likely a sequence of words is to occur in some context. The ngram language model is the major language modeling method, used along with smoothing and back-off methods to deal with the problem of data sparseness. Estimating it in a distributed environment enables us to process vast amounts of data and obtain a better language model. Hadoop is a distributed computing framework that can be used for language modeling, and Hbase is a distributed database which can store the model data as database tables and integrate with the Hadoop platform.

2.1 Ngram Language Model

The ngram model is the leading method for statistical language processing. It tries to predict the next word w_n given the n-1 word context w_1, w_2, ..., w_{n-1} by estimating the probability function:

P(w_n | w_1, ..., w_{n-1})    (2.1)

Usually a Markov assumption is applied: only the prior local context, the last n-1 words, affects the next word. The probability function can be expressed by the frequency of word occurrences in a corpus using Maximum Likelihood Estimation without smoothing:

p(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})    (2.2)

where f(w_1, ..., w_n) is the count of how many times we see the word sequence w_1, ..., w_n in the corpus. One important aspect is count smoothing, which adjusts the empirical counts collected from the training texts towards the expected counts for ngrams.

Considering ngrams that don't appear in the training set but are seen in the testing text, the maximum likelihood estimate of their probability would be zero, so we need a better estimate of the expected counts. A popular smoothing method called Good-Turing smoothing is based on the count of ngram counts; the expected counts are adjusted with the formula:

r* = (r + 1) N_{r+1} / N_r    (2.3)

where r is the actual count and N_r is the number of ngrams that occur exactly r times. For words that are seen frequently, r can be large and N_{r+1} is likely to be zero; in this case we can look up N_{r+2}, N_{r+3}, ..., until a nonzero count is found, and use that value instead. The idea of count smoothing is to better estimate probabilities for ngrams; meanwhile, for unseen ngrams we can do a back-off estimation using the probability of lower order tokens, which is usually more robust in the model. For an ngram w_1, ..., w_n that doesn't appear in the training set, we estimate the probability from the (n-1)-gram w_2, ..., w_n instead:

P(w_n | w_1, ..., w_{n-1}) = p(w_n | w_1, ..., w_{n-1})                            if (w_1, ..., w_n) is found
P(w_n | w_1, ..., w_{n-1}) = λ(w_1, ..., w_{n-1}) · p(w_n | w_2, ..., w_{n-1})     otherwise    (2.4)

where λ(w_1, ..., w_{n-1}) is the back-off weight. In general, back-off requires more lookups and computation. Modern back-off smoothing techniques like Kneser-Ney smoothing [5] use more parameters to estimate each ngram probability instead of a simple Maximum Likelihood Estimation. To evaluate the language model, we can calculate the cross entropy and perplexity on a testing ngram. A good language model should give a higher probability to the ngram prediction. The cross entropy is the average entropy of each word prediction in the ngram:

H(p_LM) = -(1/n) log_2 p_LM(w_1, w_2, ..., w_n) = -(1/n) Σ_{i=1}^{n} log_2 p_LM(w_i | w_1, ..., w_{i-1})    (2.5)

Perplexity is defined based on the cross entropy:

PP = 2^{H(p_LM)} = 2^{-(1/n) Σ_{i=1}^{n} log_2 p_LM(w_i | w_1, ..., w_{i-1})}    (2.6)
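To make formulas (2.5) and (2.6) concrete, the following small Java sketch computes the cross entropy and perplexity from a list of per-word model probabilities. It is only an illustration under the assumption that the probabilities are already available in memory; the class and method names are not part of the thesis code.

import java.util.List;

// Compute cross entropy (2.5) and perplexity (2.6) from the per-word
// probabilities p_LM(w_i | w_1, ..., w_{i-1}) produced by some language model.
public class Perplexity {
    public static double crossEntropy(List<Double> wordProbs) {
        double sumLog2 = 0.0;
        for (double p : wordProbs) {
            sumLog2 += Math.log(p) / Math.log(2.0);   // log base 2
        }
        return -sumLog2 / wordProbs.size();
    }

    public static double perplexity(List<Double> wordProbs) {
        return Math.pow(2.0, crossEntropy(wordProbs));   // PP = 2^H
    }

    public static void main(String[] args) {
        // a model that assigns probability 0.25 to every word has perplexity 4
        System.out.println(perplexity(List.of(0.25, 0.25, 0.25, 0.25)));
    }
}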

Generally speaking, perplexity is the average number of choices at each word prediction. So the lower the perplexity, the higher the probability of the prediction, meaning a better language model.

2.2 Distributed Language Modeling

Large scale distributed language modeling is quite a new topic. Typically using a server/client architecture as in Figure 2.1, we need to efficiently manipulate large data sets, communicate with cluster workers and organise their work. The server controls and distributes tasks to all the workers, and clients send requests to the server to execute queries or commands. Related ideas can be seen in Distributed Information Retrieval [7]. Recent work includes a method that splits a large corpus into many non-overlapping chunks and makes each worker in the cluster load one of the chunks with its suffix array index [8]. The suffix array index helps to quickly find the proper context we want, and we can count word occurrences in each worker simultaneously. In this kind of approach, the raw word counts are stored and served in the distributed system, and the clients collect the needed counts and then compute the probability. The system is applied to the N-best list re-ranking problem and shows a nice improvement on a 2.97 billion-word corpus.

Figure 2.1: Server/Client Architecture

A similar architecture is proposed later with interpolated models [4].

The corpus is also split into chunks along with their suffix array index. In addition, a smoothed n-gram model is computed and stored separately on some other workers. The client then requests both the raw word counts and the smoothed probabilities from different workers, and computes the final probability by linear weighted blending. The authors apply this approach to N-best list re-ranking and integrate it with a machine translation decoder to show a good improvement in translation quality when trained on 4 billion words. The previous two methods solve the problem of storing a large corpus and providing word counts for clients to estimate probabilities. Later work [1] describes a method that stores only the smoothed probabilities for distributed n-gram models. In the previous methods, the client needs to look up each worker to find the proper context using the suffix array index. In contrast, this method results in exactly one worker being contacted per n-gram, and exactly two workers for context-dependent backoff [1]. The authors propose a new backoff method called Stupid Backoff, a simpler scheme using only the frequencies and an empirically chosen back-off weight:

P(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})         if f(w_1, ..., w_n) > 0
P(w_n | w_1, ..., w_{n-1}) = α · f(w_2, ..., w_n) / f(w_2, ..., w_{n-1})     otherwise    (2.7)

where α is set to 0.4 based on their earlier experiments [1]. Their experiments directly store 5-gram models built from different sources. According to their results, 1.8T tokens from the web yield a 16M-word vocabulary and 300G n-grams, producing a 1.8T language model with their Stupid Backoff. These previous works are the theoretical foundations of our project. The last work used Google's distributed programming model MapReduce [3], but the authors didn't describe the way the data is stored. We adopt the MapReduce model for our work, because it provides a clear workflow and has already proved to be a mature model, widely used in various applications by Google, Yahoo and other companies. Although Google's own implementation of MapReduce is proprietary, we can still choose an open source implementation of this model.
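As a concrete illustration of formula (2.7), here is a minimal Java sketch of a Stupid Backoff score. It assumes all raw counts fit in one in-memory map keyed by the space-joined ngram and that the total token count is stored under the empty-string key; it is also written recursively, which is how the scheme is described in [1]. The class and helper names are illustrative only, and nothing here is distributed, unlike the systems discussed above.

import java.util.Arrays;
import java.util.Map;

// Sketch of Stupid Backoff (formula 2.7) over in-memory counts.
// counts maps a space-joined ngram (e.g. "a big house") to its frequency;
// the total token count is assumed to be stored under the key "".
public class StupidBackoff {
    private static final double ALPHA = 0.4;

    public static double score(String[] ngram, Map<String, Long> counts) {
        if (ngram.length == 1) {
            // unigram: relative frequency against the total token count
            return counts.getOrDefault(join(ngram), 0L) / (double) counts.get("");
        }
        long full = counts.getOrDefault(join(ngram), 0L);
        String[] context = Arrays.copyOfRange(ngram, 0, ngram.length - 1);
        long ctx = counts.getOrDefault(join(context), 0L);
        if (full > 0 && ctx > 0) {
            return full / (double) ctx;
        }
        // not found: back off to the shorter ngram, scaled by alpha
        return ALPHA * score(Arrays.copyOfRange(ngram, 1, ngram.length), counts);
    }

    private static String join(String[] words) {
        return String.join(" ", words);
    }
}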

2.3 Hadoop MapReduce Framework

Hadoop is an open source implementation of the MapReduce programming model. It is based on Java and uses the Hadoop Distributed File System (HDFS) to create multiple replicas of data blocks for reliability, distributing them around the clusters and splitting the task into small blocks. According to their website, Hadoop has been demonstrated on 2,000 nodes and is designed to support clusters of up to 10,000 nodes, so it enables us to extend our clusters in the future. A general MapReduce architecture can be illustrated as in Figure 2.2. At first the input files are split into small blocks named FileSplits, and the Map operation is parallelized with one task per FileSplit.

Figure 2.2: MapReduce Architecture

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Each FileSplit input is treated as key/value pairs, and the user specifies a Map function to process a key/value pair and generate a set of intermediate key/value pairs [3]. When the Map operation is finished, the output is passed to the Partitioner, usually a hash function, so that all the pairs sharing the same key can be collected together later on. After the intermediate pairs are generated, a Combine function is called to do a reduce-like job in each Map node to speed up the processing. Then a Reduce function merges all intermediate values associated with the same intermediate key and writes them to output files. Map and Reduce operations work independently on small blocks of data. The final output is one file per executed reduce task, and Hadoop stores the output files in HDFS. For input text files, each line is parsed as one value string, so the Map function works at the sentence level; for output files, the format is one key/value pair per record, and thus if we want to reprocess the output files, the task works at the record pair level.
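To make this data flow concrete, here is a minimal, self-contained word-count job written against the newer org.apache.hadoop.mapreduce API (Job, Mapper, Reducer). The thesis itself targets the 2008-era mapred API with OutputCollector, so the class names differ; this sketch only illustrates the map/combine/reduce pattern and is not the thesis code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map: (offset, line) -> (word, 1)
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // combine/reduce: (word, [1, 1, ...]) -> (word, sum)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // reduce-like combine on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}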

For language model training using MapReduce, our original inputs are text files, and what we will finally get from Hadoop are ngram/probability pairs. This means that, in principle, we could build the language model using Hadoop alone. This chapter gives a brief overview of ngram language models, the smoothing and back-off methods, the relevant work on distributed language modeling, the Hadoop framework and the Hbase database.

2.4 Hbase Database

Hbase is a distributed database on top of HDFS whose structure is very similar to Google's Bigtable model. The reason we look into using Hbase for language modeling is that, although we could do all our work in Hadoop and store all outputs as text or binary files in HDFS, this would consume more time and extra computation for parsing input records. The bottleneck of Hadoop MapReduce is that its input/output is file based, either plain text or binary files, which is reasonable for storing large amounts of data but not suitable for query processing. For example, if we want to query the probability of one ngram, we have to load all the files into the map function, parse all of them, and compare each key with the ngram to find the probability value. We basically need to do this comparison for each ngram in the test texts, which takes quite a long time. Instead of parsing files, we can make use of a database structure such as Hbase to store the ngram probabilities in a database table. The advantage is obvious: a database structure is designed to meet the needs of multiple queries; the data is indexed and compressed, reducing the storage size; and tables can be easily created, modified, updated or deleted. Since the table is stored in the distributed filesystem, it provides scalable and reliable storage. Meanwhile, for language modeling, the model data is highly structured. The basic format is the ngram/probability pair, which can easily be organised into more compact structures. Considering that we may get huge amounts of data, compressed structures are essential from both the time and the storage aspects. Hbase stores data in labelled tables. The table is designed to have a sparse structure. Data is stored in table rows, and each row has a unique key with an arbitrary number of columns, so one row may have thousands of columns while another row has only one. In addition, a column name is a string of the <family>:<label> form, where <family> is a column family name assigned to a group of columns, and the label can be any string. The concept of column families is that only administrative operations can modify family names, but the user can create arbitrary labels on demand.

A conceptual view of an Hbase table can be illustrated below:

Row Key (car)        Column color:   Column prize:   Column size:
Mini Cooper          color:red                       medium
Volkswagen Beetle    color:green                     small
BMW                                  2.6k

Another important aspect of Hbase is that the table is column oriented, which means the tables are physically stored per column. Each column is stored in one file split, each column family is stored closely together in HDFS, and the empty cells in a column are not stored at all. This feature implies that in Hbase it is less expensive to retrieve a column than a row, because to retrieve a row the client must request all the column splits, whereas to retrieve a column only one column split is requested, which is basically a single file in HDFS. On the other hand, writing to a table is row based. Only one row is locked for updating, so all writes among the clusters are atomic by default. The relationship between Hadoop, Hbase and HDFS is illustrated in Figure 2.3.

Figure 2.3: Relationship between Hadoop, Hbase and HDFS

The above overview covers the basic techniques and tools used in this project. The next chapter describes our methods for building the language modeling process for an ngram model with smoothing and back-off using Hadoop and Hbase.
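Before moving on to those methods, a small sketch of how a client writes and reads a cell in such a table may be useful. It uses the modern HBase Java client classes (Connection, Table, Put, Get), which postdate the 2008-era interface used in the thesis, and the table, row and column names are illustrative; the gt:prob column naming anticipates the structures described in Chapter 3.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: store and retrieve one ngram probability as a single table cell.
public class NgramTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("ngram"))) {
      // write: row key "a big house", column family "gt", label "prob"
      Put put = new Put(Bytes.toBytes("a big house"));
      put.addColumn(Bytes.toBytes("gt"), Bytes.toBytes("prob"), Bytes.toBytes("1.0"));
      table.put(put);

      // read the same cell back
      Get get = new Get(Bytes.toBytes("a big house"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("gt"), Bytes.toBytes("prob"));
      System.out.println(Bytes.toString(value));
    }
  }
}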

Chapter 3
Estimating Language Model using Map Reduce

The distributed training process described in Google's work [1] is split into three steps: convert words to ids, generate ngrams per sentence, and compute ngram probabilities. We extend it with a Good-Turing smoothing estimation, so extra steps are included to calculate the count of ngram counts, store the counts in HDFS, and then fetch the data to adjust the raw counts. We decide to store the ngram strings directly, considering that more training data can be added and updated later on. We estimate a back-off model to compute the probability, and several database table structures are designed for comparison. The testing process also uses MapReduce, acting like a distributed decoder, so we can process multiple testing texts together. Figure 3.1 is the flow chart for the training process. Figure 3.2 shows a simplified testing process. The rest of this chapter explains the details of each step in the training and testing process.

3.1 Generate Word Counts

The first step is to parse the training text, find all the ngrams, and emit their counts. The map function reads one text line as input. The key is the docid, and the value is the text. Each line is split into all the unigrams, bigrams, trigrams up to ngrams. These ngrams are the intermediate keys, and the values are a single count of 1. Then a combiner function sums up all the values for the same key within each Map task, and a reduce function, identical to the combiner, collects all the output from the combiners and sums up the values for the same key. The final key is the same as the map function's output, which is the ngram, and the value is the raw count of this ngram throughout the training data set.

Figure 3.1: Training Process

Figure 3.2: Testing Process

A partitioner based on the hashcode of the first two words is used, which makes sure not only that all the values with the same key go into one reduce function, but also that the average load is balanced [1]. We also include a pruning count; any raw count below this pruning count is dropped. The simplified map and reduce functions can be illustrated as below:

map(LongWritable docid, Text line, OutputCollector<Text, IntWritable> output) {
  words[] = line.split(blank space or punctuation)
  for i = 1..order
    for k = 0..(words.length - i)
      output.collect(words[k..(k+i-1)], 1)
}

reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) {
  int sum = 0;
  while (values.hasNext())
    sum += values.next().get();
  if (sum > prune)
    output.collect(key, sum);
}

Figure 3.3 describes the process of the map - combine - reduce functions given some sample text. The combine function starts right after map, so it inherits the same key/value pairs from its preceding map task. The output from the reduce function is the raw counts; the keys are also sorted, implying some indexing process. This feature can be used in enhancements for fast lookups.

Figure 3.3: Generate Raw Counts

Because this step generates all the ngrams, it is possible to collect the total number of unigram counts, bigram counts, etc. These numbers are necessary for smoothing techniques. Here we only collect the total unigram count for Good-Turing smoothing. It would be easy to collect the total bigram or trigram counts in a similar way, which would be needed for Kneser-Ney smoothing.

enum MyCounter { INPUT_WORDS }

reduce(Text key, Iterator<IntWritable> values,
       OutputCollector<Text, IntWritable> output, Reporter reporter) {
  ...
  if (sum > prune)
    output.collect(key, sum);
  if (key is unigram)
    reporter.incrCounter(MyCounter.INPUT_WORDS, 1);
}

3.2 Generate Count of Counts

Good-Turing smoothing is based on the count of ngram counts. We need to collect all the counts of counts for unigrams, bigrams, trigrams up to ngrams. To do this, all the raw counts are loaded into a new MapReduce job. For each ngram, the map function emits one count of the raw count along with the ngram order, so the output key has the format <ngram-order raw-count>, and the value is the <count-of-counts>. The combine and reduce functions merge all the counts of counts with the same key. The final output should be fairly small; normally a single file is enough to store all the counts of counts. The simplified map function looks like this:

// input key is the ngram, input value is the raw counts
public void map(Text key, IntWritable value, OutputCollector<Text, IntWritable> output) {
  words[] = toStringArray(key);
  // words.length is the ngram order; combine the order with the counts
  String combine = words.length + " " + value.toString();
  output.collect(toText(combine), one);
}

The combine and reduce functions are the same as in the previous step. An example of this MapReduce job is illustrated in Figure 3.4. One important point about the count of counts is that for higher counts, the count of counts is usually discontinuous. This always happens for frequent ngrams, which have quite high counts. We don't have to store or estimate these empty counts of counts; instead we can adjust the smoothing formula used in the next step.

Figure 3.4: Generate Count of Counts

3.3 Generate Good-Turing Smoothing Counts

Now that we have both the raw counts and the counts of counts, we can estimate the smoothed count for each ngram following the Good-Turing smoothing formula. In this step, the inputs are still the ngrams and their raw counts, and each map function reads in the count-of-counts file, stores all the data in a HashMap structure and computes the smoothed counts. The basic formula is:

r* = (r + 1) N_{r+1} / N_r    (3.1)

If we can find both N_{r+1} and N_r in the HashMap, the formula can be applied directly. Otherwise, we try to look up the nearest count of counts, e.g. if we can't find N_{r+1}, we try N_{r+2}, then N_{r+3}, N_{r+4}, etc. We decide to look up at most 5 counts, N_{r+1} .. N_{r+5}. If we can't find any of these counts, we use the raw count instead. In this situation, the raw count is typically very large, meaning the ngram has a relatively high probability, so we don't have to adjust the count. For each ngram, the smoothing process is needed only once, so we don't actually need any combine or reduce function for count smoothing. The map function can be illustrated as:

HashMap countHash;

void configure() {
  reader = openFile(count-of-counts);
  // k is <ngram-order raw-count>, v is the count of counts
  while (reader.next(k, v)) {
    order = k[0];
    counts = k[1..length-1];
    countHash.put(order, (counts, v));
  }
  reader.close();
}

// input key is the ngram, value is the raw counts
void map(Text key, IntWritable value, OutputCollector<Text, Text> output) {
  for i = 1..5
    if (exists countHash.get(key.length).get(value)
        && exists countHash.get(key.length).get(value+i)) {
      Nr  = countHash.get(key.length).get(value);
      Nr1 = countHash.get(key.length).get(value+i);
      c = (value+1) * Nr1 / Nr;
      break;
    } else
      c = value;
  // c is the smoothed count for ngram key
  ...
}

Figure 3.5 shows an example of the Good-Turing smoothing.

3.4 Generate Ngram Probability

To estimate the probability of one ngram w_1, w_2, ..., w_n, we need the counts of w_1..w_n and w_1..w_{n-1}. Because a MapReduce function, either map or reduce, works on one key at a time, in order to collect both of the above counts we need to emit the two counts to one reduce function. The approach is to use the current context w_1, w_2, ..., w_{n-1} as the key, and combine the current word w_n with the counts as the value.

Figure 3.5: Good-Turing Smoothing

This can be done in the map function of the previous step. Notice that except for the highest order ngrams, all lower order ngrams also need to be emitted with their own smoothed counts, providing the counts for acting as contexts themselves. Figure 3.6 gives an example of the emitted context with current word format.

...
// c is the smoothed count for ngram key
// suffix is the current word, the last item in the string array
suffix = key[length-1];
context = key[0..length-2];
output.collect(context, ("\t" + suffix + "\t" + c));
if (order < highest order)
  ...
  output.collect(key, c);

Now the reduce function can receive both the count of the context and the count with the current word, and computes the conditional probability of the ngram based on the formula:

p(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})    (3.2)

Figure 3.6: Emit context with current word

// the input k is the context
void reduce(Text k, Iterator v) {
  while (v.hasNext()) {
    value = v.get();
    // if it has a current word
    if (value starts with "\t") {
      get current word w
      get counts C
    }
    // if no current word
    else
      get base counts Cbase
    // compute the probability
    prob = C / Cbase;
    if (prob > 1.0)
      prob = 1.0;
    ...
  }
}

After Good-Turing smoothing, some counts might become quite small, so the probability might be over 1.0. In this case we need to adjust it down to 1.0. For the back-off model, we use the simple scheme proposed by Google [1], in which the back-off weight is set to 0.4. The value 0.4 is chosen based on empirical experiments and has proved to be a stable choice in previous work. If we wanted to estimate a dynamic back-off weight for each ngram, more steps would be required, but it has been argued that the choice of a specific back-off or smoothing method becomes less relevant as the training corpus grows large [4].

Figure 3.7: Estimate probability

In this step we obtain all the ngram probabilities, and together with the back-off weight we can estimate the probability of any testing ngram through queries. The next important step is therefore to store these probabilities in the distributed environment.

3.5 Generate Hbase Table

Hbase can be used as the data input/output sink in Hadoop MapReduce jobs, so it is straightforward to integrate it into the previous reduce function. Some modifications are needed, because writing to an Hbase table is row based: each time we need to generate a row key, with some context as the column. There are several different choices, from a simple scheme of one ngram per row to more structured rows based on the current word or the context. Two major aspects are considered: the write/query speed and the table storage size.

3.5.1 n-gram Based Structure

An initial structure is very simple, similar to the text output format. Each n-gram is stored in a separate row, so the table has a flat structure with one single column. For each row, the key is the ngram itself, and the column stores its probability. Table 3.1 is an example of this structure. This structure is easy to implement and maintain, yet it is only sorted, not compressed.

Considering that there may be many identical probabilities among different ngrams, e.g. higher order ngrams mostly appear only once, the table cannot represent all the identical probabilities in an efficient way. Also, since we only have one column, the advantages of a distributed database, namely arbitrary sparse columns, have little effect for this table.

key            column family:label (gt:prob)
a              0.11
a big          0.67
a big house    1.0
buy            0.11
buy a

Table 3.1: n-gram based table structure

3.5.2 Current Word Based Structure

Alternatively, for all the ngrams w_1, w_2, ..., w_n that share the same current word w_n, we can store them in one row with the key w_n. All the possible contexts are stored in separate columns with the column name format <column family:context>. Table 3.2 is an example. This table has a sparse column structure. For a word with many contexts the row can be quite long, while for a word with few contexts the row is short. In this table we reduce the number of rows and expand all contexts into separate columns. So instead of one single column split, we have lots of small column splits. From the view of the distributed database the data is sparsely stored, but from the view of the data structure it is still somewhat uncompressed: e.g. if two current words in two rows have the same context, or the same probability for some context, we still store them separately. This results in multiple column splits with very similar structures, which means some redundancy. We also need to collect the unigram probability for each current word and store it in a separate column. The process to create table rows can be illustrated as:

key      gt:unigram   gt:a   gt:a big   gt:i   gt:buy   ...
a
big
house
buy

Table 3.2: Word based table structure

// k is the ngram, c is the probability
for each ngram (k, c) {
  word = k[length-1];
  if (k.length == 1)
    column = "gt:unigram";
  else {
    context = k[0..length-2];
    column = "gt:context";
  }
  output.collect(word, (column, c));
}

3.5.3 Context Based Structure

Similar to the current word based structure, we can also use the context w_1, w_2, ..., w_{n-1} as the key for each row, and store all the possible following words w_n in separate columns with the format <column family:word>. This table has more rows than the word based table, but still fewer than the ngram based table. For a large data set or higher order ngrams, the number of contexts can be quite large; on the other hand, the number of columns is reduced. The column splits are separated by different words, and for all words that occur only once, the split is still very small - only one column value per split. Generally speaking, if we have n unigrams in the data set, we will have around n columns in the table. For a training set containing 100 million words, the number of unigrams is around 30,000, so the table could be really sparse. An example of this table structure is illustrated in Table 3.3.

key      gt:unigram   gt:a   gt:big   gt:i   gt:buy   gt:the   gt:house   ...
a
a big                                                          1.0
buy
buy a                        1.0
i

Table 3.3: Context based table structure

To avoid redundancy, only unigram keys are stored with their own probabilities in <gt:unigram>, since storing the probability of "a big" in <gt:unigram> would be redundant with row "a", column <gt:big>.

// k is the ngram, c is the probability
for each ngram (k, c) {
  context = k[length-1];
  if (k.length == 1)
    column = "gt:unigram";
  else {
    word = k[length-1];
    context = k[0..length-2];
    column = "gt:word";
  }
  output.collect(context, (column, c));
}

Since there may be many columns that appear only once and have the same value, typically for higher order ngrams, a possible compression is to combine these columns together, reducing the number of column splits.

3.5.4 Half Ngram Based Structure

With the previous two structures, we get either a large number of rows or a large number of columns, so there is a possible trade off between the rows and the columns. We can combine the word based and context based structures, balancing the number of rows and columns.

Our method is to split the ngram into two n/2 grams, using one n/2 gram as the row key and the rest as the column label. For example, for a 4-gram (w_1, w_2, w_3, w_4), the row key is (w_1 w_2) and the column is <gt:w_3 w_4>. An example for a 4-gram model is illustrated in Table 3.4:

key      gt:unigram   gt:a   gt:big   gt:house   gt:big house   gt:new house   ...
a
a big
buy
buy a
i

Table 3.4: Half ngram based table structure

For higher order ngrams, this structure removes a lot of rows and inserts them as columns. Theoretically the costs of splitting an ngram into word-context and context-word are identical, but the half-half split requires a bit more parsing. Similar to the previous methods, the process can be written as:

// k is the ngram, c is the probability
for each ngram (k, c) {
  half = int(length/2);
  context = k[0..half];
  if (k.length == 1)
    column = "gt:unigram";
  else {
    word = k[half..length-1];
    column = "gt:word";
  }
  output.collect(context, (column, c));
}
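To illustrate how one ngram/probability pair could be turned into a write under this half ngram layout, here is a small sketch using the modern HBase client classes Put and Bytes rather than the 2008-era API; the helper name and the rounding of the split point are choices of this sketch, not taken from the thesis code.

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Build a Put for the half ngram table: the first half of the ngram becomes the
// row key, the second half becomes the column label under family "gt"; unigrams
// use the label "unigram".
public class HalfNgramRow {
  public static Put toPut(String[] ngram, double prob) {
    int half = (ngram.length + 1) / 2;   // round up, e.g. a trigram keeps 2 words in the key
    String rowKey = String.join(" ", Arrays.copyOfRange(ngram, 0, half));
    String label = (ngram.length == 1)
        ? "unigram"
        : String.join(" ", Arrays.copyOfRange(ngram, half, ngram.length));
    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(Bytes.toBytes("gt"), Bytes.toBytes(label), Bytes.toBytes(Double.toString(prob)));
    return put;
  }
}

For the 4-gram (w_1, w_2, w_3, w_4) this yields row key "w_1 w_2" and column gt:w_3 w_4, matching Table 3.4.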

3.5.5 Integer Based Structure

Instead of storing all the strings, we can also convert all the words into integers and store the integers in the table. Extra steps are needed to convert each unigram into a unique integer and to keep the unigram-integer conversion map in the distributed filesystem. The advantage of using integers is that they are smaller than strings, since a long string is replaced by a single integer. On the other hand, we need an extra encoding/decoding process to do the conversion, which results in more computational time. This method is a trade off between computational time and storage size. This structure can also be combined with the previous methods for better compression. For the simple ngram-per-row scheme, the integer based structure can be illustrated as Table 3.5:

unigram    integer
a          1
big        2
house      3
buy        4

key        column family:label (gt:prob)

Table 3.5: Integer based table structure

Notice that if we store "a big" as 12, it might conflict with another word mapped to the number 12 in the conversion map. So we have to add a space between the numbers, similar to the ngram strings. We need an extra MapReduce job to reprocess the raw counts, convert each unigram into an integer, and store the mapping in a table:

// extra step to convert each unigram into a unique integer
// input key is the ngram, value is the raw counts
int id;
column = "convert:integer";

map(Text k, IntWritable v, OutputCollector<Text, MapWritable> output) {
  if (k.length == 1) {
    id++;
    output.collect(k, (column, id));
  }
  ...
}

Then we can query the conversion table and store integers instead of strings:

column = "convert:integer";
// k is the ngram, c is the probability
for each ngram (k, c) {
  for each k[i] in k
    // query the conversion table or file to find the integer for each k[i]
    intk[i] = get(k[i], column);
  column = "gt:prob";
  row = combineInteger(intk);
  output.collect(row, (column, c));
}

3.6 Direct Query

The next process is the testing function. The actual back-off is performed here in the query. Based on the back-off model, for each testing ngram we query the ngram; if it is not found, we query the (n-1)-gram, and so on until we reach the unigram. For the different table structures, we just need to generate different row and column names. The advantage of using MapReduce for testing is that we can put multiple testing texts into HDFS, and a MapReduce job can process all of them to generate the raw counts, just as we did in the training process.

Then for each ngram with its counts, we directly estimate the probability using the back-off model and multiply by the counts. In such a method, each distinct ngram is processed only once, which speeds up the whole process, especially for lower order ngrams. We call this method Direct Query because we query each ngram directly from the Hbase table, so the more testing ngrams we have, the more time it costs. The perplexity of the estimation is also computed and collected as an evaluation value for the language model. An example using the simple ngram based table structure is:

// the job for raw counts
// input key is the docid, value is one line in the text
map(LongWritable docid, Text line, OutputCollector<Text, IntWritable> output) {
  words[] = line.split(blank space or punctuation)
  for i = 1..order
    for k = 0..(words.length - i)
      output.collect(words[k..(k+i-1)], 1)
}

reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) {
  ...
}

// the job for estimating probability
// the input key is the ngram, value is the raw counts
column = "gt:prob";

map(Text k, IntWritable v, OutputCollector<Text, FloatWritable> output) {
  // a back-off calculation
  row = k.toString();
  alpha = 1.0;
  while (finished == false) {
    value = table.get(row, column);
    if (value != null) {
      // found the probability
      prob = alpha * value;
      finished = true;
    } else {
      if (row is a unigram) {
        // unseen unigram
        prob = unseen_prob;
        finished = true;
      } else {
        // back off to the (n-1)-gram w_2,...,w_n
        ngram = row;
        row = ngram[1..length-1];
        alpha = alpha * 0.4;
        finished = false;
      }
    }
  }
  // now we have the probability in prob
  count += v;
  if (prob > 1.0)
    prob = 1.0;
  H += v * log(prob) / log(2);
  output.collect(k, prob);
}

// compute the total perplexity
void close() {
  perplexity = pow(2.0, -H / count);
  ...
}

Figure 3.8 is an example of this direct query process. If the estimated probability is above 1.0, we adjust it down to 1.0. The map function collects the total counts and computes the total perplexity in the overridden close method. As a reference, the probability of each ngram is collected in HDFS as the output of the final reduce job.
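The same back-off walk can be expressed compactly outside MapReduce. The sketch below stands in for the Hbase lookup with a plain in-memory map from ngram string to probability; the unseen-unigram constant and the class name are placeholders of this sketch, not values from the thesis.

import java.util.Arrays;
import java.util.Map;

// Sketch of the direct query back-off of Section 3.6 against an in-memory
// ngram -> probability map standing in for the Hbase table.
public class DirectQuery {
  static final double UNSEEN_UNIGRAM_PROB = 1e-7;   // placeholder constant
  static final double BACKOFF_WEIGHT = 0.4;

  public static double probability(String[] ngram, Map<String, Double> table) {
    double alpha = 1.0;
    int start = 0;                                   // drop leading words as we back off
    while (true) {
      String key = String.join(" ", Arrays.asList(ngram).subList(start, ngram.length));
      Double p = table.get(key);
      if (p != null) {
        return Math.min(1.0, alpha * p);             // cap the estimate at 1.0
      }
      if (ngram.length - start == 1) {
        return alpha * UNSEEN_UNIGRAM_PROB;          // unseen unigram
      }
      start++;                                       // back off to the (n-1)-gram
      alpha *= BACKOFF_WEIGHT;
    }
  }
}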

Figure 3.8: Direct Query

3.7 Caching Query

There is a little more we can do to speed up the query process. Consider queries for different ngrams that share the same (n-1)-gram: in a back-off model we query the ngram first, and if it is not found, we move on to the (n-1)-gram. If we need to back off for each ngram, the same (n-1)-gram will be requested multiple times. This is where caching steps in. For each back-off step, we store the (n-1)-gram probability in a HashMap in memory inside the working node. Every time a node comes to a new (n-1)-gram back-off query, it first looks it up in the HashMap; if it is not found, it asks the Hbase table and adds the new (n-1)-gram to the HashMap. The simplified scheme can be written as:

...
// k is the ngram
HashMap cache;

while (not finished) {
  prob = table.get(k, column);
  if (prob != null)
    // found the probability, finished
  else {
    let row be the (n-1)-gram from k
    if (cache.exists(row))
      prob = cache.get(row);
    else {
      prob = table.get(row, column);
      if (prob != null) {
        // found the probability, finished
        if (number of cache.keys < maxlimit)
          cache.add(row, prob);
        else {
          cache.clear();
          cache.add(row, prob);
        }
      }
    }
  }
}
...

Figure 3.9 is an example of this caching query process. We don't need to cache the probabilities of the full ngrams, only the (n-1)-grams. There is also a maximum limit on the number of keys in the HashMap. We can't store all the (n-1)-grams in the HashMap, otherwise it would become huge and eat up all of the working node's memory. So we only store up to maxlimit (n-1)-grams, and when the count goes over the limit, the previous HashMap is dropped and filled with new items. It is like an updating process; an alternative is to delete the first key in the HashMap and push in the new one. The above methods establish the whole process of distributed language model training and testing. The next chapter describes the evaluation methods we use.
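The "delete the first key and push in a new one" alternative mentioned above is essentially an eviction policy, which Java's LinkedHashMap provides almost for free. The sketch below is one possible bounded cache; the class name and capacity handling are choices of this sketch, not the thesis implementation.

import java.util.LinkedHashMap;
import java.util.Map;

// A bounded cache for backed-off (n-1)-gram probabilities: once maxEntries is
// reached, the eldest entry is evicted instead of clearing the whole map.
public class NgramCache extends LinkedHashMap<String, Double> {
  private final int maxEntries;

  public NgramCache(int maxEntries) {
    super(16, 0.75f, true);   // access-order: recently used entries are kept longer
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
    return size() > maxEntries;
  }
}

Calling cache.put(ngram, prob) then behaves like the HashMap in the pseudocode, but the map never grows beyond maxEntries.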

Figure 3.9: Caching Query

Chapter 4
Evaluation Methods

Our major interest is to explore the computational and storage cost of building a distributed language model using the Hadoop MapReduce framework and the Hbase distributed database. The evaluation focuses on the comparison of time and space for the different table structures. Also, as a language model, the perplexity on a testing set is evaluated and compared with traditional language modeling tools.

4.1 Time and Space for Building LM

There are two major processes we can evaluate: the training process and the testing process. Because we are experimenting with different table structures, we split the training process into two steps: the first is generating the raw counts and collecting the Good-Turing smoothing parameters, the second is generating the table. The first step is the same for all tables, so we can focus on the second step for comparison. The comparison of time cost is based on the average program running time over multiple runs to avoid deviations. Network latency and other disturbances may vary the results, but the error should remain at the level of 1-2 minutes. The program calculates its running time by comparing the current system time before and after the MapReduce job is submitted and executed. Each table structure is compared at the same ngram order, and different ngram orders for one table are also compared to see the relationship between the order and the time cost of each method. To collect the size of the language model, we can use the command-line programs provided with the Hadoop framework. Because the Hbase table is actually stored in the HDFS filesystem, we can directly calculate the size of the directory created by Hbase.

Typically, Hbase creates a directory called /hbase in HDFS, and creates a sub-directory with the name of the table; e.g. if we create a table named trigram, we can calculate the size of the directory /hbase/trigram as the size of the table. This is not a perfectly accurate estimate of the table data, because the directory contains other meta-info files, but since these files are usually quite small we can count them in as well. Another point of view is that the table has two parts, the data and the info, and we calculate these two parts together.

4.2 LM Perplexity Comparison

For a language model, the perplexity on a testing set is a common evaluation of how good the model is. The perplexity is the average number of choices for each ngram prediction. Generally speaking, the better the model, the lower the perplexity. The ngram order also affects the perplexity of the same model: for a normal size training set, a higher order ngram model will always get a lower perplexity. Meanwhile, the more training data we have, the better the model we get, so the perplexity becomes lower. We can also compare the distributed language model with traditional language modeling tools like SRILM [6]. SRILM is a richly functional package for building and evaluating language models. The package is written in C++, and because it runs locally on a single computer, its processing speed is fast. The shortcoming of SRILM is that it eats up memory and may even overflow when processing huge amounts of data. Still, we can compare SRILM with our distributed language model on the same training set, as long as it is not too huge. The smoothing methods need to be nearly the same, e.g. Good-Turing smoothing. The specific parameters may vary, but using a similar smoothing method can show whether the distributed language model is stable compared with a traditional language modeling package. Applying these evaluation methods, the experiments and results are presented in the next chapter.
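For reference, the wall-clock measurement described in Section 4.1 amounts to something like the following sketch, written against the newer mapreduce Job API; the thesis code presumably used the 2008-era interface, so this is only an equivalent illustration.

import org.apache.hadoop.mapreduce.Job;

// Measure the wall-clock time of one MapReduce job, as in Section 4.1.
public class JobTimer {
  public static long runAndTime(Job job) throws Exception {
    long start = System.currentTimeMillis();
    job.waitForCompletion(true);                  // submit the job and block until it finishes
    return System.currentTimeMillis() - start;    // elapsed milliseconds
  }
}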

Chapter 5
Experiments

The experiments are done in a cluster environment with 2 working nodes and 1 master server. The working nodes run the Hadoop, HDFS and Hbase slaves, and the master server controls all of them. Each experiment is repeated three times and the average value is taken as the result. The results are shown as figures and tables for the different ngram orders of all the table structures, the time and space cost, and the perplexity on the testing set. We first compare the size and number of unique ngrams for the training and testing data taken from a 100 million-word corpus. The time and space costs for generating the raw counts, generating the table structures and finally the testing queries are compared separately. We also compare the unigram, bigram and trigram models for each step. All of the 5 table structures we proposed are compared on time cost for both the training and testing processes, as well as on table size. Finally, the perplexity for each ngram order is compared with SRILM, and the model data size from SRILM is calculated as a reference.

5.1 Data

The data is taken from the British National Corpus, which is around 100 million words. We choose some random texts from the corpus as the testing set. The testing set is about 35,000 words. All the remaining texts are used as the training set. The corpus is constructed as one sentence per line. For completeness, we include all the punctuation in a sentence as part of the ngrams, e.g. <the house 's door> is parsed as <the> <house> <'s> <door>. The approximate word counts and file sizes for the training and testing data are:

            training data   testing data
tokens      110 million     40,000
data size   541 MB          202 KB

Table 5.1: Data

Notice that the tokens include both the words and the punctuation, slightly increasing the total number of ngrams.

5.2 Ngram Order

We choose to train up to a trigram model in the training process. A count pruning threshold of 1 is applied for all the models. Table 5.2 shows the number of unique ngrams of each order in the training and testing data sets. Figure 5.1 plots the number of unique ngrams for the training and testing data. We can see from the figure that the training data has a sharper curve. Considering the different numbers of tokens, this shows that a larger data set results in more trigrams, meaning more variety.

            training data   testing data
tokens      110 million     40,000
unigram     284,921         7,303
bigram      4,321,467       25,862
trigram     9,090,713       33,120

Table 5.2: Ngram numbers for different orders

For all the different table structures, steps 1 & 2 of raw counting and Good-Turing parameter estimation are identical, so we can evaluate this part first, independently of the choice of table type. The time and space costs are compared in Table 5.3 and Table 5.4. For the testing process, only the raw counting step is required. All the raw counting outputs are stored in HDFS as compressed binary files. The trend is shown in Figure 5.2. There is a big difference between the two lines. As we can see from Table 5.3, when the number of tokens is relatively small, the processing times are nearly the same, but when the number of tokens becomes large, the differences between the unigram, bigram and trigram models increase considerably.

Figure 5.1: Unique ngram number

Figure 5.2: Time cost in raw counting

            training data                         testing data
            raw counting & parameter estimating   raw counting
tokens      110 million                           40,000
unigram     15 min 38 sec                         19 sec
bigram      41 min 6 sec                          20 sec
trigram     75 min 41 sec                         22 sec

Table 5.3: Time cost in raw counting

Another aspect affecting the time cost is that during the training process, a second MapReduce job is needed to estimate the counts of counts for Good-Turing smoothing. For this job, only one reduce task is launched on the working nodes, producing a single output file. Clearly, as the ngram order increases, the input records also become larger, requiring more processing time.

            training data                         testing data
            raw counting & parameter estimating   raw counting
tokens      110 million                           40,000
unigram     1.5 MB                                35 KB
bigram      22 MB                                 141 KB
trigram     49 MB                                 234 KB

Table 5.4: Space cost in raw counting

Table 5.4 shows the size of the raw counts file for each ngram order. We can see the size increasing rapidly, implying that the size of the table will also increase sharply. Notice that in Table 5.4 the size is given per ngram order, but in our table we need to store all the orders from unigram and bigram up to ngram, so the table's size is the total sum, further increasing the number. Figure 5.3 suggests that the training and testing data have a similar upward trend, meaning that the space cost is less dependent on the number of tokens and behaves like a monotonically increasing function of the ngram order.


More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

MapReduce With Columnar Storage

MapReduce With Columnar Storage SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

PERFORMANCE TIPS FOR BATCH JOBS

PERFORMANCE TIPS FOR BATCH JOBS PERFORMANCE TIPS FOR BATCH JOBS Here is a list of effective ways to improve performance of batch jobs. This is probably the most common performance lapse I see. The point is to avoid looping through millions

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project

Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project Paul Bone pbone@csse.unimelb.edu.au June 2008 Contents 1 Introduction 1 2 Method 2 2.1 Hadoop and Python.........................

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Bigtable is a proven design Underpins 100+ Google services:

Bigtable is a proven design Underpins 100+ Google services: Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable

More information

ibigtable: Practical Data Integrity for BigTable in Public Cloud

ibigtable: Practical Data Integrity for BigTable in Public Cloud ibigtable: Practical Data Integrity for BigTable in Public Cloud Wei Wei North Carolina State University 890 Oval Drive Raleigh, North Carolina, United States wwei5@ncsu.edu Ting Yu North Carolina State

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team Lecture Summary In this lecture, we learned about the ADT Priority Queue. A

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Hadoop Design and k-means Clustering

Hadoop Design and k-means Clustering Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless

More information

Zihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical

More information

HADOOP PERFORMANCE TUNING

HADOOP PERFORMANCE TUNING PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The

More information

Study and Comparison of Elastic Cloud Databases : Myth or Reality?

Study and Comparison of Elastic Cloud Databases : Myth or Reality? Université Catholique de Louvain Ecole Polytechnique de Louvain Computer Engineering Department Study and Comparison of Elastic Cloud Databases : Myth or Reality? Promoters: Peter Van Roy Sabri Skhiri

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24 Data Federation Administration Tool Guide Content 1 What's new in the.... 5 2 Introduction to administration

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Media Upload and Sharing Website using HBASE

Media Upload and Sharing Website using HBASE A-PDF Merger DEMO : Purchase from www.a-pdf.com to remove the watermark Media Upload and Sharing Website using HBASE Tushar Mahajan Santosh Mukherjee Shubham Mathur Agenda Motivation for the project Introduction

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

Symbol Tables. Introduction

Symbol Tables. Introduction Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

Distributed storage for structured data

Distributed storage for structured data Distributed storage for structured data Dennis Kafura CS5204 Operating Systems 1 Overview Goals scalability petabytes of data thousands of machines applicability to Google applications Google Analytics

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information