Estimating Language Models Using Hadoop and Hbase


Estimating Language Models Using Hadoop and Hbase

Xiaoyang Yu

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2008

Abstract

This thesis presents the work of building a large scale distributed ngram language model using a MapReduce platform named Hadoop and a distributed database called Hbase. We propose a method focusing on the time cost and storage size of the model, exploring different Hbase table structures and compression approaches. The method is applied to build the training and testing processes using the Hadoop MapReduce framework and Hbase. The experiments evaluate and compare the different table structures when training unigram, bigram and trigram models on 100 million words, and the results suggest that a table based on the half ngram structure is a good choice for a distributed language model. The results of this work can be applied and further developed in machine translation and other large scale distributed language processing areas.

Acknowledgements

Many thanks to my supervisor Miles Osborne for his numerous pieces of advice, his great support, and for inspiring new ideas about this project during our meetings. I would also like to thank my parents for their trust in me and their encouragement. Thanks a lot to Zhao Rui for her great suggestions for the thesis.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Xiaoyang Yu)

Table of Contents

1 Introduction
2 Background
  2.1 Ngram Language Model
  2.2 Distributed Language Modeling
  2.3 Hadoop MapReduce Framework
  2.4 Hbase Database
3 Estimating Language Model using Map Reduce
  3.1 Generate Word Counts
  3.2 Generate Count of Counts
  3.3 Generate Good-Turing Smoothing Counts
  3.4 Generate Ngram Probability
  3.5 Generate Hbase Table
    3.5.1 n-gram Based Structure
    3.5.2 Current Word Based Structure
    3.5.3 Context Based Structure
    3.5.4 Half Ngram Based Structure
    3.5.5 Integer Based Structure
  3.6 Direct Query
  3.7 Caching Query
4 Evaluation Methods
  4.1 Time and Space for Building LM
  4.2 LM Perplexity Comparison
5 Experiments
  5.1 Data
  5.2 Ngram Order
  5.3 Table Structures
  5.4 Discussion
6 Conclusion
  6.1 Future Works
Bibliography
A Source Code
  A.1 NgramCount.java
  A.2 TableGenerator.java
  A.3 NgramModel.java

Chapter 1
Introduction

In statistical natural language processing, the ngram language model is widely used in areas such as machine translation and speech recognition. An ngram model is trained on large sets of unlabelled text to estimate a probability model. Generally speaking, more data gives a better model. Although we can easily obtain huge amounts of training text from corpora or the web, the computational power and storage space of a single computer are limited. Distributed language models have been proposed to meet the need of processing vast amounts of data and to address these problems. In the field of distributed language modeling, most of the relevant work concentrates on establishing methods for model training and testing, with little detailed discussion of the storage structure. The objective of our work is to efficiently construct and store the model in a distributed cluster environment. Moreover, our interest is to explore how well different database table structures cope with language models. The novel aspects of our project are as follows. We show that a distributed database can be well integrated with distributed language modeling, providing the input/output data sink for both the distributed training and testing processes. Using a distributed database helps to reduce the computational and storage cost. Meanwhile, the choice of database table structure strongly affects the efficiency. We find that the half ngram based table structure is a good choice for distributed language models in terms of time and space cost, compared with the other structures in our experiments. We use a distributed computing platform called Hadoop, which is an open source implementation of the MapReduce framework proposed by Google [3]. A distributed database named Hbase is used on top of Hadoop, providing a model similar to Google's Bigtable storage structure [2]. The training methods are based on Google's work on large language models in machine translation [1], and we propose a different method using Good-Turing smoothing with a back-off model and pruning.

We store the ngram model in the distributed database with 5 different table structures, and a back-off estimation is performed in a testing process using MapReduce. We propose different table structures based on the full ngram, the current word, the context, half ngrams and converted integers. Using Hadoop and Hbase, we build unigram, bigram and trigram models with 100 million words from the British National Corpus. All the table structures are evaluated and compared for each ngram order with a testing set of 35,000 words. The training and testing processes are split into several steps, and for each step the time and space cost are compared in the experiments. The perplexity of the testing set is also compared with the traditional language modeling package SRILM [6]. Based on the results, the choice of a proper table structure for efficiency is discussed. The rest of this thesis is organised as follows: Chapter 2 introduces the ngram language model, related work on distributed language modeling, the Hadoop MapReduce framework and the Hbase distributed database; Chapter 3 gives the details of our methods, illustrating all the steps in the MapReduce training and testing tasks, as well as all the different table structures we propose; Chapter 4 describes the evaluation methods we use; Chapter 5 presents all the experiments and results; Chapter 6 discusses the choice of table structure for Hadoop/Hbase and possible future work.

Chapter 2
Background

In statistical language processing, it is essential to have a language model which measures how likely a sequence of words is to occur in some context. The ngram language model is the major language modeling method, used along with smoothing and back-off methods to deal with the problem of data sparseness. Estimating it in a distributed environment enables us to process vast amounts of data and obtain a better language model. Hadoop is a distributed computing framework that can be used for language modeling, and Hbase is a distributed database which can store the model data as database tables and integrate with the Hadoop platform.

2.1 Ngram Language Model

The ngram model is the leading method for statistical language processing. It tries to predict the next word w_n given the n-1 word context w_1, w_2, ..., w_{n-1} by estimating the probability function:

P(w_n | w_1, ..., w_{n-1})    (2.1)

Usually a Markov assumption is applied: only the prior local context, the last n-1 words, affects the next word. The probability function can be expressed by the frequency of word occurrences in a corpus using Maximum Likelihood Estimation without smoothing:

p(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})    (2.2)

where f(w_1, ..., w_n) is the count of how many times we see the word sequence w_1, ..., w_n in the corpus. One important aspect is count smoothing, which adjusts the empirical counts collected from the training texts towards the expected counts for ngrams.

Considering ngrams that don't appear in the training set but are seen in the testing text, the maximum likelihood estimate of their probability would be zero, so we need a better estimate of the expected counts. A popular smoothing method called Good-Turing smoothing is based on the count of ngram counts; the expected counts are adjusted with the formula:

r* = (r + 1) N_{r+1} / N_r    (2.3)

where r is the actual count and N_r is the number of ngrams that occur exactly r times. For words that are seen frequently, r can be large and N_{r+1} is likely to be zero; in this case we can look up N_{r+2}, N_{r+3}, ..., until a nonzero count is found, and use that value instead. The idea of count smoothing is to better estimate probabilities for ngrams; meanwhile, for unseen ngrams we can do a back-off estimation using the probability of lower order tokens, which is usually more robust in the model. For an ngram w_1, ..., w_n that doesn't appear in the training set, we estimate the probability from the (n-1)-gram w_2, ..., w_n instead:

P(w_n | w_1, ..., w_{n-1}) = p(w_n | w_1, ..., w_{n-1})                            if (w_1, ..., w_n) is found
P(w_n | w_1, ..., w_{n-1}) = λ(w_1, ..., w_{n-1}) · p(w_n | w_2, ..., w_{n-1})     otherwise    (2.4)

where λ(w_1, ..., w_{n-1}) is the back-off weight. In general, back-off requires more lookups and computation. Modern back-off smoothing techniques like Kneser-Ney smoothing [5] use more parameters to estimate each ngram probability instead of a simple Maximum Likelihood Estimation. To evaluate the language model, we can calculate the cross entropy and perplexity on a testing ngram. A good language model should give a higher probability to the ngram prediction. The cross entropy is the average entropy of each word prediction in the ngram:

H(p_LM) = -(1/n) log_2 p_LM(w_1, w_2, ..., w_n) = -(1/n) Σ_{i=1}^{n} log_2 p_LM(w_i | w_1, ..., w_{i-1})    (2.5)

Perplexity is defined based on the cross entropy:

PP = 2^{H(p_LM)} = 2^{-(1/n) Σ_{i=1}^{n} log_2 p_LM(w_i | w_1, ..., w_{i-1})}    (2.6)
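To make formulas (2.5) and (2.6) concrete, the following small Java sketch computes the cross entropy and perplexity from a list of per-word model probabilities. It is only an illustration under the assumption that the probabilities are already available in memory; the class and method names are not part of the thesis code.

import java.util.List;

// Compute cross entropy (2.5) and perplexity (2.6) from the per-word
// probabilities p_LM(w_i | w_1, ..., w_{i-1}) produced by some language model.
public class Perplexity {
    public static double crossEntropy(List<Double> wordProbs) {
        double sumLog2 = 0.0;
        for (double p : wordProbs) {
            sumLog2 += Math.log(p) / Math.log(2.0);   // log base 2
        }
        return -sumLog2 / wordProbs.size();
    }

    public static double perplexity(List<Double> wordProbs) {
        return Math.pow(2.0, crossEntropy(wordProbs));   // PP = 2^H
    }

    public static void main(String[] args) {
        // a model that assigns probability 0.25 to every word has perplexity 4
        System.out.println(perplexity(List.of(0.25, 0.25, 0.25, 0.25)));
    }
}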

Generally speaking, perplexity is the average number of choices at each word prediction. So the lower the perplexity, the higher the probability of the prediction, meaning a better language model.

2.2 Distributed Language Modeling

Large scale distributed language modeling is quite a new topic. Typically using a server/client architecture as in Figure 2.1, we need to efficiently manipulate large data sets, communicate with cluster workers and organise their work. The server controls and distributes tasks to all the workers, and clients send requests to the server to execute queries or commands. Related ideas can be seen in Distributed Information Retrieval [7]. Recent work includes a method that splits a large corpus into many non-overlapping chunks and makes each worker in the cluster load one of the chunks with its suffix array index [8]. The suffix array index helps to quickly find the proper context we want, and we can count word occurrences in each worker simultaneously. In this kind of approach, the raw word counts are stored and served in the distributed system, and the clients collect the needed counts and then compute the probability. The system is applied to the N-best list re-ranking problem and shows a nice improvement on a 2.97 billion-word corpus.

Figure 2.1: Server/Client Architecture

A similar architecture is proposed later with interpolated models [4].

The corpus is also split into chunks along with their suffix array index. In addition, a smoothed n-gram model is computed and stored separately on some other workers. The client then requests both the raw word counts and the smoothed probabilities from different workers, and computes the final probability by linear weighted blending. The authors apply this approach to N-best list re-ranking and integrate it with a machine translation decoder to show a good improvement in translation quality when trained on 4 billion words. The previous two methods solve the problem of storing a large corpus and providing word counts for clients to estimate probabilities. Later work [1] describes a method that stores only the smoothed probabilities for distributed n-gram models. In the previous methods, the client needs to look up each worker to find the proper context using the suffix array index. In contrast, this method results in exactly one worker being contacted per n-gram, and exactly two workers for context-dependent backoff [1]. The authors propose a new backoff method called Stupid Backoff, a simpler scheme using only the frequencies and an empirically chosen back-off weight:

P(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})         if f(w_1, ..., w_n) > 0
P(w_n | w_1, ..., w_{n-1}) = α · f(w_2, ..., w_n) / f(w_2, ..., w_{n-1})     otherwise    (2.7)

where α is set to 0.4 based on their earlier experiments [1]. Their experiments directly store 5-gram models built from different sources. According to their results, 1.8T tokens from the web yield a 16M-word vocabulary and 300G n-grams, producing a 1.8T language model with their Stupid Backoff. These previous works are the theoretical foundations of our project. The last work used Google's distributed programming model MapReduce [3], but the authors didn't describe the way the data is stored. We adopt the MapReduce model for our work, because it provides a clear workflow and has already proved to be a mature model, widely used in various applications by Google, Yahoo and other companies. Although Google's own implementation of MapReduce is proprietary, we can still choose an open source implementation of this model.
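As a concrete illustration of formula (2.7), here is a minimal Java sketch of a Stupid Backoff score. It assumes all raw counts fit in one in-memory map keyed by the space-joined ngram and that the total token count is stored under the empty-string key; it is also written recursively, which is how the scheme is described in [1]. The class and helper names are illustrative only, and nothing here is distributed, unlike the systems discussed above.

import java.util.Arrays;
import java.util.Map;

// Sketch of Stupid Backoff (formula 2.7) over in-memory counts.
// counts maps a space-joined ngram (e.g. "a big house") to its frequency;
// the total token count is assumed to be stored under the key "".
public class StupidBackoff {
    private static final double ALPHA = 0.4;

    public static double score(String[] ngram, Map<String, Long> counts) {
        if (ngram.length == 1) {
            // unigram: relative frequency against the total token count
            return counts.getOrDefault(join(ngram), 0L) / (double) counts.get("");
        }
        long full = counts.getOrDefault(join(ngram), 0L);
        String[] context = Arrays.copyOfRange(ngram, 0, ngram.length - 1);
        long ctx = counts.getOrDefault(join(context), 0L);
        if (full > 0 && ctx > 0) {
            return full / (double) ctx;
        }
        // not found: back off to the shorter ngram, scaled by alpha
        return ALPHA * score(Arrays.copyOfRange(ngram, 1, ngram.length), counts);
    }

    private static String join(String[] words) {
        return String.join(" ", words);
    }
}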

2.3 Hadoop MapReduce Framework

Hadoop is an open source implementation of the MapReduce programming model. It is based on Java and uses the Hadoop Distributed File System (HDFS) to create multiple replicas of data blocks for reliability, distributing them around the clusters and splitting the task into small blocks. According to their website, Hadoop has been demonstrated on 2,000 nodes and is designed to support clusters of up to 10,000 nodes, so it enables us to extend our clusters in the future. A general MapReduce architecture can be illustrated as in Figure 2.2. At first the input files are split into small blocks named FileSplits, and the Map operation is parallelized with one task per FileSplit.

Figure 2.2: MapReduce Architecture

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Each FileSplit input is treated as key/value pairs, and the user specifies a Map function to process a key/value pair and generate a set of intermediate key/value pairs [3]. When the Map operation is finished, the output is passed to the Partitioner, usually a hash function, so that all the pairs sharing the same key can be collected together later on. After the intermediate pairs are generated, a Combine function is called to do a reduce-like job in each Map node to speed up the processing. Then a Reduce function merges all intermediate values associated with the same intermediate key and writes them to output files. Map and Reduce operations work independently on small blocks of data. The final output is one file per executed reduce task, and Hadoop stores the output files in HDFS. For input text files, each line is parsed as one value string, so the Map function works at the sentence level; for output files, the format is one key/value pair per record, and thus if we want to reprocess the output files, the task works at the record pair level.
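To make this data flow concrete, here is a minimal, self-contained word-count job written against the newer org.apache.hadoop.mapreduce API (Job, Mapper, Reducer). The thesis itself targets the 2008-era mapred API with OutputCollector, so the class names differ; this sketch only illustrates the map/combine/reduce pattern and is not the thesis code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // map: (offset, line) -> (word, 1)
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // combine/reduce: (word, [1, 1, ...]) -> (word, sum)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // reduce-like combine on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}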

For language model training using MapReduce, our original inputs are text files, and what we will finally get from Hadoop are ngram/probability pairs. This means that, in principle, we could build the language model using Hadoop alone. This chapter gives a brief overview of ngram language models, the smoothing and back-off methods, the relevant work on distributed language modeling, the Hadoop framework and the Hbase database.

2.4 Hbase Database

Hbase is a distributed database on top of HDFS whose structure is very similar to Google's Bigtable model. The reason we look into using Hbase for language modeling is that, although we could do all our work in Hadoop and store all outputs as text or binary files in HDFS, this would consume more time and extra computation for parsing input records. The bottleneck of Hadoop MapReduce is that its input/output is file based, either plain text or binary files, which is reasonable for storing large amounts of data but not suitable for query processing. For example, if we want to query the probability of one ngram, we have to load all the files into the map function, parse all of them, and compare each key with the ngram to find the probability value. We basically need to do this comparison for each ngram in the test texts, which takes quite a long time. Instead of parsing files, we can make use of a database structure such as Hbase to store the ngram probabilities in a database table. The advantage is obvious: a database structure is designed to meet the needs of multiple queries; the data is indexed and compressed, reducing the storage size; and tables can be easily created, modified, updated or deleted. Since the table is stored in the distributed filesystem, it provides scalable and reliable storage. Meanwhile, for language modeling, the model data is highly structured. The basic format is the ngram/probability pair, which can easily be organised into more compact structures. Considering that we may get huge amounts of data, compressed structures are essential from both the time and the storage aspects. Hbase stores data in labelled tables. The table is designed to have a sparse structure. Data is stored in table rows, and each row has a unique key with an arbitrary number of columns, so one row may have thousands of columns while another row has only one. In addition, a column name is a string of the <family>:<label> form, where <family> is a column family name assigned to a group of columns, and the label can be any string. The concept of column families is that only administrative operations can modify family names, but the user can create arbitrary labels on demand.

A conceptual view of an Hbase table can be illustrated below:

Row Key (car)        Column color:   Column prize:   Column size:
Mini Cooper          color:red                       medium
Volkswagen Beetle    color:green                     small
BMW                                  2.6k

Another important aspect of Hbase is that the table is column oriented, which means the tables are physically stored per column. Each column is stored in one file split, each column family is stored closely together in HDFS, and the empty cells in a column are not stored at all. This feature implies that in Hbase it is less expensive to retrieve a column than a row, because to retrieve a row the client must request all the column splits, whereas to retrieve a column only one column split is requested, which is basically a single file in HDFS. On the other hand, writing to a table is row based. Only one row is locked for updating, so all writes among the clusters are atomic by default. The relationship between Hadoop, Hbase and HDFS is illustrated in Figure 2.3.

Figure 2.3: Relationship between Hadoop, Hbase and HDFS

The above overview covers the basic techniques and tools used in this project. The next chapter describes our methods for building the language modeling process for an ngram model with smoothing and back-off using Hadoop and Hbase.
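Before moving on to those methods, a small sketch of how a client writes and reads a cell in such a table may be useful. It uses the modern HBase Java client classes (Connection, Table, Put, Get), which postdate the 2008-era interface used in the thesis, and the table, row and column names are illustrative; the gt:prob column naming anticipates the structures described in Chapter 3.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: store and retrieve one ngram probability as a single table cell.
public class NgramTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("ngram"))) {
      // write: row key "a big house", column family "gt", label "prob"
      Put put = new Put(Bytes.toBytes("a big house"));
      put.addColumn(Bytes.toBytes("gt"), Bytes.toBytes("prob"), Bytes.toBytes("1.0"));
      table.put(put);

      // read the same cell back
      Get get = new Get(Bytes.toBytes("a big house"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("gt"), Bytes.toBytes("prob"));
      System.out.println(Bytes.toString(value));
    }
  }
}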

Chapter 3
Estimating Language Model using Map Reduce

The distributed training process described in Google's work [1] is split into three steps: convert words to ids, generate ngrams per sentence, and compute ngram probabilities. We extend it with a Good-Turing smoothing estimation, so extra steps are included to calculate the count of ngram counts, store the counts in HDFS, and then fetch the data to adjust the raw counts. We decide to store the ngram strings directly, considering that more training data can be added and updated later on. We estimate a back-off model to compute the probability, and several database table structures are designed for comparison. The testing process also uses MapReduce, acting like a distributed decoder, so we can process multiple testing texts together. Figure 3.1 is the flow chart for the training process. Figure 3.2 shows a simplified testing process. The rest of this chapter explains the details of each step in the training and testing process.

3.1 Generate Word Counts

The first step is to parse the training text, find all the ngrams, and emit their counts. The map function reads one text line as input. The key is the docid, and the value is the text. Each line is split into all the unigrams, bigrams, trigrams up to ngrams. These ngrams are the intermediate keys, and the values are a single count of 1. Then a combiner function sums up all the values for the same key within each Map task, and a reduce function, identical to the combiner, collects all the output from the combiners and sums up the values for the same key. The final key is the same as the map function's output, which is the ngram, and the value is the raw count of this ngram throughout the training data set.

Figure 3.1: Training Process

Figure 3.2: Testing Process

A partitioner based on the hashcode of the first two words is used, which makes sure not only that all the values with the same key go into one reduce function, but also that the average load is balanced [1]. We also include a pruning count; any raw count below this pruning count is dropped. The simplified map and reduce functions can be illustrated as below:

map(LongWritable docid, Text line, OutputCollector<Text, IntWritable> output) {
  words[] = line.split(blank space or punctuation)
  for i = 1..order
    for k = 0..(words.length - i)
      output.collect(words[k..(k+i-1)], 1)
}

reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) {
  int sum = 0;
  while (values.hasNext())
    sum += values.next().get();
  if (sum > prune)
    output.collect(key, sum);
}

Figure 3.3 describes the process of the map - combine - reduce functions given some sample text. The combine function starts right after map, so it inherits the same key/value pairs from its preceding map task. The output from the reduce function is the raw counts; the keys are also sorted, implying some indexing process. This feature can be used in enhancements for fast lookups.

Figure 3.3: Generate Raw Counts

Because this step generates all the ngrams, it is possible to collect the total number of unigram counts, bigram counts, etc. These numbers are necessary for smoothing techniques. Here we only collect the total unigram count for Good-Turing smoothing. It would be easy to collect the total bigram or trigram counts in a similar way, which would be needed for Kneser-Ney smoothing.

enum MyCounter { INPUT_WORDS }

reduce(Text key, Iterator<IntWritable> values,
       OutputCollector<Text, IntWritable> output, Reporter reporter) {
  ...
  if (sum > prune)
    output.collect(key, sum);
  if (key is unigram)
    reporter.incrCounter(MyCounter.INPUT_WORDS, 1);
}

3.2 Generate Count of Counts

Good-Turing smoothing is based on the count of ngram counts. We need to collect all the counts of counts for unigrams, bigrams, trigrams up to ngrams. To do this, all the raw counts are loaded into a new MapReduce job. For each ngram, the map function emits one count of the raw count along with the ngram order, so the output key has the format <ngram-order raw-count>, and the value is the <count-of-counts>. The combine and reduce functions merge all the counts of counts with the same key. The final output should be fairly small; normally a single file is enough to store all the counts of counts. The simplified map function looks like this:

// input key is the ngram, input value is the raw counts
public void map(Text key, IntWritable value, OutputCollector<Text, IntWritable> output) {
  words[] = toStringArray(key);
  // words.length is the ngram order; combine the order with the counts
  String combine = words.length + " " + value.toString();
  output.collect(toText(combine), one);
}

The combine and reduce functions are the same as in the previous step. An example of this MapReduce job is illustrated in Figure 3.4. One important point about the count of counts is that for higher counts, the count of counts is usually discontinuous. This always happens for frequent ngrams, which have quite high counts. We don't have to store or estimate these empty counts of counts; instead we can adjust the smoothing formula used in the next step.

Figure 3.4: Generate Count of Counts

3.3 Generate Good-Turing Smoothing Counts

Now that we have both the raw counts and the counts of counts, we can estimate the smoothed count for each ngram following the Good-Turing smoothing formula. In this step, the inputs are still the ngrams and their raw counts, and each map function reads in the count-of-counts file, stores all the data in a HashMap structure and computes the smoothed counts. The basic formula is:

r* = (r + 1) N_{r+1} / N_r    (3.1)

If we can find both N_{r+1} and N_r in the HashMap, the formula can be applied directly. Otherwise, we try to look up the nearest count of counts, e.g. if we can't find N_{r+1}, we try N_{r+2}, then N_{r+3}, N_{r+4}, etc. We decide to look up at most 5 counts, N_{r+1} .. N_{r+5}. If we can't find any of these counts, we use the raw count instead. In this situation, the raw count is typically very large, meaning the ngram has a relatively high probability, so we don't have to adjust the count. For each ngram, the smoothing process is needed only once, so we don't actually need any combine or reduce function for count smoothing. The map function can be illustrated as:

HashMap countHash;

void configure() {
  reader = openFile(count-of-counts);
  // k is <ngram-order raw-count>, v is the count of counts
  while (reader.next(k, v)) {
    order = k[0];
    counts = k[1..length-1];
    countHash.put(order, (counts, v));
  }
  reader.close();
}

// input key is the ngram, value is the raw counts
void map(Text key, IntWritable value, OutputCollector<Text, Text> output) {
  for i = 1..5
    if (exists countHash.get(key.length).get(value)
        && exists countHash.get(key.length).get(value+i)) {
      Nr  = countHash.get(key.length).get(value);
      Nr1 = countHash.get(key.length).get(value+i);
      c = (value+1) * Nr1 / Nr;
      break;
    } else
      c = value;
  // c is the smoothed count for ngram key
  ...
}

Figure 3.5 shows an example of the Good-Turing smoothing.

3.4 Generate Ngram Probability

To estimate the probability of one ngram w_1, w_2, ..., w_n, we need the counts of w_1..w_n and w_1..w_{n-1}. Because a MapReduce function, either map or reduce, works on one key at a time, in order to collect both of the above counts we need to emit the two counts to one reduce function. The approach is to use the current context w_1, w_2, ..., w_{n-1} as the key, and combine the current word w_n with the counts as the value.

Figure 3.5: Good-Turing Smoothing

This can be done in the map function of the previous step. Notice that except for the highest order ngrams, all lower order ngrams also need to be emitted with their own smoothed counts, providing the counts for acting as contexts themselves. Figure 3.6 gives an example of the emitted context with current word format.

...
// c is the smoothed count for ngram key
// suffix is the current word, the last item in the string array
suffix = key[length-1];
context = key[0..length-2];
output.collect(context, ("\t" + suffix + "\t" + c));
if (order < highest order)
  ...
  output.collect(key, c);

Now the reduce function can receive both the count of the context and the count with the current word, and computes the conditional probability of the ngram based on the formula:

p(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1})    (3.2)

Figure 3.6: Emit context with current word

// the input k is the context
void reduce(Text k, Iterator v) {
  while (v.hasNext()) {
    value = v.get();
    // if it has a current word
    if (value starts with "\t") {
      get current word w
      get counts C
    }
    // if no current word
    else
      get base counts Cbase
    // compute the probability
    prob = C / Cbase;
    if (prob > 1.0)
      prob = 1.0;
    ...
  }
}

After Good-Turing smoothing, some counts might become quite small, so the probability might be over 1.0. In this case we need to adjust it down to 1.0. For the back-off model, we use the simple scheme proposed by Google [1], in which the back-off weight is set to 0.4. The value 0.4 is chosen based on empirical experiments and has proved to be a stable choice in previous work. If we wanted to estimate a dynamic back-off weight for each ngram, more steps would be required, but it has been argued that the choice of a specific back-off or smoothing method becomes less relevant as the training corpus grows large [4].

Figure 3.7: Estimate probability

In this step we obtain all the ngram probabilities, and together with the back-off weight we can estimate the probability of any testing ngram through queries. The next important step is therefore to store these probabilities in the distributed environment.

3.5 Generate Hbase Table

Hbase can be used as the data input/output sink in Hadoop MapReduce jobs, so it is straightforward to integrate it into the previous reduce function. Some modifications are needed, because writing to an Hbase table is row based: each time we need to generate a row key, with some context as the column. There are several different choices, from a simple scheme of one ngram per row to more structured rows based on the current word or the context. Two major aspects are considered: the write/query speed and the table storage size.

3.5.1 n-gram Based Structure

An initial structure is very simple, similar to the text output format. Each n-gram is stored in a separate row, so the table has a flat structure with one single column. For each row, the key is the ngram itself, and the column stores its probability. Table 3.1 is an example of this structure. This structure is easy to implement and maintain, yet it is only sorted, not compressed.

Considering that there may be many identical probabilities among different ngrams, e.g. higher order ngrams mostly appear only once, the table cannot represent all the identical probabilities in an efficient way. Also, since we only have one column, the advantages of a distributed database, namely arbitrary sparse columns, have little effect for this table.

key            column family:label (gt:prob)
a              0.11
a big          0.67
a big house    1.0
buy            0.11
buy a

Table 3.1: n-gram based table structure

3.5.2 Current Word Based Structure

Alternatively, for all the ngrams w_1, w_2, ..., w_n that share the same current word w_n, we can store them in one row with the key w_n. All the possible contexts are stored in separate columns with the column name format <column family:context>. Table 3.2 is an example. This table has a sparse column structure. For a word with many contexts the row can be quite long, while for a word with few contexts the row is short. In this table we reduce the number of rows and expand all contexts into separate columns. So instead of one single column split, we have lots of small column splits. From the view of the distributed database the data is sparsely stored, but from the view of the data structure it is still somewhat uncompressed: e.g. if two current words in two rows have the same context, or the same probability for some context, we still store them separately. This results in multiple column splits with very similar structures, which means some redundancy. We also need to collect the unigram probability for each current word and store it in a separate column. The process to create table rows can be illustrated as:

key      gt:unigram   gt:a   gt:a big   gt:i   gt:buy   ...
a
big
house
buy

Table 3.2: Word based table structure

// k is the ngram, c is the probability
for each ngram (k, c) {
  word = k[length-1];
  if (k.length == 1)
    column = "gt:unigram";
  else {
    context = k[0..length-2];
    column = "gt:context";
  }
  output.collect(word, (column, c));
}

3.5.3 Context Based Structure

Similar to the current word based structure, we can also use the context w_1, w_2, ..., w_{n-1} as the key for each row, and store all the possible following words w_n in separate columns with the format <column family:word>. This table has more rows than the word based table, but still fewer than the ngram based table. For a large data set or higher order ngrams, the number of contexts can be quite large; on the other hand, the number of columns is reduced. The column splits are separated by different words, and for all words that occur only once, the split is still very small - only one column value per split. Generally speaking, if we have n unigrams in the data set, we will have around n columns in the table. For a training set containing 100 million words, the number of unigrams is around 30,000, so the table could be really sparse. An example of this table structure is illustrated in Table 3.3.

key      gt:unigram   gt:a   gt:big   gt:i   gt:buy   gt:the   gt:house   ...
a
a big                                                          1.0
buy
buy a                        1.0
i

Table 3.3: Context based table structure

To avoid redundancy, only unigram keys are stored with their own probabilities in <gt:unigram>, since storing the probability of "a big" in <gt:unigram> would be redundant with row "a", column <gt:big>.

// k is the ngram, c is the probability
for each ngram (k, c) {
  context = k[length-1];
  if (k.length == 1)
    column = "gt:unigram";
  else {
    word = k[length-1];
    context = k[0..length-2];
    column = "gt:word";
  }
  output.collect(context, (column, c));
}

Since there may be many columns that appear only once and have the same value, typically for higher order ngrams, a possible compression is to combine these columns together, reducing the number of column splits.

3.5.4 Half Ngram Based Structure

With the previous two structures, we get either a large number of rows or a large number of columns, so there is a possible trade off between the rows and the columns. We can combine the word based and context based structures, balancing the number of rows and columns.

Our method is to split the ngram into two n/2 grams, using one n/2 gram as the row key and the rest as the column label. For example, for a 4-gram (w_1, w_2, w_3, w_4), the row key is (w_1 w_2) and the column is <gt:w_3 w_4>. An example for a 4-gram model is illustrated in Table 3.4:

key      gt:unigram   gt:a   gt:big   gt:house   gt:big house   gt:new house   ...
a
a big
buy
buy a
i

Table 3.4: Half ngram based table structure

For higher order ngrams, this structure removes a lot of rows and inserts them as columns. Theoretically the costs of splitting an ngram into word-context and context-word are identical, but the half-half split requires a bit more parsing. Similar to the previous methods, the process can be written as:

// k is the ngram, c is the probability
for each ngram (k, c) {
  half = int(length/2);
  context = k[0..half];
  if (k.length == 1)
    column = "gt:unigram";
  else {
    word = k[half..length-1];
    column = "gt:word";
  }
  output.collect(context, (column, c));
}
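To illustrate how one ngram/probability pair could be turned into a write under this half ngram layout, here is a small sketch using the modern HBase client classes Put and Bytes rather than the 2008-era API; the helper name and the rounding of the split point are choices of this sketch, not taken from the thesis code.

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Build a Put for the half ngram table: the first half of the ngram becomes the
// row key, the second half becomes the column label under family "gt"; unigrams
// use the label "unigram".
public class HalfNgramRow {
  public static Put toPut(String[] ngram, double prob) {
    int half = (ngram.length + 1) / 2;   // round up, e.g. a trigram keeps 2 words in the key
    String rowKey = String.join(" ", Arrays.copyOfRange(ngram, 0, half));
    String label = (ngram.length == 1)
        ? "unigram"
        : String.join(" ", Arrays.copyOfRange(ngram, half, ngram.length));
    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(Bytes.toBytes("gt"), Bytes.toBytes(label), Bytes.toBytes(Double.toString(prob)));
    return put;
  }
}

For the 4-gram (w_1, w_2, w_3, w_4) this yields row key "w_1 w_2" and column gt:w_3 w_4, matching Table 3.4.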

3.5.5 Integer Based Structure

Instead of storing all the strings, we can also convert all the words into integers and store the integers in the table. Extra steps are needed to convert each unigram into a unique integer and to keep the unigram-integer conversion map in the distributed filesystem. The advantage of using integers is that they are smaller than strings, since a long string is replaced by a single integer. On the other hand, we need an extra encoding/decoding process to do the conversion, which results in more computational time. This method is a trade off between computational time and storage size. This structure can also be combined with the previous methods for better compression. For the simple ngram-per-row scheme, the integer based structure can be illustrated as Table 3.5:

unigram    integer
a          1
big        2
house      3
buy        4

key        column family:label (gt:prob)

Table 3.5: Integer based table structure

Notice that if we store "a big" as 12, it might conflict with another word mapped to the number 12 in the conversion map. So we have to add a space between the numbers, similar to the ngram strings. We need an extra MapReduce job to reprocess the raw counts, convert each unigram into an integer, and store the mapping in a table:

// extra step to convert each unigram into a unique integer
// input key is the ngram, value is the raw counts
int id;
column = "convert:integer";

map(Text k, IntWritable v, OutputCollector<Text, MapWritable> output) {
  if (k.length == 1) {
    id++;
    output.collect(k, (column, id));
  }
  ...
}

Then we can query the conversion table and store integers instead of strings:

column = "convert:integer";
// k is the ngram, c is the probability
for each ngram (k, c) {
  for each k[i] in k
    // query the conversion table or file to find the integer for each k[i]
    intk[i] = get(k[i], column);
  column = "gt:prob";
  row = combineInteger(intk);
  output.collect(row, (column, c));
}

3.6 Direct Query

The next process is the testing function. The actual back-off is performed here in the query. Based on the back-off model, for each testing ngram we query the ngram; if it is not found, we query the (n-1)-gram, and so on until we reach the unigram. For the different table structures, we just need to generate different row and column names. The advantage of using MapReduce for testing is that we can put multiple testing texts into HDFS, and a MapReduce job can process all of them to generate the raw counts, just as we did in the training process.

Then for each ngram with its counts, we directly estimate the probability using the back-off model and multiply by the counts. In such a method, each distinct ngram is processed only once, which speeds up the whole process, especially for lower order ngrams. We call this method Direct Query because we query each ngram directly from the Hbase table, so the more testing ngrams we have, the more time it costs. The perplexity of the estimation is also computed and collected as an evaluation value for the language model. An example using the simple ngram based table structure is:

// the job for raw counts
// input key is the docid, value is one line in the text
map(LongWritable docid, Text line, OutputCollector<Text, IntWritable> output) {
  words[] = line.split(blank space or punctuation)
  for i = 1..order
    for k = 0..(words.length - i)
      output.collect(words[k..(k+i-1)], 1)
}

reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output) {
  ...
}

// the job for estimating probability
// the input key is the ngram, value is the raw counts
column = "gt:prob";

map(Text k, IntWritable v, OutputCollector<Text, FloatWritable> output) {
  // a back-off calculation
  row = k.toString();
  alpha = 1.0;
  while (finished == false) {
    value = table.get(row, column);
    if (value != null) {
      // found the probability
      prob = alpha * value;
      finished = true;
    } else {
      if (row is a unigram) {
        // unseen unigram
        prob = unseen_prob;
        finished = true;
      } else {
        // back off to the (n-1)-gram w_2,...,w_n
        ngram = row;
        row = ngram[1..length-1];
        alpha = alpha * 0.4;
        finished = false;
      }
    }
  }
  // now we have the probability in prob
  count += v;
  if (prob > 1.0)
    prob = 1.0;
  H += v * log(prob) / log(2);
  output.collect(k, prob);
}

// compute the total perplexity
void close() {
  perplexity = pow(2.0, -H / count);
  ...
}

Figure 3.8 is an example of this direct query process. If the estimated probability is above 1.0, we adjust it down to 1.0. The map function collects the total counts and computes the total perplexity in the overridden close method. As a reference, the probability of each ngram is collected in HDFS as the output of the final reduce job.
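The same back-off walk can be expressed compactly outside MapReduce. The sketch below stands in for the Hbase lookup with a plain in-memory map from ngram string to probability; the unseen-unigram constant and the class name are placeholders of this sketch, not values from the thesis.

import java.util.Arrays;
import java.util.Map;

// Sketch of the direct query back-off of Section 3.6 against an in-memory
// ngram -> probability map standing in for the Hbase table.
public class DirectQuery {
  static final double UNSEEN_UNIGRAM_PROB = 1e-7;   // placeholder constant
  static final double BACKOFF_WEIGHT = 0.4;

  public static double probability(String[] ngram, Map<String, Double> table) {
    double alpha = 1.0;
    int start = 0;                                   // drop leading words as we back off
    while (true) {
      String key = String.join(" ", Arrays.asList(ngram).subList(start, ngram.length));
      Double p = table.get(key);
      if (p != null) {
        return Math.min(1.0, alpha * p);             // cap the estimate at 1.0
      }
      if (ngram.length - start == 1) {
        return alpha * UNSEEN_UNIGRAM_PROB;          // unseen unigram
      }
      start++;                                       // back off to the (n-1)-gram
      alpha *= BACKOFF_WEIGHT;
    }
  }
}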

Figure 3.8: Direct Query

3.7 Caching Query

There is a little more we can do to speed up the query process. Consider queries for different ngrams that share the same (n-1)-gram: in a back-off model we query the ngram first, and if it is not found, we move on to the (n-1)-gram. If we need to back off for each ngram, the same (n-1)-gram will be requested multiple times. This is where caching steps in. For each back-off step, we store the (n-1)-gram probability in a HashMap in memory inside the working node. Every time a node comes to a new (n-1)-gram back-off query, it first looks it up in the HashMap; if it is not found, it asks the Hbase table and adds the new (n-1)-gram to the HashMap. The simplified scheme can be written as:

...
// k is the ngram
HashMap cache;

while (not finished) {
  prob = table.get(k, column);
  if (prob != null)
    // found the probability, finished
  else {
    let row be the (n-1)-gram from k
    if (cache.exists(row))
      prob = cache.get(row);
    else {
      prob = table.get(row, column);
      if (prob != null) {
        // found the probability, finished
        if (number of cache.keys < maxlimit)
          cache.add(row, prob);
        else {
          cache.clear();
          cache.add(row, prob);
        }
      }
    }
  }
}
...

Figure 3.9 is an example of this caching query process. We don't need to cache the probabilities of the full ngrams, only the (n-1)-grams. There is also a maximum limit on the number of keys in the HashMap. We can't store all the (n-1)-grams in the HashMap, otherwise it would become huge and eat up all of the working node's memory. So we only store up to maxlimit (n-1)-grams, and when the count goes over the limit, the previous HashMap is dropped and filled with new items. It is like an updating process; an alternative is to delete the first key in the HashMap and push in the new one. The above methods establish the whole process of distributed language model training and testing. The next chapter describes the evaluation methods we use.
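The "delete the first key and push in a new one" alternative mentioned above is essentially an eviction policy, which Java's LinkedHashMap provides almost for free. The sketch below is one possible bounded cache; the class name and capacity handling are choices of this sketch, not the thesis implementation.

import java.util.LinkedHashMap;
import java.util.Map;

// A bounded cache for backed-off (n-1)-gram probabilities: once maxEntries is
// reached, the eldest entry is evicted instead of clearing the whole map.
public class NgramCache extends LinkedHashMap<String, Double> {
  private final int maxEntries;

  public NgramCache(int maxEntries) {
    super(16, 0.75f, true);   // access-order: recently used entries are kept longer
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
    return size() > maxEntries;
  }
}

Calling cache.put(ngram, prob) then behaves like the HashMap in the pseudocode, but the map never grows beyond maxEntries.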

Figure 3.9: Caching Query

Chapter 4
Evaluation Methods

Our major interest is to explore the computational and storage cost of building a distributed language model using the Hadoop MapReduce framework and the Hbase distributed database. The evaluation focuses on the comparison of time and space for the different table structures. Also, as a language model, the perplexity on a testing set is evaluated and compared with traditional language modeling tools.

4.1 Time and Space for Building LM

There are two major processes we can evaluate: the training process and the testing process. Because we are experimenting with different table structures, we split the training process into two steps: the first is generating the raw counts and collecting the Good-Turing smoothing parameters, the second is generating the table. The first step is the same for all tables, so we can focus on the second step for comparison. The comparison of time cost is based on the average program running time over multiple runs to avoid deviations. Network latency and other disturbances may vary the results, but the error should remain at the level of 1-2 minutes. The program calculates its running time by comparing the current system time before and after the MapReduce job is submitted and executed. Each table structure is compared at the same ngram order, and different ngram orders for one table are also compared to see the relationship between the order and the time cost of each method. To collect the size of the language model, we can use the command-line programs provided with the Hadoop framework. Because the Hbase table is actually stored in the HDFS filesystem, we can directly calculate the size of the directory created by Hbase.

Typically, Hbase creates a directory called /hbase in HDFS, and creates a sub-directory with the name of the table; e.g. if we create a table named trigram, we can calculate the size of the directory /hbase/trigram as the size of the table. This is not a perfectly accurate estimate of the table data, because the directory contains other meta-info files, but since these files are usually quite small we can count them in as well. Another point of view is that the table has two parts, the data and the info, and we calculate these two parts together.

4.2 LM Perplexity Comparison

For a language model, the perplexity on a testing set is a common evaluation of how good the model is. The perplexity is the average number of choices for each ngram prediction. Generally speaking, the better the model, the lower the perplexity. The ngram order also affects the perplexity of the same model: for a normal size training set, a higher order ngram model will always get a lower perplexity. Meanwhile, the more training data we have, the better the model we get, so the perplexity becomes lower. We can also compare the distributed language model with traditional language modeling tools like SRILM [6]. SRILM is a richly functional package for building and evaluating language models. The package is written in C++, and because it runs locally on a single computer, its processing speed is fast. The shortcoming of SRILM is that it eats up memory and may even overflow when processing huge amounts of data. Still, we can compare SRILM with our distributed language model on the same training set, as long as it is not too huge. The smoothing methods need to be nearly the same, e.g. Good-Turing smoothing. The specific parameters may vary, but using a similar smoothing method can show whether the distributed language model is stable compared with a traditional language modeling package. Applying these evaluation methods, the experiments and results are presented in the next chapter.
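For reference, the wall-clock measurement described in Section 4.1 amounts to something like the following sketch, written against the newer mapreduce Job API; the thesis code presumably used the 2008-era interface, so this is only an equivalent illustration.

import org.apache.hadoop.mapreduce.Job;

// Measure the wall-clock time of one MapReduce job, as in Section 4.1.
public class JobTimer {
  public static long runAndTime(Job job) throws Exception {
    long start = System.currentTimeMillis();
    job.waitForCompletion(true);                  // submit the job and block until it finishes
    return System.currentTimeMillis() - start;    // elapsed milliseconds
  }
}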

Chapter 5
Experiments

The experiments are done in a cluster environment with 2 working nodes and 1 master server. The working nodes run the Hadoop, HDFS and Hbase slaves, and the master server controls all of them. Each experiment is repeated three times and the average value is taken as the result. The results are shown as figures and tables for the different ngram orders of all the table structures, the time and space cost, and the perplexity on the testing set. We first compare the size and number of unique ngrams for the training and testing data taken from a 100 million-word corpus. The time and space costs for generating the raw counts, generating the table structures and finally the testing queries are compared separately. We also compare the unigram, bigram and trigram models for each step. All of the 5 table structures we proposed are compared on time cost for both the training and testing processes, as well as on table size. Finally, the perplexity for each ngram order is compared with SRILM, and the model data size from SRILM is calculated as a reference.

5.1 Data

The data is taken from the British National Corpus, which is around 100 million words. We choose some random texts from the corpus as the testing set. The testing set is about 35,000 words. All the remaining texts are used as the training set. The corpus is constructed as one sentence per line. For completeness, we include all the punctuation in a sentence as part of the ngrams, e.g. <the house 's door> is parsed as <the> <house> <'s> <door>. The approximate word counts and file sizes for the training and testing data are:

            training data   testing data
tokens      110 million     40,000
data size   541 MB          202 KB

Table 5.1: Data

Notice that the tokens include both the words and the punctuation, slightly increasing the total number of ngrams.

5.2 Ngram Order

We choose to train up to a trigram model in the training process. A count pruning threshold of 1 is applied for all the models. Table 5.2 shows the number of unique ngrams of each order in the training and testing data sets. Figure 5.1 plots the number of unique ngrams for the training and testing data. We can see from the figure that the training data has a sharper curve. Considering the different numbers of tokens, this shows that a larger data set results in more trigrams, meaning more variety.

            training data   testing data
tokens      110 million     40,000
unigram     284,921         7,303
bigram      4,321,467       25,862
trigram     9,090,713       33,120

Table 5.2: Ngram numbers for different orders

For all the different table structures, steps 1 & 2 of raw counting and Good-Turing parameter estimation are identical, so we can evaluate this part first, independently of the choice of table type. The time and space costs are compared in Table 5.3 and Table 5.4. For the testing process, only the raw counting step is required. All the raw counting outputs are stored in HDFS as compressed binary files. The trend is shown in Figure 5.2. There is a big difference between the two lines. As we can see from Table 5.3, when the number of tokens is relatively small, the processing times are nearly the same, but when the number of tokens becomes large, the differences between the unigram, bigram and trigram models increase considerably.

Figure 5.1: Unique ngram number

Figure 5.2: Time cost in raw counting

            training data                         testing data
            raw counting & parameter estimating   raw counting
tokens      110 million                           40,000
unigram     15 min 38 sec                         19 sec
bigram      41 min 6 sec                          20 sec
trigram     75 min 41 sec                         22 sec

Table 5.3: Time cost in raw counting

Another aspect affecting the time cost is that during the training process, a second MapReduce job is needed to estimate the counts of counts for Good-Turing smoothing. For this job, only one reduce task is launched on the working nodes, producing a single output file. Clearly, as the ngram order increases, the input records also become larger, requiring more processing time.

            training data                         testing data
            raw counting & parameter estimating   raw counting
tokens      110 million                           40,000
unigram     1.5 MB                                35 KB
bigram      22 MB                                 141 KB
trigram     49 MB                                 234 KB

Table 5.4: Space cost in raw counting

Table 5.4 shows the size of the raw counts file for each ngram order. We can see the size increasing rapidly, implying that the size of the table will also increase sharply. Notice that in Table 5.4 the size is given per ngram order, but in our table we need to store all the orders from unigram and bigram up to ngram, so the table's size is the total sum, further increasing the number. Figure 5.3 suggests that the training and testing data have a similar upward trend, meaning that the space cost is less dependent on the number of tokens and behaves like a monotonically increasing function of the ngram order.


More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

MapReduce With Columnar Storage

MapReduce With Columnar Storage SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Large-Scale Test Mining

Large-Scale Test Mining Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

PERFORMANCE TIPS FOR BATCH JOBS

PERFORMANCE TIPS FOR BATCH JOBS PERFORMANCE TIPS FOR BATCH JOBS Here is a list of effective ways to improve performance of batch jobs. This is probably the most common performance lapse I see. The point is to avoid looping through millions

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce

More information

Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project

Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project Paul Bone pbone@csse.unimelb.edu.au June 2008 Contents 1 Introduction 1 2 Method 2 2.1 Hadoop and Python.........................

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Getting to know Apache Hadoop

Getting to know Apache Hadoop Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Bigtable is a proven design Underpins 100+ Google services:

Bigtable is a proven design Underpins 100+ Google services: Mastering Massive Data Volumes with Hypertable Doug Judd Talk Outline Overview Architecture Performance Evaluation Case Studies Hypertable Overview Massively Scalable Database Modeled after Google s Bigtable

More information

ibigtable: Practical Data Integrity for BigTable in Public Cloud

ibigtable: Practical Data Integrity for BigTable in Public Cloud ibigtable: Practical Data Integrity for BigTable in Public Cloud Wei Wei North Carolina State University 890 Oval Drive Raleigh, North Carolina, United States wwei5@ncsu.edu Ting Yu North Carolina State

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Map Reduce / Hadoop / HDFS

Map Reduce / Hadoop / HDFS Chapter 3: Map Reduce / Hadoop / HDFS 97 Overview Outline Distributed File Systems (re-visited) Motivation Programming Model Example Applications Big Data in Apache Hadoop HDFS in Hadoop YARN 98 Overview

More information

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team Lecture Summary In this lecture, we learned about the ADT Priority Queue. A

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Hadoop Design and k-means Clustering

Hadoop Design and k-means Clustering Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk

Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless

More information

Zihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical

More information

HADOOP PERFORMANCE TUNING

HADOOP PERFORMANCE TUNING PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The

More information

Study and Comparison of Elastic Cloud Databases : Myth or Reality?

Study and Comparison of Elastic Cloud Databases : Myth or Reality? Université Catholique de Louvain Ecole Polytechnique de Louvain Computer Engineering Department Study and Comparison of Elastic Cloud Databases : Myth or Reality? Promoters: Peter Van Roy Sabri Skhiri

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 3. Apache Hadoop Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Apache Hadoop Open-source

More information

Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing. Basic Steps in Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24 Data Federation Administration Tool Guide Content 1 What's new in the.... 5 2 Introduction to administration

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Media Upload and Sharing Website using HBASE

Media Upload and Sharing Website using HBASE A-PDF Merger DEMO : Purchase from www.a-pdf.com to remove the watermark Media Upload and Sharing Website using HBASE Tushar Mahajan Santosh Mukherjee Shubham Mathur Agenda Motivation for the project Introduction

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

Symbol Tables. Introduction

Symbol Tables. Introduction Symbol Tables Introduction A compiler needs to collect and use information about the names appearing in the source program. This information is entered into a data structure called a symbol table. The

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Big Data Storage, Management and challenges. Ahmed Ali-Eldin Big Data Storage, Management and challenges Ahmed Ali-Eldin (Ambitious) Plan What is Big Data? And Why talk about Big Data? How to store Big Data? BigTables (Google) Dynamo (Amazon) How to process Big

More information

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the

More information

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Comparing SQL and NOSQL databases

Comparing SQL and NOSQL databases COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations

More information

Distributed storage for structured data

Distributed storage for structured data Distributed storage for structured data Dennis Kafura CS5204 Operating Systems 1 Overview Goals scalability petabytes of data thousands of machines applicability to Google applications Google Analytics

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information