Managed N-gram Language Model Based on Hadoop Framework and Hbase Tables

Tahani Mahmoud Allam, Assistant Lecturer, Computer and Automatic Control Dept., Faculty of Engineering, Tanta University, Tanta, Egypt, tahany@f-eng.tanta.edu.eg
Hatem M. Abdullkader, Information System Dept., Faculty of Computers and Information, Menofia University, Egypt, Hatem683@yahoo.com
Alsayed Abdelhameed Sallam, Computer and Automatic Control Dept., Faculty of Engineering, Tanta University, Tanta, Egypt, Sallam@f-eng.tanta.edu.eg

Abstract
N-grams are a building block in natural language processing and information retrieval. An N-gram is a sequence of contiguous words or other tokens in text documents. In this work, we study how N-grams can be computed efficiently using MapReduce for distributed data processing and the distributed database Hbase. This technique is applied to construct the training and testing processes using the Hadoop MapReduce framework and Hbase. We focus on the time cost and storage size of the model and explore different structures of Hbase tables. By constructing and comparing different table structures while training unigram, bigram and trigram models on 1 million words, we suggest that a table based on the half n-gram structure is the more suitable choice for a distributed language model. The results of this work can be applied in cloud computing and other large-scale distributed language processing areas.

Keywords
Natural Language Processing, N-gram model, MapReduce, Hadoop framework, Hbase tables.

I. INTRODUCTION

The N-gram language model is widely used in natural language processing (NLP) fields such as machine translation and speech recognition. An N-gram model is trained on large sets of unlabelled text to estimate a probability model; generally, the more training data, the more robust the model. Although huge training texts can be obtained from corpora or the web, the storage space and computational power of a single computer are both limited.
In distributed NLP, most previous work has concentrated on methods for model training and testing, with little discussion of the storage structure. The efficient construction and storage of the model in a distributed cluster environment is the main objective of this work. A further interest is to examine the capability of combining different database table structures with language models.

Organization. Section II reviews the architecture and concepts of the N-gram model. Section III introduces the Hadoop MapReduce framework and the Hbase distributed database. Section IV gives the details of our methods, illustrating all the steps of the MapReduce training and testing tasks as well as the different table structures we propose. Section V describes the evaluation method and the application we use. Section VI presents the experiments and results. Finally, Section VII concludes.

II. N-GRAM

The N-gram model is the leading method for statistical language processing [1]. The model tries to predict the next word w_n given the n-1 word context w_1, w_2, ..., w_{n-1} by estimating the probability function:

P(w_n | w_1, ..., w_{n-1}).    (1)

A Markov assumption is applied: only the local prior context affects the next word [2]. The probability function can be expressed through the frequency of word occurrences in a corpus using Maximum Likelihood Estimation without smoothing:

P(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1}),    (2)

where f(w_1, ..., w_n) is the number of times the sequence w_1, ..., w_n is seen in the corpus.

One important aspect is count smoothing, which adjusts the empirical counts collected from the training texts to the expected counts for N-grams. If an N-gram does not appear in the training set but is seen in the testing text, its probability would be zero according to the maximum likelihood estimation, so we need a better estimation of the expected counts.
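Equation (2) can be checked on a toy corpus with a few lines of Python; the following is only an illustrative sketch (the function names are our own, not from any toolkit):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_prob(tokens, context, word):
    """Unsmoothed MLE: P(word | context) = f(context + word) / f(context)."""
    if len(context) == 0:
        # Unigram case: relative frequency over all tokens.
        return Counter(ngrams(tokens, 1))[(word,)] / len(tokens)
    n = len(context) + 1
    full = Counter(ngrams(tokens, n))       # counts f(w_1, ..., w_n)
    hist = Counter(ngrams(tokens, n - 1))   # counts f(w_1, ..., w_{n-1})
    if hist[tuple(context)] == 0:
        return 0.0
    return full[tuple(context) + (word,)] / hist[tuple(context)]
```

On the sentence "the cat sat on the mat", P(cat | the) = 1/2, since <the> occurs twice and <the cat> once.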
Good-Turing smoothing is based on the count of n-gram counts and is used when the probability of a frequency is zero. The expected counts are adjusted with the formula:

c* = (c + 1) N_{c+1} / N_c,    (3)

where c is the actual count and N_c is the number of N-grams that occur exactly c times. For words that are seen frequently, c can be large and N_{c+1} is likely to be zero; in this case we can look up N_{c+2}, N_{c+3}, and so on until a nonzero count is found, and use that value instead. If a gram does not appear in the training set, we estimate the probability from the (n-1)-gram instead:
P(w_n | w_1, ..., w_{n-1}) = bow(w_1, ..., w_{n-1}) P(w_n | w_2, ..., w_{n-1}),    (4)

where bow(w_1, ..., w_{n-1}) is the back-off weight. Modern back-off smoothing techniques such as Kneser-Ney smoothing [2] use more parameters to estimate each n-gram probability instead of a simple Maximum Likelihood Estimation.

To evaluate whether a language model is good, we compute the cross entropy and perplexity on the testing n-grams. A good language model should give a higher probability to its N-gram predictions, and thus a lower cross entropy:

H = -(1/N) Σ_{i=1}^{N} log_2 P(w_i | w_1, ..., w_{i-1}).    (5)

Perplexity is defined based on the cross entropy:

PP = 2^H.    (6)

III. HADOOP MAPREDUCE FRAMEWORK AND HBASE

Hadoop is an open source framework for coding and running distributed applications that process massive amounts of data [3]. MapReduce is the programming model that supports the framework [4]. When a file is loaded, it is divided into multiple data blocks of the same size, typically 64 MB, named FileSplits, and to guarantee fault tolerance each block is replicated three times [5]. A general MapReduce architecture is illustrated in Figure 1.

Fig. 1 MapReduce Architecture.

For input text files, each line is parsed as one string, which is the value. For output files, the format is one key/value pair per record, so if we want to reprocess the output files, the task works at the record level. For language model training using MapReduce, the original inputs are text files, and what we finally get from Hadoop are ngram/probability pairs [1].

Hbase is the Hadoop database, designed to provide random, realtime read/write access to very large tables with billions of rows and millions of columns. Hbase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable [6]. To store the N-gram probabilities in a database table, we can make use of the Hbase table structure instead of parsing flat files: the data is indexed and compressed, reducing the storage size, and tables can easily be created, modified, updated or deleted. Hbase stores data in labeled tables.
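To make the labeled-table model concrete, the following Python sketch mimics how an Hbase table addresses a cell by row key and a <family>:<label> column name. It is a toy in-memory stand-in, not the real HBase client API, and the 'gt' family name is our own example:

```python
class ToyHbaseTable:
    """In-memory stand-in for a labeled Hbase table (illustrative only).
    Rows are addressed by a row key, and each cell by a '<family>:<label>'
    column name, where families are fixed and labels are arbitrary."""

    def __init__(self, families):
        self.families = set(families)   # fixed by "administrative" setup
        self.rows = {}

    def put(self, row_key, column, value):
        family, _, _label = column.partition(":")
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # Returns None for a missing row or column, like an empty cell.
        return self.rows.get(row_key, {}).get(column)

# Storing one bigram probability under the example 'gt' family:
table = ToyHbaseTable(["gt"])
table.put("there is", "gt:prob", 0.76)
```

A real deployment would go through the HBase client API; the toy class only mirrors the addressing scheme described above.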
A conceptual view of an Hbase table is illustrated below. The column name is a string of the <family>:<label> form, where <family> is a column family name assigned to a group of columns and <label> can be any string. The idea behind column families is that only administrative operations can modify the family names, while users can create arbitrary labels on demand [1]. The relationship between Hadoop, Hbase and HDFS is illustrated in Figure 2.

Fig. 2 The relationship between Hadoop, Hbase and HDFS.

IV. EVALUATION OF THE N-GRAM MODEL BASED ON MAPREDUCE

The distributed training process explained in Google's work [7] is split into three phases: converting words to ids, extracting the n-grams of each sentence, and computing the n-gram probabilities. We propose an additional step that calculates the count of n-gram counts for the Good-Turing smoothing estimation, so there are four steps in the training process, as shown in Figure 3. The testing process also uses MapReduce, acting like a distributed decoder, so multiple testing texts can be processed together.

The target is to compare the time and space cost of different table structures, in order to determine which type of table structure is more robust than the others. Because we use different table structures, we partition the training process into two phases: the first generates the word counts and collects the Good-Turing smoothing parameters, and the second generates the Hbase table. The first phase is the same for all the Hbase tables, so we focus only on the second phase for comparison.
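The first phase, n-gram counting with map, combine and reduce, can be imitated in a few lines of plain Python. This is only a single-process simulation of the data flow, not Hadoop code:

```python
from collections import defaultdict
from itertools import chain

def map_line(line, max_order=3):
    """Map step: emit (ngram, 1) for every n-gram up to max_order in one line."""
    toks = line.split()
    for n in range(1, max_order + 1):
        for i in range(len(toks) - n + 1):
            yield " ".join(toks[i:i + n]), 1

def shuffle(pairs):
    """Shuffle step: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    """Reduce step (same logic as the combiner): sum the counts per n-gram."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["there is a car", "there is a house"]
counts = reduce_counts(shuffle(chain.from_iterable(map_line(l) for l in lines)))
```

On these two sample lines, the reduce step emits ("there is a", 2) among other pairs, matching the word-count step described below.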
Fig. 3 The flow chart of the training process.

A. Part one:

1) Generate word count

In this step we find the total count of each n-gram. The input of the map function is one text line; the key is the doc-id and the value is the text. Each line is split into all of its unigrams, bigrams, trigrams and so on up to n-grams [1]. A combiner function then sums up all the values for the same key within the map tasks, and a reduce function, identical to the combiner, collects the combiner outputs and sums the values for the same key. The final key is the same as the map function's output, the n-gram, and the value is the raw count of this n-gram throughout the training data set. Figure 4 shows the map-combine-reduce process for some sample text.

Fig. 4 Generate word count.

2) Generate count of n-gram counts

We need to collect the count of counts for unigrams, bigrams, trigrams and so on up to n-grams in order to calculate the Good-Turing smoothed counts. The map function emits one count of the word count along with the n-gram order. The combine and reduce functions merge all the counts of counts with the same key; they are similar to those of the previous step. A single file is enough to store all the counts of counts because the final output is small. The MapReduce job is illustrated in Figure 5.

Fig. 5 Generate count of n-gram counts.

3) Generate Good-Turing Smoothing count

In this step we calculate the smoothed count, which is needed only for the words that were never seen before and for the words seen exactly once. We use Equation 3 to compute the smoothed count. The inputs of the map function are a word count and a count of n-gram counts; the count of counts is stored in a hash file. Figure 6 shows the flow chart of this step.

Fig. 6 Generate Good-Turing Smoothing Count.

4) Generate n-gram probability

To estimate the probability of an n-gram P(w_n | w_1, ..., w_{n-1}), we need the counts of w_1, ..., w_{n-1} and of w_1, ..., w_n.
That is, to know the probability of a word we must know the count of the context words that appear before it. The history of the words is needed only for the higher n-gram orders, from bigram up to n-gram; zero-frequency words and unigrams need only the Good-Turing smoothed counts to evaluate the probability. The reduce function receives both the counts of the context and the counts with the current word, and computes the conditional probability for the n-gram based on Equation 2. After Good-Turing smoothing some counts might be quite small, so the probability might exceed 1.0; in this case we need to adjust it back below 1.0 by using a back-off model [8]. At the end of this step we have all the probabilities for the n-grams, so the next important step is to store these probabilities in the Hbase table.

B. Part two: Generate Hbase tables

Hbase can be used as the data input/output sink in Hadoop MapReduce jobs [9]. There are essential modifications
needed in the Hbase tables. Because writing to an Hbase table is row based, each time we need to generate a row key with some context as the column [1]. There are several structural choices when building the tables. One structure is a simple scheme with one n-gram per row; others structure the rows based on the current word or the context. Two major points are considered: the write/query speed and the table storage size [11].

A) N-gram based structure

Each n-gram is stored in a separate row, so the table has a flat structure with one single column. The key is the n-gram itself, and the column stores its probability. Table 1 is an example of this structure. It is easy to implement and maintain, but the table cannot represent repeated probabilities in an efficient way.

TABLE 1 N-GRAM BASED TABLE STRUCTURE

Key          | Column family (gt:prob)
There        | .43
There is     | .76
There is a   | .33
car          | .11

B) Current word based structure

All the n-grams that share the same current word are stored in one row, keyed by that word. This table reduces the number of rows and expands all the contexts into separate columns. From the view of the distributed database the data is stored sparsely, but as a data structure it is still uncompressed. We also collect the probability of the unigram for each current word and store it in a separate column. Table 2 shows this structure.

TABLE 2 CURRENT WORD BASED STRUCTURE

key     | Column family : label
        | gt:unigram   gt:there   gt:there is   gt:there is a   gt:car
there   | .13
is      | .5           .66        .3
a       | .15          .76        .3            .76

C) Context word based structure

Similar to the current word based structure, we can also use the context as the key of each row, and store all the possible following words in separate columns with the format <column family:word>. There are more rows in this structure than in the previous one. To avoid redundancy, we store only the unigrams with their own probabilities in <gt:unigram>.
There may be many columns that appear only once and have the same value. For higher order n-grams, a possible compression is to combine these columns together, reducing the column splits. Table 3 shows an example of this structure.

TABLE 3 CONTEXT WORD BASED STRUCTURE

key        | Column family : label
           | gt:unigram   gt:there   gt:is   gt:a   gt:car
there      | .13          .64        .22
There is   | .76          .37
a          | .15          .53        .76
a car      | .65

D) Half n-gram based structure

We can aggregate the word based and context based structures together, balancing the number of rows and columns. Our method is to partition the n-gram into two halves of n/2 grams, making the first n/2-gram the row key and the remaining n/2-gram the column label. This structure removes many rows and inserts them as columns. Table 4 shows this structure.

TABLE 4 HALF N-GRAM BASED STRUCTURE

key        | Column family : label
           | gt:unigram   gt:there   gt:is   gt:a   gt:white car   gt:red car
there      | .13          .64
There is   | .76          .37       .87
a          | .15          .53       .76     .65
a car      | .11          .65

V. EVALUATION METHOD OF APPLICATION

Our major interest includes two parts: first, how to carry out the evaluation of the training process, and second, how to create the Hbase tables. We use the Hadoop framework to process 1 million words from the British National Corpus. The first part of the training process generates the word counts and the count of n-gram counts, and collects the Good-Turing smoothing parameters; its results are discussed in the next section. The second part generates the different Hbase tables that store the n-gram probabilities, and compares the different tables. The comparison of time cost is based on the average program running time, taken from multiple runs to avoid deviations. The next section explains the experimental results.

The corpus data is formatted as one sentence per line. For completeness we include all the punctuation in the sentence as part of the n-grams, e.g. <the house's door> is parsed as <the> <house> <'s> <door>.
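The Good-Turing parameter collection used in the first phase (Equation 3) can be sketched in a few lines; this is a single-machine illustration with a simple fallback to the raw count when N_{c+1} is zero:

```python
from collections import Counter

def good_turing(ngram_counts):
    """Adjusted counts c* = (c + 1) * N_{c+1} / N_c (Equation 3), where
    N_c is the number of distinct n-grams seen exactly c times.  When
    N_{c+1} is zero we simply keep the raw count here; a fuller system
    would search N_{c+2}, N_{c+3}, ... for a nonzero value."""
    count_of_counts = Counter(ngram_counts.values())
    adjusted = {}
    for gram, c in ngram_counts.items():
        if count_of_counts[c + 1] > 0:
            adjusted[gram] = (c + 1) * count_of_counts[c + 1] / count_of_counts[c]
        else:
            adjusted[gram] = float(c)
    return adjusted
```

For example, with three n-grams seen once and one seen twice, each singleton's count is adjusted to (1 + 1) N_2 / N_1 = 2/3.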
Each table structure is compared at the same n-gram order; different n-gram orders are also compared for each table, to see the relationship between the order and the time cost in each method.
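The four row/column layouts of Section IV can be summarized as key-mapping functions. The following Python sketch shows, under our reading of Tables 1 to 4, where each n-gram would be stored; the 'gt' family name and exact label formats are illustrative assumptions:

```python
def ngram_key(ngram):
    """A) n-gram based: the whole n-gram is the row key, one fixed column."""
    return ngram, "gt:prob"

def current_word_key(ngram):
    """B) current word based: the last word is the row key and the context
    becomes the column label; unigrams go to the 'gt:unigram' column."""
    w = ngram.split()
    return (w[0], "gt:unigram") if len(w) == 1 else (w[-1], "gt:" + " ".join(w[:-1]))

def context_key(ngram):
    """C) context based: the context is the row key and the current word
    is the column label."""
    w = ngram.split()
    return (w[0], "gt:unigram") if len(w) == 1 else (" ".join(w[:-1]), "gt:" + w[-1])

def half_ngram_key(ngram):
    """D) half n-gram based: the first half of the n-gram is the row key
    and the second half is the column label."""
    w = ngram.split()
    if len(w) == 1:
        return w[0], "gt:unigram"
    half = (len(w) + 1) // 2
    return " ".join(w[:half]), "gt:" + " ".join(w[half:])
```

Scheme D caps both the row-key length and the column-label length at about n/2 words, which is how it balances the number of rows against the number of columns.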
VI. EXPERIMENT AND RESULTS

The experiments are done in a single node Hadoop cluster environment; the working node runs Hadoop, HDFS and Hbase. We repeat each experiment three times and take the average value as the result. The results are shown as figures and tables for the different n-gram orders of all the table structures, covering both the time and the space cost. The data is taken from the British National Corpus, around 15 million words; the training data file is around 75 MB, containing both the words and the punctuation. We choose to train up to a 4-gram model. The number of unique n-grams for each order of the training data set is given in Table 5.

TABLE 5 NUMBERS OF NGRAMS FOR DIFFERENT ORDERS

We can see that when the token data is relatively small the time cost is nearly the same, but when the number of tokens becomes large, the difference between the unigram, bigram and trigram models increases greatly. Another observed effect is that as the n-gram order increases, the input records become larger, requiring more processing time. Figure 6 shows the time cost of word counting and Good-Turing parameter estimation for the different n-gram orders.

Fig. 6 Time cost of word count and parameter estimation.

Figure 5 plots the number of unique n-grams in the training data. The curve rises sharply: as the n-gram order increases, the number of distinct tokens increases, which means more variety.

Fig. 5 N-gram unique numbers.

We now evaluate the first part, word counting and Good-Turing smoothing parameter estimation, for each table structure. Table 6 shows the time and space costs of part one. All the word counting outputs are stored in HDFS as compressed binary files.
TABLE 6 TIME COST IN WORD COUNT AND PARAMETER ESTIMATION

Table 6 also shows that the size of the word counts file increases rapidly with the n-gram order, implying that the size of the table will also increase sharply. Figure 7 shows the space cost for the 15 million word token set.

Fig. 7 Space size of word count and parameter estimation.

In the second part, we create Hbase tables in the four different structures and compare their time and space cost at different n-gram orders. We generate tables with the unigram, bigram, trigram and 4-gram models for each of the four table structures. Tables 7 and 8 show the time and space cost of the training process for each order when generating the tables.

TABLE 7 TIME COST IN DIFFERENT TABLE STRUCTURES
TABLE 8 SPACE COST IN DIFFERENT TABLE STRUCTURES

We calculate the size of each table structure from the size of its table directory in HDFS. Because of errors caused by network latency, transmission failures, and I/O delays, the time cost may vary within an error range of 1 to 1.5 minutes. Figures 8 and 9 compare the time and space cost for the four table types. All the tables are constructed with the same compression option in Hbase.

In Figure 8, the unigram models of all four tables behave similarly, because the structures are identical. For the bigram models the training time stays almost the same, but for the trigram and 4-gram models the time cost increases as the data grows.

Fig. 8 Time cost in different tables.

In Figure 9, the space cost keeps decreasing through type 3, while type 2 slightly increases. For the 4-gram model the space cost decreases for type 3 and is almost the same for the other types. Types 2 and 4 both reduce the time cost; type 2 slightly increases the space cost, while type 3 reduces it.

Fig. 9 Space cost in different tables.

VII. CONCLUSION

This study focuses on the management of natural language models based on the Hadoop framework. From our results, a good choice for a distributed language model using Hadoop and Hbase is the half n-gram based table structure. The n-gram based structure is too simple to take advantage of the distributed features when considering the time and space cost, and as the n-gram order increases its time and space cost also increase. We find that the context based structure is better than the current word based structure, because extra rows are less expensive than extra columns: fewer columns mean fewer column splits, which reduces the I/O requests. Overall, the half n-gram based structure is the best choice compared with the context and word based structures.
If we expand the cluster to more machines, the differences in time cost will become smaller, and a good model can be built using a table structure that holds more data. As future work, the experiments can be expanded to a cluster with more machines, which increases the computational power and reduces the time cost.

REFERENCES

[1] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, "Large language models in machine translation," in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858-867, 2007.
[2] R. Kneser and H. Ney, "Improved backing-off for m-gram language modeling," in 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), vol. 1, pages 181-184, 9-12 May 1995.
[3] C. Lam, Hadoop in Action, Manning Publications Co., 2011, pages 3-37.
[4] K. Lee, H. Choi, and B. Moon, "Parallel Data Processing with MapReduce: A Survey," SIGMOD Record, Vol. 40, No. 4, December 2011.
[5] T. M. Allam, H. M. Abdullkader, and A. A. Sallam, "Cloud Data Management based on Hadoop Framework," in The International Conference on Computing Technology and Information Management (SDIWC), Dubai, April 2014, pp. 455-460.
[6] Apache HBase project: http://hbase.apache.org
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," in OSDI '06, pages 205-218, 2006.
[8] A. Emami, K. Papineni, and J. Sorensen, "Large-scale distributed language modeling," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), vol. 4, pages IV-37 to IV-40, 15-20 April 2007.
[9] Y. Zhang, A. S. Hildebrand, and S. Vogel, "Distributed language modeling for n-best list re-ranking," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 216-223, Sydney, Australia, July 2006.
[10] T. Harter, D. Borthakur, S. Dong, A. Aiyer, L.
Tang, "Analysis of HDFS Under HBase: A Facebook Messages Case Study," University of Wisconsin, Madison, 2012.
[11] X. Yu, "Estimating language models using Hadoop and Hbase," MSc thesis, University of Edinburgh, 2008.