Managed N-gram Language Model Based on Hadoop Framework and HBase Tables



Tahani Mahmoud Allam, Assistant Lecturer, Computer and Automatic Control Dept., Faculty of Engineering, Tanta University, Tanta, Egypt, tahany@f-eng.tanta.edu.eg
Hatem M. Abdullkader, Information System Department, Faculty of Computers and Information, Menofia University, Egypt, Hatem683@yahoo.com
Alsayed Abdelhameed Sallam, Computer and Automatic Control Dept., Faculty of Engineering, Tanta University, Tanta, Egypt, Sallam@f-eng.tanta.edu.eg

Abstract: N-grams are a building block in natural language processing and information retrieval. An N-gram is a sequence of string data, such as contiguous words or other tokens in text documents. In this work, we study how N-grams can be computed efficiently using MapReduce for distributed data processing together with a distributed database named HBase. This technique is applied to construct the training and testing processes using the Hadoop MapReduce framework and HBase. We focus on the time cost and storage size of the model while exploring different HBase table structures. By constructing and comparing different table structures on training 1 million words for unigram, bigram and trigram models, we suggest that a table based on the half n-gram structure is the more suitable choice for a distributed language model. The results of this work can be applied in cloud computing and other large-scale distributed language processing areas.

Keywords: Natural Language Processing, N-gram model, MapReduce, Hadoop framework, HBase tables.

I. INTRODUCTION

The N-gram language model is widely used in natural language processing (NLP) fields such as machine translation and speech recognition. An N-gram model is trained on large sets of unlabelled text to estimate a probability model; generally, the more data, the more robust the model. Although huge training texts can be obtained from corpora or the web, the storage space and computational power of a single computer are both limited.
In distributed NLP, most previous work concentrated on establishing methods for model training and testing, with no discussion of the storage structure. The efficient construction and storage of the model in a distributed cluster environment is the main objective of this work. A further interest is to examine the capability of constructing different database table structures for language models.

Organization. Section II reviews the architecture and concepts of the N-gram model. Section III introduces the Hadoop MapReduce framework and the HBase distributed database. Section IV gives the details of our methods, illustrating all the steps in the MapReduce training and testing tasks as well as the different table structures we propose. Section V describes the evaluation method and the application we use. Section VI presents the experiments and results. Finally, Section VII concludes.

II. N-GRAM

The N-gram model is the leading method for statistical language processing [1]. The model tries to predict the next word w_n given the n-1 context words w_1, w_2, ..., w_{n-1} by estimating the probability function:

P(w_n | w_1, ..., w_{n-1}).    (1)

A Markov assumption is applied: only the local prior context affects the next word [2]. The probability function can be expressed through the frequency of word occurrences in a corpus using Maximum Likelihood Estimation without smoothing:

P(w_n | w_1, ..., w_{n-1}) = f(w_1, ..., w_n) / f(w_1, ..., w_{n-1}),    (2)

where f(w_1, ..., w_n) is the count of how many times the sequence w_1, ..., w_n is seen in the corpus. To adjust the empirical counts collected from the training texts to the expected counts for N-grams, one important technique is count smoothing. If an N-gram does not appear in the training set but is seen in the testing text, its probability would be zero according to the Maximum Likelihood Estimation, so we need a better estimate of the expected counts.
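As a concrete, non-distributed illustration of the Maximum Likelihood Estimation in Eq. (2), the following sketch counts n-grams and their contexts in a toy corpus. The function name and example corpus are ours, not part of the system described in the paper:

```python
from collections import Counter

def mle_ngram_prob(tokens, context, word):
    """P(word | context) = f(context + word) / f(context), per Eq. (2)."""
    n = len(context) + 1
    # Count all n-grams and all (n-1)-gram contexts in the token stream.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    num = ngrams[tuple(context) + (word,)]
    den = contexts[tuple(context)]
    return num / den if den else 0.0

corpus = "there is a car there is a house there is a car".split()
print(mle_ngram_prob(corpus, ("is", "a"), "car"))  # 2 of 3 "is a" contexts -> 0.666...
```

Note that this unsmoothed estimate returns 0.0 for any n-gram absent from the corpus, which is exactly the zero-probability problem that motivates the count smoothing discussed above.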
Good-Turing smoothing is based on the count of N-gram counts and is used when a frequency would otherwise receive zero probability. The expected counts are adjusted with the formula:

c* = (c + 1) * N_{c+1} / N_c,    (3)

where c is the actual count and N_c is the number of N-grams that occur exactly c times. For words that are seen frequently, c can be large and N_{c+1} is likely to be zero; in this case derivations can be made to look up N_{c+2}, N_{c+5}, and so on, until a nonzero count is found and used instead. If a gram does not appear in the training set, we estimate the probability from the (n-1)-gram instead:

P(w_n | w_1, ..., w_{n-1}) = β(w_1, ..., w_{n-1}) · P(w_n | w_2, ..., w_{n-1}),    (4)

where β(w_1, ..., w_{n-1}) is the back-off weight. Modern back-off smoothing techniques such as Kneser-Ney smoothing [2] use more parameters to estimate each n-gram probability instead of a simple Maximum Likelihood Estimation. To evaluate whether a language model is good, we compute the cross entropy and perplexity on the testing n-grams. A good language model should give a higher probability to the n-grams it predicts:

H = -(1/N) Σ_{i=1}^{N} log_2 P(w_i | w_{i-n+1}, ..., w_{i-1}).    (5)

Perplexity is defined based on cross entropy:

PP = 2^H.    (6)

III. HADOOP MAPREDUCE FRAMEWORK AND HBASE

Hadoop is an open-source framework for coding and running distributed applications that process massive amounts of data [3]. MapReduce is the programming model that underpins the framework [4]. When a file is loaded, it is divided into multiple data blocks of the same size, typically 64 MB, named FileSplits, and to guarantee fault tolerance each block is replicated three times [5]. A general MapReduce architecture is illustrated in Figure 1.

Fig. 1 MapReduce Architecture.

For input text files, each line is parsed as one string, which becomes the value. For output files, the format is one key/value pair per record, so if we want to reprocess the output files, the task works at the record-pair level. For language model training using MapReduce, the original inputs are text files, and what we finally get from Hadoop are ngram/probability pairs [1].

HBase is the Hadoop database, designed to provide random, real-time read/write access to very large tables with billions of rows and millions of columns. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable [6]. To store the N-gram probabilities in a database table, we can make use of the HBase table structure instead of parsing flat files. The data is indexed and compressed, reducing the storage size, and tables can be easily created, modified, updated or deleted. HBase stores data in labeled tables.
A conceptual view of an HBase table is as follows: a column name is a string of the form <family>:<label>, where <family> is a column family name assigned to a group of columns, and the label can be any string. The idea of column families is that only administrative operations can modify family names, while the user can create arbitrary labels on demand [1]. The relationship between Hadoop, HBase and HDFS is illustrated in Figure 2.

Fig. 2 The relationship between Hadoop, HBase and HDFS.

IV. EVALUATION OF THE N-GRAM MODEL BASED ON MAPREDUCE

The distributed training process explained in Google's work [7] is split into three phases: convert words to ids, emit the n-grams of each sentence, and compute the n-gram probabilities. We propose an additional step that calculates the count of n-gram counts needed by the Good-Turing smoothing estimation, so there are four steps in the training process, shown in Figure 3. The testing process also uses MapReduce, acting like a distributed decoder, so we can process multiple testing texts together. The target is to compare time and space across different table structures, identifying which type of table structure is more robust than the others. Because we use different table structures, we partition the training process into two phases: the first generates word counts and collects the Good-Turing smoothing parameters; the second generates the HBase table. The first phase is the same for all the HBase table variants, so we focus on the second phase for the comparison.
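To make the first phase concrete, here is a minimal single-process sketch of the word-counting step, standing in for the Hadoop map, combine and reduce functions; the function names and toy input lines are ours, not the paper's actual job code:

```python
from collections import Counter
from itertools import chain

def map_ngrams(line, max_order=3):
    """Map step: emit (ngram, 1) for every 1..max_order-gram in one text line."""
    toks = line.split()
    for n in range(1, max_order + 1):
        for i in range(len(toks) - n + 1):
            yield " ".join(toks[i:i + n]), 1

def combine_reduce(pairs):
    """Combiner/reducer: sum the counts of identical n-gram keys."""
    counts = Counter()
    for ngram, count in pairs:
        counts[ngram] += count
    return counts

lines = ["there is a car", "there is a house"]
counts = combine_reduce(chain.from_iterable(map_ngrams(l) for l in lines))
print(counts["there"], counts["there is a"], counts["is a car"])  # 2 2 1
```

In the real Hadoop job the combiner runs per map task and the reducer merges across tasks, but both apply the same summation, which is why the paper describes the reduce function as identical to the combiner.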

Fig. 3 The flow chart of the training process.

A. Part one:

1) Generate word count
In this step we find the total count of each n-gram in the text. The input of the map function is one text line: the key is the doc-id and the value is the text. Each line is split into all its unigrams, bigrams and trigrams, up to n-grams [1]. A combiner function then sums up all the values for the same key within each Map task, and a reduce function, identical to the combiner, collects all the combiner output and sums the values for the same key. The final key is the same as the map function's output, the n-gram, and the value is the raw count of this n-gram throughout the training data set. Figure 4 shows the map-combine-reduce process on some sample text.

Fig. 4 Generate word count.

2) Generate count of n-gram counts
We need to collect the counts of counts for unigrams, bigrams, trigrams, up to n-grams, in order to calculate the Good-Turing smoothed counts. In the map function we emit one count of the word count along with the n-gram order. The combine and reduce functions merge all the counts of counts with the same key; they are similar to those of the previous step. A single file is enough to store all the counts of counts because the final output is small. The MapReduce job is illustrated in Figure 5.

Fig. 5 Generate count of n-gram counts.

3) Generate Good-Turing Smoothing count
In this step we calculate the smoothed count, which is needed only for words that were never seen before and for words with a frequency of one. We use Equation 3 to compute the smoothed count. The inputs of the map function are a word count and a count of n-gram counts; the count of counts is stored in a hash file. Figure 6 shows the flow chart of this step.

Fig. 6 Generate Good-Turing Smoothing Count.

4) Generate n-gram probability
To estimate the probability of one n-gram w_1, ..., w_n, we need the counts of w_1, ..., w_{n-1} and of w_1, ..., w_n.
That is, to know the probability of a word we must know the count of the preceding words that appear with it. The history is needed only for the highest n-gram order, from bigrams up to n-grams; zero-frequency words and unigrams need only the Good-Turing smoothing to evaluate the probability. The reduce function can now receive both the counts of the context and the counts with the current word, and computes the conditional probability of the n-gram according to Equation 2. After Good-Turing smoothing some counts might be quite small, so a probability might come out above 1.0; in this case we need to adjust it down to 1.0 by using a back-off model [8]. At the end of this step we have all the probabilities of the n-grams, so the next important step is to store these probabilities in the HBase table.

B. Part two: Generate HBase tables

HBase can be used as the data input/output sink in Hadoop MapReduce jobs [9]. There are essential modifications

needed in the HBase tables. Because writing an HBase table is row based, each time we need to generate a key with some context as the column [1]. There are several structural choices when building the tables. One structure is the simple scheme of one n-gram per row; other structures key the rows by the current word or by the context. Two major points are considered: the write/query speed and the table storage size [11].

A) N-gram based structure
Each n-gram is stored in a separate row, so the table has a flat structure with one single column: the key is the n-gram itself, and the column stores its probability. Table 1 is an example of this structure. It is easy to implement and maintain, but the table cannot represent identical probabilities in an efficient way.

TABLE 1 N-GRAM BASED TABLE STRUCTURE

Key          Column family (gt:prob)
there        .43
there is     .76
there is a   .33
car          .11

B) Current word based structure
All the n-grams that share the same current word can be stored in one row keyed by that word. This reduces the number of rows and expands all the contexts into separate columns. From the view of a distributed database the data is stored sparsely, but as a data structure it is still uncompressed. We also collect the probability of the unigram for each current word and store it in a separate column. Table 2 shows this structure.

TABLE 2 CURRENT WORD BASED STRUCTURE

key    (column family : label)
       gt:unigram  gt:there  gt:there is  gt:there is a  gt:car
there  .13
is     .5          .66       .3
a      .15         .76       .3           .76

C) Context word based structure
Similar to the current word based structure, we can instead use the context as the key for each row and store all the possible following words in separate columns with the format <column family:word>. This structure has more rows than the previous one. To avoid redundancy, only unigram keys store their own probabilities, in <gt:unigram>.
There may be many columns that appear only once and have the same value; in higher-order n-grams such columns could be compressed together, reducing the column splits. Table 3 shows an example of this structure.

TABLE 3 CONTEXT WORD BASED STRUCTURE

key       (column family : label)
          gt:unigram  gt:there  gt:is  gt:a  gt:car
there     .13         .64       .22
there is  .76         .37
a         .15         .53       .76
a car     .65

D) Half n-gram based structure
We can aggregate the word based and context based structures together, balancing the number of rows and columns. Our method is to partition an n-gram into two n/2-grams, making the first n/2-gram the row key and the remaining n/2-gram the column label. This structure removes many rows and inserts them as columns. Table 4 shows this structure.

TABLE 4 HALF N-GRAM BASED STRUCTURE

key       (column family : label)
          gt:unigram  gt:there  gt:is  gt:a  gt:white car  gt:red car
there     .13         .64
there is  .76         .37       .87
a         .15         .53       .76    .65
a car     .11         .65

V. EVALUATION METHOD OF APPLICATION

Our major interest has two parts: the first is how to carry out the evaluation of the training process; the second is how to create the HBase tables. We use the Hadoop framework to run 1 million words from the British National Corpus. The first part of the training process generates the word counts and the counts of n-gram counts and collects the Good-Turing smoothing parameters; its results are discussed in the next section. The second part generates the different HBase tables that store the n-gram probabilities and compares them. The comparison of time cost is based on the average program running time, taken from multiple runs to avoid deviations. The next section explains the experimental results. The corpus data is formatted as one sentence per line. For completeness we include all the punctuation in the sentence as part of the n-grams; e.g., <the house's door> is parsed as <the> <house> <'s> <door>.
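A small sketch of how an n-gram could be mapped to an HBase (row key, column label) pair under the half n-gram structure. The exact split convention (first half as row key, a gt:unigram column when no second half remains) is our reading of Table 4, and the helper name is ours:

```python
def half_ngram_key(tokens):
    """Split an n-gram into an HBase (row key, column label) pair:
    the first half of the words forms the row key, the rest the column label."""
    half = (len(tokens) + 1) // 2          # first half (rounded up) is the row key
    row_key = " ".join(tokens[:half])
    rest = " ".join(tokens[half:])
    column = "gt:" + rest if rest else "gt:unigram"
    return row_key, column

print(half_ngram_key(["there", "is", "a", "car"]))  # ('there is', 'gt:a car')
print(half_ngram_key(["there"]))                    # ('there', 'gt:unigram')
```

Grouping n-grams this way means one row write can carry many column inserts, which is the balance between row count and column count the structure aims for.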
Each table structure is compared at the same n-gram order; in addition, the different n-gram orders of each table are compared, to see the relationship between the order and the time cost in each method.
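The Good-Turing parameter collection used in the first part above can be illustrated with a minimal sketch of Eq. (3). The fallback behaviour when N_c or N_{c+1} is zero is simplified here to returning None (the paper instead looks up higher counts of counts or backs off):

```python
from collections import Counter

def good_turing(c, count_of_counts):
    """Smoothed count c* = (c + 1) * N_{c+1} / N_c from Eq. (3).
    Returns None when N_c or N_{c+1} is zero (simplified fallback)."""
    n_c = count_of_counts.get(c, 0)
    n_c1 = count_of_counts.get(c + 1, 0)
    if n_c == 0 or n_c1 == 0:
        return None
    return (c + 1) * n_c1 / n_c

# Count-of-counts built from raw n-gram counts [1, 1, 1, 2, 2, 3]:
coc = Counter([1, 1, 1, 2, 2, 3])   # N_1 = 3, N_2 = 2, N_3 = 1
print(good_turing(1, coc))  # 2 * 2 / 3 = 1.333...
print(good_turing(2, coc))  # 3 * 1 / 2 = 1.5
```

The singleton counts (c = 1) are discounted below 1, which frees probability mass for n-grams never seen in training, matching the motivation given in Section II.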

VI. EXPERIMENT AND RESULTS

The experiments are done in a single-node Hadoop cluster environment. The working node runs Hadoop, HDFS and the HBase tables. We repeat each experiment three times and take the average value as the result. The results are shown as figures and tables for the different n-gram orders of all the table structures, covering both the time and the space cost. The data is taken from the British National Corpus, around 15 million words. The training file size is around 75 MB, containing both the words and the punctuation. We choose to train up to a 4-gram model. The number of unique n-grams of each order in the training data set is given in Table 5.

TABLE 5 NUMBERS OF NGRAMS FOR DIFFERENT ORDERS

We can see that when the token data is relatively small the time cost is nearly the same, but when the number of tokens becomes large the difference between the unigram, bigram and trigram models increases considerably. Another observed effect is that as the n-gram order increases, the input records become larger, requiring more processing time. Figure 6 shows the time cost of word counting and Good-Turing parameter estimation for the different n-gram orders.

Fig. 6 Time cost of word count and parameter estimation.

Figure 5 plots the number of unique n-grams in the training data. The figure shows a sharply rising line: as the n-gram order increases, the number of distinct tokens increases, which means more variety.

Fig. 5 N-gram unique numbers.

Now we evaluate the first part, word counting and Good-Turing smoothing parameters, for each table structure. Table 6 shows the time and space costs of part one. All the word-count outputs are stored in HDFS as compressed binary files.
TABLE 6 TIME COST IN WORD COUNT AND PARAMETER ESTIMATION

Table 6 also shows that the size of the word-count file grows rapidly with the n-gram order, implying that the table size will also increase sharply. Figure 7 shows the space size for the 15 million word tokens.

Fig. 7 Space size of word count and parameter estimation.

In the second part, we create the HBase tables in the four different structures and compare the time and space cost at the different n-gram orders. We generate the tables with the unigram, bigram, trigram and 4-gram models for each of the four table structures. Tables 7 and 8 show the time and space cost of the table-generation phase of training for each order.

TABLE 7 TIME COST IN DIFFERENT TABLE STRUCTURES

TABLE 8 SPACE COST IN DIFFERENT TABLE STRUCTURES

We calculate the size of each table structure from the size of its directory in HDFS. Because of network latency, transmission failures, and I/O delays and errors, the time cost may vary within an error range of 1-1.5 minutes. Figures 8 and 9 compare the time and space cost for the four table types; all the tables are built with the same compression option in HBase. In Figure 8 the unigram models of all four tables behave similarly, because their structures are identical. For the bigram models the training time stays almost the same, but for the trigram and 4-gram models the time cost increases because the data increases.

Fig. 8 Time cost in different tables.

In Figure 9, the space cost keeps decreasing through type 3, while type 2 increases slightly. For the 4-gram, the space cost decreases for type 3 and is almost the same for the other types. Types 2 and 4 both reduce the time cost; type 2 slightly increases the space cost, while type 3 reduces it.

VII. CONCLUSION

This study focuses on managing a natural language model on the Hadoop framework. From our results, a good choice for a distributed language model using Hadoop and HBase is the half n-gram based table structure. The n-gram based structure is too simple to take advantage of distributed features when considering the time and space cost; as the n-gram order increases, its time and space cost increase as well. We find that the context based structure is better than the current word based structure: extra rows are less expensive than extra columns, since fewer columns mean fewer column splits and thus fewer I/O requests. Overall, the half n-gram based structure is the best choice compared with the context and current word structures.
If we expand the cluster to more machines, the difference in time cost will become smaller, and a table structure that holds more data can make a better model. As future work, the experiments can be expanded to a cluster with more machines, which increases the computational power and reduces the time cost.

REFERENCES

[1] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, Large language models in machine translation, in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858-867, 2007.
[2] R. Kneser and H. Ney, Improved backing-off for m-gram language modeling, in Acoustics, Speech, and Signal Processing, 1995 International Conference on (ICASSP-95), 1:181-184 vol. 1, 9-12 May 1995.
[3] C. Lam, Hadoop in Action, Manning Publications Co., 2011, pages 3-37.
[4] K. Lee, H. Choi, B. Moon, Parallel Data Processing with MapReduce: A Survey, SIGMOD Record, Vol. 40, No. 4, December 2011.
[5] T. M. Allam, H. M. Abdullkader, A. A. Sallam, Cloud Data Management based on Hadoop Framework, The International Conference on Computing Technology and Information Management (SDIWC), Dubai, April 2014, pp. 455-460.
[6] Apache HBase project: http://hbase.apache.org
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, Bigtable: A distributed storage system for structured data, in OSDI '06, pages 205-218, 2006.
[8] A. Emami, K. Papineni, and J. Sorensen, Large-scale distributed language modeling, in Acoustics, Speech and Signal Processing, IEEE International Conference on (ICASSP 2007), 4:IV-37 to IV-40, 15-20 April 2007.
[9] Y. Zhang, A. S. Hildebrand, and S. Vogel, Distributed language modeling for n-best list re-ranking, in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 216-223, Sydney, Australia, July 2006.
[10] T. Harter, D. Borthakur, S. Dong, A. Aiyer, L.
Tang, Analysis of HDFS Under HBase: A Facebook Messages Case Study, University of Wisconsin, Madison, 2012.
[11] Y. Xiaoyang, Estimating language model using Hadoop and HBase, Master of Artificial Intelligence, University of Edinburgh, 2008.

Fig. 9 Space cost in different tables.