MapReduce Join Strategies for Key-Value Storage


Duong Van Hieu, Sucha Smanchat, and Phayung Meesad
Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Bangkok 10800, Thailand
duongvanhieu@tgu.edu.vn, {suchas@kmutnb.ac.th, sucha.smanchat@acm.org}, pym@kmutnb.ac.th

Abstract: This paper analyses MapReduce join strategies used for big data analysis and mining, known as map-side joins and reduce-side joins. The most used joins are analysed in this paper: the theta-join algorithms, including all pair partition join, repartition join, broadcasting join, semi join, and per-split semi join. This paper can be considered a guideline for MapReduce application developers when selecting join strategies. The analysis of several join strategies for big data analysis and mining is accompanied by comprehensive examples.

Keywords: MapReduce; join strategy; NoSQL

I. INTRODUCTION

With the continuous development of big data and cloud computing, it is believed that traditional database technologies are insufficient for data storage and access, and for meeting performance and flexibility requirements. In the new era of big data, NoSQL databases are more appropriate than relational databases [1]. A Key-Value store, a kind of NoSQL database, is an appropriate choice for applications that use the MapReduce model for distributed processing. Key-Value stores offer only four underlying operations: inserting <key, value> pairs into a data collection, updating the values of existing pairs, finding the values associated with a specific key, and deleting pairs from a data collection [2]. Joining two data collections to produce a new dataset based on joining fields is the responsibility of programmers or application developers rather than of the database management system. However, several join strategies exist, each with different advantages and disadvantages. To give programmers a guideline for selecting join strategies, this study analyses several joining strategies for big data analysis and mining, accompanied by comprehensive examples.
The content of this paper is organised into four main sections. Section 2 gives an overview of the MapReduce programming model, Section 3 explains MapReduce join strategies, and Section 4 concludes with a comparison of the join strategies used in MapReduce.

II. MAPREDUCE OVERVIEW

MapReduce has been used at Google since February 2003. It was first introduced in 2004 by Dean and Ghemawat [3] and was published again in Communications of the ACM in 2008 [4]. It is used for processing large datasets in a parallel or distributed computing environment, and it is a combination of map processes and reduce processes. A map process is a function that processes a set of input <key, value> pairs, which is a portion of a large input dataset, to generate a set of intermediate <key, value> pairs. A reduce process, using a reduce function, merges all intermediate values generated by the map processes that are associated with the same intermediate key, forming a possibly smaller set of <key, value> pairs called the final output <key, value> pairs.

Fig. 1 is a simple word counting example. The input string "Advanced Research Methodology, Advanced Information Modelling and Database, Advanced Network and Information Security, Advanced Database and Distributed Systems" is divided into four blocks, one per subject name, using the commas as separators. A hash function mod(code(upper(left(key,1))),k)+1 is used for distributing intermediate <key, value> pairs to the reduce tasks. Here left(key,1) takes the first letter of key, upper(x) changes x to upper case, code(x) takes the ASCII code of character x, and mod(m, k) returns the remainder after m is divided by k.
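The word counting example can be sketched in a few lines of Python. This is a minimal single-process simulation, not a distributed implementation. The function names (partition, map_block, word_count) are illustrative assumptions, and k = 4 reduce groups is assumed to match the four reduce tasks of Fig. 1.

```python
from collections import defaultdict

TEXT = ("Advanced Research Methodology, "
        "Advanced Information Modelling and Database, "
        "Advanced Network and Information Security, "
        "Advanced Database and Distributed Systems")


def partition(key, k):
    """mod(code(upper(left(key,1))),k)+1 from the text."""
    return ord(key[0].upper()) % k + 1


def map_block(block):
    """Map: emit an intermediate <word, 1> pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def word_count(text, k=4):
    # One block per comma-separated subject name, as in the example.
    blocks = [b.strip() for b in text.split(",")]

    # Distribute the intermediate pairs to reduce groups with the hash function.
    groups = defaultdict(list)
    for block in blocks:
        for key, value in map_block(block):
            groups[partition(key, k)].append((key, value))

    # Reduce: sum the values that share the same intermediate key.
    results = {}
    for group_id in sorted(groups):
        counts = defaultdict(int)
        for key, value in groups[group_id]:
            counts[key] += value
        results[group_id] = dict(counts)
    return results


if __name__ == "__main__":
    for group_id, counts in word_count(TEXT).items():
        print(group_id, counts)
```

With k = 4, the keys "Advanced" and "and" both fall into group 2, because the ASCII code of "A" is 65 and 65 mod 4 + 1 = 2, while "Database" falls into group 1 because 68 mod 4 + 1 = 1.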
Fig. 1. Map and reduce processes of a simple word counting example (four input blocks, four map tasks, and four reduce groups; the final counts include Advanced 4, and 3, Database 2, and Information 2)

III. MAPREDUCE KEY JOIN STRATEGIES

Physically, data in a Key-Value format can be stored in a data structure such as a B-Tree, queue, or hash table [5, 6]. Logically, each record in a Key-Value store is a single entry consisting of a key and a value. For ease of understanding, a set of <key, value> pairs, called a data collection, can be considered as a two-column table. The first column stores the keys, and the second one, which can be a combination of several columns, stores the values associated with the keys.

Joins using MapReduce can be categorized as map-side join, reduce-side join, memory-backed join, join using Bloom Filter, and map-reduce merge [7]. However, this paper follows the categories proposed by Tom White [8, 9], which group joins into two types: map-side joins and reduce-side joins. Map-side joins are joins performed by mappers, used to join two large input datasets before the data is fed to the map functions. Reduce-side joins are joins performed by reducers. They are more general than map-side joins because the inputs do not need to be structured in any particular way [9]. In some cases, reduce-side joins are less efficient than map-side joins because the datasets go through the MapReduce shuffle.

Reduce-side joining involves several components: multiple inputs and secondary sorting [8]. Multiple inputs means that inputs from different sources can have different formats or presentations. To deal with this situation, the inputs need to be parsed separately. Hadoop provides this parsing on a per-path basis [10]. Secondary sorting is needed when reducers obtain inputs from two sources, each of which can be sorted in a different order. For example, when the first dataset comes from source A sorted by key1 and the second dataset comes from source B sorted by key2, the merged data should be sorted by a composite key (key1, key2) before reducing.

A. Theta Joins

Theta join is a kind of join that uses comparison operators such as <, <=, >, >=, =, <> in the join predicates. Among these, equi-join is the most used for joining two datasets to obtain the intersection between them. Fig. 2 is an example of equi-join. This join matches every record from table L to every record from table R that has the same value of the join field. The results of joining can be projected to eliminate redundant fields and produce only the required fields.
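The definition of a theta join can be illustrated with a small nested-loop sketch in Python. The function names (theta_join, project) and the sample records are hypothetical and are not taken from Fig. 2. The comparison operator is passed in, so the same function covers the equi-join (=) as well as the other predicates.

```python
import operator


def theta_join(left, right, field, op=operator.eq):
    """Nested-loop theta join: keep each (l, r) pair whose field values satisfy op."""
    return [(l, r) for l in left for r in right if op(l[field], r[field])]


def project(pairs, left_fields, right_fields):
    """Keep only the required fields of each joined pair."""
    return [tuple(l[f] for f in left_fields) + tuple(r[f] for f in right_fields)
            for l, r in pairs]


# Hypothetical sample records (assumed for illustration only).
L = [{"Stds": "Aj", "Profs": "Jia"},
     {"Stds": "Hiu", "Profs": "Sul"},
     {"Stds": "Su", "Profs": "Mar"}]
R = [{"Stds": "Jia", "Profs": "Sul"},
     {"Stds": "Ling", "Profs": "Su"}]

equi = theta_join(L, R, "Profs")                    # L.Profs = R.Profs
not_equal = theta_join(L, R, "Profs", operator.ne)  # L.Profs <> R.Profs
projected = project(equi, ["Stds"], ["Stds", "Profs"])
```

The nested loop compares every pair of records, which is the cost that the MapReduce strategies in the following sections try to distribute or avoid.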
Among the join algorithms used in the MapReduce literature listed in [11-15], it is believed that the equi-join strategies used in [11] are more efficient than those used in Yahoo Pig, Facebook Hive, and IBM Jaql. This paper focuses on the theta-join implementation strategies proposed by Blanas et al. [11] and Okcan and Riedewald [15]. The theta-join algorithms are analysed in the following sections.

Fig. 2. A simple equi-join example (using equi-join on the field Profs), showing Table L, Table R, and the result of SELECT * FROM L, R WHERE L.Profs=R.Profs

B. All Pair Partition Joins

Given table R having |R| records and table L having |L| records, the cross product of R and L is a set of |R| * |L| records. This traditional join method takes a long time when joining two very large tables. To compute this product in MapReduce, table R and table L are divided into u and v disjoint partitions, respectively. The |R| * |L| records can then be obtained from u*v partition products, and each product partition (1, 1), partition (1, 2), ..., partition (u, v) can be processed by a map or a reduce function. This method is called all pairs partition join in the MapReduce model [16].

Fig. 3. All pairs partition join

Each compound partition is assigned to a map task. The output of the map task is a set of <compound key, tagged record> pairs. A compound key is a combination of the partition names from tables R and L, such as (1, 1), (1, 2), and (1, 3). To identify which table a record comes from, each record from table R or L is tagged with its table name and is then called a tagged record. Each group of <compound key, tagged record> pairs is passed to the reducers. Before reducing, this input data is split back into the records of table R and table L, which are then joined in the same way as in the traditional joining method.
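The all pairs partition scheme described above can be sketched as follows. This is a single-process simulation under stated assumptions: the round-robin split, the function names, and the tuple layout (Stds, Profs) of the sample records are illustrative choices, not part of the original method description.

```python
from collections import defaultdict
from itertools import product


def split(records, n):
    """Split records into n disjoint partitions (round-robin, an arbitrary choice)."""
    return [records[i::n] for i in range(n)]


def map_phase(R, L, u, v):
    """Emit <compound key, tagged record> pairs for every partition pair (i, j)."""
    r_parts, l_parts = split(R, u), split(L, v)
    for (i, r_part), (j, l_part) in product(enumerate(r_parts, 1),
                                            enumerate(l_parts, 1)):
        for rec in r_part:
            yield (i, j), ("R", rec)
        for rec in l_part:
            yield (i, j), ("L", rec)


def reduce_phase(pairs, predicate):
    """Group by compound key, split the group back into R and L, then join."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    result = []
    for key in sorted(groups):
        r_recs = [rec for tag, rec in groups[key] if tag == "R"]
        l_recs = [rec for tag, rec in groups[key] if tag == "L"]
        result.extend((l, r) for l in l_recs for r in r_recs if predicate(l, r))
    return result


def all_pairs_partition_join(R, L, u, v, predicate):
    return reduce_phase(map_phase(R, L, u, v), predicate)


# Hypothetical sample data: (Stds, Profs) tuples, joined on Profs.
R = [("Hin", "Sup"), ("Hiu", "Sup"), ("Jia", "Lo"), ("Ling", "Su")]
L = [("Aj", "Jia"), ("Su", "Mar"), ("Lo", "Sup"), ("Dit", "Su")]
joined = all_pairs_partition_join(R, L, u=2, v=3,
                                  predicate=lambda l, r: l[1] == r[1])
```

Every record of R is copied into v compound partitions and every record of L into u compound partitions, which shows why this strategy ships a lot of data even when few pairs actually match.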
Fig. 4. An example of all pairs partition joins (using equi-join on the field Profs): tables L and R are split into partitions, and each compound key (1,1), (1,2), (1,3), (2,1), (2,2), (2,3) is associated with a list of tagged records

In Fig. 4, each record from table L and table R is tagged with L and R, respectively. These records are called tagged records. Only the compound keys that have records from both table L and table R with the same join key are fed to the reduce functions. In this example, only partition (2, 1) and partition (2, 2) have records from tables R and L that share a join key, and these are used for joining. The remaining partitions are ignored. The disadvantage of this join is that every pair of partitions is enumerated, even though some of them may not be processed by the reducers.

C. Repartition Join

Repartition join is the most used join strategy in MapReduce. Datasets L and R are dynamically split into parts based on the join key, and pairs of partitions from L and R are joined [15]. It has two versions, called standard repartition join and improved repartition join. The standard version is the same as the partitioned sort-merge join that is used in parallel Relational Database Management Systems [11].

In the map phase, each map task works on a block of either table L or table R. To identify which table an input record is from, the map function tags each record with its original table and produces the extracted join key and the tagged record. The output of the map function is a set of <join_key, tagged_record> pairs, where join_key is the attribute used to join the two tables and tagged_record is a compound of the table name and the record. These outputs are then partitioned, sorted, and merged. Then, all records for each join key are grouped together and fed to a reducer. In the reduce phase, for each join key, the reducer first separates and buffers the input records into two sets according to the table tag, and then performs a cross-product between the two sets.

The following example uses the hash function mod(code(upper(left(key,1))),2)+1 for distributing intermediate <key, value> pairs to the reducers (similar to the hash function used earlier).
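The map and reduce phases of the standard repartition join can be sketched as below. This is a single-process sketch, assuming records are (Stds, Profs) tuples joined on Profs. The function names and the sample data are hypothetical, and the shuffle is simulated with nested dictionaries.

```python
from collections import defaultdict


def partition(join_key, k=2):
    """mod(code(upper(left(key,1))),k)+1 from the text."""
    return ord(join_key[0].upper()) % k + 1


def map_record(table, record, key_index):
    """Map: emit <join_key, tagged_record>, where the tag is the table name."""
    return record[key_index], (table, record)


def reduce_key(tagged_records):
    """Reduce: buffer the records of one join key by table, then cross-product."""
    buffers = {"L": [], "R": []}
    for table, record in tagged_records:
        buffers[table].append(record)
    return [(l, r) for l in buffers["L"] for r in buffers["R"]]


def standard_repartition_join(L, R, key_index=1, k=2):
    # Simulated shuffle: reducer id -> join key -> list of tagged records.
    reducers = defaultdict(lambda: defaultdict(list))
    for table, records in (("L", L), ("R", R)):
        for record in records:
            join_key, tagged = map_record(table, record, key_index)
            reducers[partition(join_key, k)][join_key].append(tagged)

    result = []
    for reducer_id in sorted(reducers):
        for join_key in sorted(reducers[reducer_id]):
            result.extend(reduce_key(reducers[reducer_id][join_key]))
    return result


# Hypothetical sample data (not the exact tables of Fig. 5).
L = [("Su", "Mar"), ("Aj", "Sup"), ("Lo", "Su"), ("Sun", "Jia")]
R = [("Hin", "Sup"), ("Hiu", "Sup"), ("Ling", "Su"), ("Bo", "Jia")]
joined = standard_repartition_join(L, R)
```

Note that reduce_key holds both buffers in memory for each join key, which is the behaviour that becomes a problem when one join key has many records.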
Fig. 5. An example of standard repartition joins (using equi-join on the field Profs)

All records from tables L and R are buffered before joining, and that may lead to an insufficient memory problem, as encountered by Yahoo Pig and Facebook Hive [11, 17, 18]. To deal with this, improved repartition join is proposed. In the improved version, the map function is changed. The output key of the map function becomes a composite of the join key and the table tag. The table tags are generated in a way that guarantees that records from table R are stored ahead of those from table L on a given join key. The partition function is also customised so that the hash code is computed from just the join key instead of the composite key. Records are then grouped by just the join key instead of the composite key: the grouping function in the reducer groups records on the join key and ensures that records from table R are stored ahead of those from table L for a given join key. To decrease the buffer size, only the records whose composite key contains all table tags are written into the buffer.
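A possible reading of the improved version is sketched below. The composite key, the partitioning on the join key alone, and the ordering of R ahead of L follow the description above. The choice to buffer only the R records and stream the L records against them is an assumption of this sketch about how that ordering reduces the buffer size. Function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import groupby

# Tags are chosen so that R sorts ahead of L within the same join key.
TAG_ORDER = {"R": 0, "L": 1}


def map_record(table, record, key_index):
    """Map: emit <(join_key, table_tag), record> with a composite key."""
    return (record[key_index], TAG_ORDER[table]), record


def partition(composite_key, k=2):
    """Hash only the join key part of the composite key."""
    join_key, _tag = composite_key
    return ord(join_key[0].upper()) % k + 1


def reduce_partition(pairs):
    """Sort by composite key, group by join key only, buffer R and stream L."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    for _join_key, group in groupby(pairs, key=lambda kv: kv[0][0]):
        r_buffer = []
        for (_key, tag), record in group:
            if tag == TAG_ORDER["R"]:
                r_buffer.append(record)   # R records arrive first
            else:
                for r in r_buffer:        # L records are never buffered
                    yield record, r


def improved_repartition_join(L, R, key_index=1, k=2):
    partitions = defaultdict(list)
    for table, records in (("L", L), ("R", R)):
        for record in records:
            key, value = map_record(table, record, key_index)
            partitions[partition(key, k)].append((key, value))
    result = []
    for pid in sorted(partitions):
        result.extend(reduce_partition(partitions[pid]))
    return result


# Hypothetical sample data (not the exact tables of Fig. 6).
L = [("Su", "Mar"), ("Aj", "Sup"), ("Lo", "Su"), ("Sun", "Jia")]
R = [("Hin", "Sup"), ("Hiu", "Sup"), ("Ling", "Su"), ("Bo", "Jia")]
joined = improved_repartition_join(L, R)
```

Compared with the standard version, the reducer holds at most the R records of one join key at a time, so the memory needed depends on the smaller side of each key group.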
Fig. 6. Example of improved repartition joins (using equi-join on the field Profs)

D. Broadcasting Join

Broadcast join is used when table R is much smaller than table L. Instead of passing both tables R and L across the network, the smaller table is broadcast to the nodes that hold the larger table. This technique reduces sorting time and network traffic. At the beginning of each map function, broadcast join checks whether R is stored on the local file system. If not, it retrieves table R from the distributed file system, splits R into partitions on the join key, and stores these partitions on the local file system. A hash table is built from table L or R, depending on which one is smaller. If R is smaller than a partition of L, then all partitions of R are loaded into memory to build the hash table. The map function then extracts the join key value from each record of L and uses it to probe the hash table and to generate output. If R is bigger than a split of L, joining is not done in the map function. Instead, the map function maps each partition of L with each partition of R using other join strategies, and the results from R and L are joined at the end of the map process.

In Fig. 7 and Fig. 8, table R is smaller than a part of table L, so it is broadcast to each node. The map function loads all records from table R to build a hash table. For each record from a partition of table L, the map function looks up its join key in the hash table and outputs only the records that have a match. All unreferenced records from table L are ignored.

Fig. 7. Building the hash table when R is smaller than any part of L (table R, with fields StdId and subject, is used to build the hash table; distribution function = StdId mod 2)

Fig. 8. Example of broadcasting joins when R is smaller than any part of L (using equi-join)

In some cases, a large portion of table R may not be referenced by any record from table L. For example, R may be a table of users containing millions of records, while L is a table of the activities that users perform during an hour. In this situation, only a few records from table R are referenced by records from table L. However, when joining based on broadcasting, a large number of records of table R are shipped across the network and loaded into the hash table. If these records are not referenced through the join key, the network resources spent on shipping them are wasted.

E. Semi Join

The semi-join, proposed to solve the problem mentioned above, is comprised of three phases.

The first phase runs as a full MapReduce job. In the map function, a main-memory hash table is used to determine the set of unique join key values in a part of table L. By sending only unique join key values to the map output, the number of records that need to be sorted is reduced. The reduce function processes the unique join keys. In Fig. 9, all unique keys are consolidated by a reducer, and the result of this phase is a single file called L.uk.

Fig. 9. Example of the first phase in semi joins (using equi-join)

The second phase, similar to the broadcast join, runs as a map job. First, L.uk is loaded into an in-memory hash table. The map function then iterates over each record of table R and outputs it if its join key can be found in L.uk. Each part of table R produces one file called Ri. The output of this phase is a list of files Ri, as shown in Fig. 10.

Fig. 10. Example of the second phase in semi joins (using equi-join)

In the third phase, all files Ri are joined with table L using broadcast join, as shown in Fig. 11. One challenge of semi join is that not every record in the filtered Ri of R will join with a particular part Li of table L. To solve this issue, per-split semi join is proposed.
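The three phases of the semi join, with a broadcast join as its last phase, can be sketched as follows. This is a single-process sketch under stated assumptions: splits are plain Python lists, the files L.uk and Ri are represented by in-memory sets and lists, and the sample data (a users table R and an activity table L keyed by a user id) is hypothetical, loosely following the users/activities example in the text.

```python
from collections import defaultdict


def unique_keys(l_splits, key_index):
    """Phase 1: collect the unique join keys of every split of L into L.uk."""
    l_uk = set()
    for split in l_splits:
        # Each map task deduplicates its own split before emitting keys.
        l_uk |= {record[key_index] for record in split}
    return l_uk


def filter_r(r_splits, l_uk, key_index):
    """Phase 2: for each split of R, keep only records whose key is in L.uk (file Ri)."""
    return [[record for record in split if record[key_index] in l_uk]
            for split in r_splits]


def broadcast_join(l_split, r_records, l_key, r_key):
    """Build a hash table on the small table R and probe it with each L record."""
    table = defaultdict(list)
    for r in r_records:
        table[r[r_key]].append(r)
    return [(l, r) for l in l_split for r in table.get(l[l_key], [])]


def semi_join(l_splits, r_splits, l_key=0, r_key=0):
    l_uk = unique_keys(l_splits, l_key)
    r_files = filter_r(r_splits, l_uk, r_key)
    r_filtered = [record for r_file in r_files for record in r_file]
    # Phase 3: broadcast the filtered R to every split of L.
    result = []
    for split in l_splits:
        result.extend(broadcast_join(split, r_filtered, l_key, r_key))
    return result


# Hypothetical sample data: L = (user id, action), R = (user id, name).
l_splits = [[(1, "login"), (2, "view")], [(2, "buy"), (5, "view")]]
r_splits = [[(1, "Ann"), (3, "Bo")], [(2, "Cy"), (4, "Di")]]
joined = semi_join(l_splits, r_splits)
```

In this sample, only two of the four R records survive phase 2, so only those two would be shipped to the nodes holding L in phase 3.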

Fig. 11. Example of the last phase in semi joins (using equi-join)

F. Per-Split Semi Join

Per-split semi join consists of three phases. The first and the last phases are map jobs, and the second phase is a full MapReduce job. The first phase generates the set of unique join keys in a split Li of table L and stores them in the distributed file system, in a file called Li.uk. The second phase loads all records from a split of table R into a main-memory hash table, reads the unique keys from file Li.uk, and probes the hash table for matching records from R. Each matched record is output with a tag RLi, which is used by the reduce function to collect all records from table R that will join with Li. In the last phase, the results of the second phase and Li are joined directly, as shown in Fig. 12 and Fig. 13.

Fig. 12. Example of the first phase and second phase in Per-Split Semi Join

Fig. 13. Example of the last phase in Per-Split Semi Join (using equi-join)

IV. CONCLUSION

Many big data mining problems can be solved by using MapReduce together with a Key-Value store. Based on the advantages and drawbacks of the join strategies explained above, in terms of time and network resource consumption, we provide a comparison of the join strategies in Table 1.

TABLE 1. COMPARISON OF JOIN STRATEGIES

Strategy | Pros/Cons | Suggestion
All pair partition | Easy to implement. All compound partitions are transferred to the reducers, but some may not be processed by the reducers. | Use when the two datasets have much data in common and are sorted by the same fields.
Standard repartition | Easy to implement. All records from both tables are buffered before joining, which may lead to an insufficient memory problem. | Same as all pair partition.
Improved repartition | Reduces the buffer size. Implementation is more complex than the standard version. | Use when the two joined datasets have little data in common and are sorted by the same fields.
Broadcasting | Reduces sorting time and network traffic. May waste network resources. | Use when one table is much smaller than the other table.
Semi-join | Some records from parts of a table broadcast to the other table may not be joined. | Use when a large portion of a table may not be referenced by any record from the other table.
Per-split semi join | Complicated implementation, with more reading and writing operations. | Same as semi join.

Which strategy should be used for a given problem depends on the nature of the data and the available network resources. If the two joined tables have much data in common, or sufficient network resources are available, all pair partition join or repartition join should be used because their implementation is not as complex as the others. If the two joined tables have little data in common, or network resources are inadequate, broadcasting join, semi join, or per-split semi join should be used because they may reduce time and resource consumption.

Data in a NoSQL database can be structured, semi-structured, or unstructured, and can be stored in many types of data structures, such as an indexed table of a relational database, B-Tree, queue, or hash table. Therefore, in addition to the considerations presented in this paper, the selection of join strategies is also affected by data structures. MapReduce programmers may also need to consider data access time and data sorting time when selecting a joining strategy. This issue is beyond the scope of this paper and is left for future research.

REFERENCES

[1] Mapanga, I. and P. Kadebu, Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR).
[2] Hecht, R. and S. Jablonski, NoSQL evaluation: A use case oriented survey, in Cloud and Service Computing (CSC), 2011 International Conference on.
[3] Dean, J. and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI '04: Sixth Symposium on Operating Systems Design and Implementation. 2004, USENIX: San Francisco, California, USA.
[4] Dean, J. and S. Ghemawat, MapReduce: simplified data processing on large clusters, in Communications of the ACM - 50th anniversary issue.
[5] Celko, J., Chapter 6. Key Value Stores, in Joe Celko's complete guide to NoSQL: what every SQL professional needs to know about nonrelational databases, A. Dierna and H. Scherer, Editors. 2014, Morgan Kaufmann, Elsevier: USA.
[6] Oracle, Chapter 1. Introduction to Berkeley DB, in Oracle Berkeley DB: Getting Started with Berkeley DB for C.
[7] Jadhav, V., J. Aghav, and S. Dorwani, Join Algorithms Using MapReduce: A Survey, in International Conference on Electrical Engineering and Computer Science. 2013, IOAJ INDIA: Coimbatore, Tamil Nadu, India.
[8] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Second Edition, M. Loukides, Editor. 2011, O'Reilly Media, Inc.: USA.
[9] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA.
[10] White, T., Chapter 7. MapReduce Types and Formats, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA.
[11] Blanas, S., et al., A comparison of join algorithms for log processing in MapReduce, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010, ACM: Indianapolis, Indiana, USA.
[12] Özsu, M.T. and P. Valduriez, Chapter 3. Distributed Database Design, in Principles of Distributed Database Systems, Third Edition. 2011, Springer New York.
[13] Bernstein, P.A., et al., Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst., (4).
[14] Lee, K.-H., et al., Parallel data processing with MapReduce: a survey. SIGMOD Rec., (4).
[15] Okcan, A. and M. Riedewald, Processing theta-joins using MapReduce, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. 2011, ACM: Athens, Greece.
[16] Shim, K., MapReduce algorithms for big data analysis, in Proceedings of the VLDB Endowment. 2012, VLDB Endowment.
[17] Olston, C., et al., Pig latin: a not-so-foreign language for data processing, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, ACM: Vancouver, Canada.
[18] Hive, A., Theta Join.