MapReduce Join Strategies for Key-Value Storage

Duong Van Hieu, Sucha Smanchat, and Phayung Meesad
Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, Bangkok 10800, Thailand

Abstract: This paper analyses the MapReduce join strategies used for big data analysis and mining, known as map-side and reduce-side joins. The most widely used joins are analysed, namely theta-join algorithms including all pair partition join, repartition join, broadcasting join, semi join, and per-split semi join. The paper can serve as a guideline for MapReduce application developers when selecting join strategies. The analysis of the join strategies is accompanied by comprehensive examples.

Keywords: MapReduce; join strategy; NoSQL

I. INTRODUCTION

With the continuous development of big data and cloud computing, traditional database technologies are widely regarded as insufficient for data storage and access, and for the associated performance and flexibility requirements. In the new era of big data, NoSQL databases are considered more appropriate than relational databases [1]. A Key-Value store, one kind of NoSQL database, is an appropriate choice for applications that use the MapReduce model for distributed processing. Key-Value stores offer only four underlying operators: inserting <key, value> pairs into a data collection, updating the values of existing pairs, finding the values associated with a specific key, and deleting pairs from a data collection [2]. Joining two data collections to produce a new dataset based on joining fields is therefore the responsibility of programmers or application developers rather than of the database management system. However, several join strategies exist, each with different advantages and disadvantages. To give programmers a guideline for selecting a join strategy, this study analyses several joining strategies for big data analysis and mining, accompanied by comprehensive examples. The content of this paper is organised into four main sections.
Section 2 gives an overview of the MapReduce programming model, Section 3 explains MapReduce join strategies, and Section 4 concludes with a comparison of the join strategies used in MapReduce.

II. MAPREDUCE OVERVIEW

MapReduce has been used at Google since February 2003. It was first introduced in 2004 by Dean and Ghemawat [3] and later presented in Communications of the ACM in 2008 [4]. It is used for processing large datasets in a parallel or distributed computing environment and is a combination of map processes and reduce processes. A map process is a function that processes a set of input <key, value> pairs, which is a portion of a large input dataset, to generate a set of intermediate <key, value> pairs. A reduce process, with a reduce function, merges all intermediate values generated by the map processes that are associated with the same intermediate key to form a possibly smaller set of <key, value> pairs, called the final output <key, value> pairs.

Fig. 1 is a simple word counting example. The input string "Advanced Research Methodology, Advanced Information Modelling and Database, Advanced Network and Information Security, Advanced Database and Distributed Systems" is divided into four blocks, one for each subject name separated by commas. A hash function mod(code(upper(left(key,1))),k)+1 is used to distribute the intermediate <key, value> pairs among the reduce tasks. Here left(key,1) takes the first letter of key, upper(x) changes x to upper case, code(x) takes the ASCII code of character x, and mod(m, k) returns the remainder after m is divided by k.
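The word counting example and its hash function can be sketched in a few lines of Python. This is a minimal single-process illustration, not a Hadoop program: the function names `partition`, `map_block`, and `word_count` are hypothetical, and k = 4 is assumed because Fig. 1 shows four reduce tasks.

```python
from collections import defaultdict


def partition(key: str, k: int) -> int:
    """mod(code(upper(left(key,1))),k)+1 from the text: a reducer index in 1..k."""
    return ord(key[0].upper()) % k + 1


def map_block(block: str):
    """Map: emit an intermediate <word, 1> pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def word_count(text: str, k: int = 4):
    """Run map, shuffle, and reduce in memory; returns {group: {word: count}}."""
    blocks = [b.strip() for b in text.split(",")]
    # Shuffle: route each intermediate pair to the reduce group chosen by the hash.
    groups = defaultdict(lambda: defaultdict(list))
    for block in blocks:
        for word, one in map_block(block):
            groups[partition(word, k)][word].append(one)
    # Reduce: sum the values collected for each key inside each group.
    return {g: {w: sum(v) for w, v in words.items()} for g, words in groups.items()}


TEXT = ("Advanced Research Methodology, Advanced Information Modelling and Database, "
        "Advanced Network and Information Security, Advanced Database and Distributed Systems")
result = word_count(TEXT)
```

Because the hash looks only at the first letter, the load is uneven: with k = 4 every word starting with A, I, or M lands in the same group, which is worth keeping in mind when choosing a partition function for real data.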
Fig. 1. Map and reduce processes of a simple word counting example

III. MAPREDUCE KEY JOIN STRATEGIES

Physically, data in a Key-Value format can be stored in a data structure such as a B-Tree, queue, or hash table [5, 6]. Logically, each record in a Key-Value store is a single entry consisting of a key and a value. For ease of understanding, a set of <key, value> pairs, called a data collection, can be considered as a two-column table. The first column stores keys and the second, which can be a combination of more than two columns, stores the values associated with the keys.

Joins using MapReduce can be categorized as map-side join, reduce-side join, memory-backed join, join using Bloom Filter, and map-reduce merge join [7]. However, this paper follows the categories proposed by Tom White [8, 9], which group joins into two types: map-side joins and reduce-side joins. Map-side joins are performed by mappers and are used to join two large input datasets before the data is fed to the map functions. Reduce-side joins are performed by reducers and are more general than map-side joins because the inputs do not need to be structured in any particular way [9]. In some cases, reduce-side joins are less efficient than map-side joins because the datasets go through the MapReduce shuffle.

Reduce-side joining involves several components, namely multiple inputs and secondary sorting [8]. Multiple inputs means that inputs from different sources can have different formats or representations. To deal with this situation, the inputs need to be parsed separately. Hadoop provides this kind of parsing on a per-path basis [10]. Secondary sorting is needed when reducers obtain inputs from two sources that may be sorted in different orders. For example, when the first dataset comes from source A sorted by key1 and the second dataset comes from source B sorted by key2, the merged data should be sorted by a composite key (key1, key2) before reducing.

A. Theta Joins

Theta join is a kind of join that uses comparison operators such as <, <=, >, >=, =, <> in the join predicates. Among these, equi-join is the most used for joining two datasets to obtain the intersection between them. Fig. 2 is an example of equi-join. This join matches every record from table L with every record from table R that has the same value of the join field. The results of joining can be projected to eliminate redundant fields and produce only the required fields.
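A theta join with a pluggable comparison operator, followed by a projection, can be sketched as a plain nested loop. The sketch below is only an in-memory illustration under stated assumptions: `theta_join`, `project`, and the sample records are hypothetical and are not the data of Fig. 2.

```python
import operator

# Comparison operators allowed in a theta-join predicate.
THETA_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
             ">=": operator.ge, "=": operator.eq, "<>": operator.ne}


def theta_join(left, right, op, l_field=1, r_field=1):
    """Return every (l, r) pair whose join fields satisfy `l op r`."""
    compare = THETA_OPS[op]
    return [(l, r) for l in left for r in right if compare(l[l_field], r[r_field])]


def project(pairs, keep):
    """Keep only the requested (side, field) positions of each joined pair."""
    return [tuple(pair[side][field] for side, field in keep) for pair in pairs]


# Hypothetical (student, professor) records.
L = [("s1", "p1"), ("s2", "p2"), ("s3", "p1")]
R = [("s4", "p1"), ("s5", "p3")]

equi = theta_join(L, R, "=")                      # equi-join on the professor field
students = project(equi, keep=[(0, 0), (1, 0)])   # keep only the two student fields
```

The nested loop compares every pair of records, which is exactly the cost the MapReduce strategies in the following sections try to distribute or avoid.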
Among the join algorithms in the MapReduce literature listed in [11-15], the equi-join strategies used in [11] are believed to be more efficient than those used in Yahoo Pig, Facebook Hive, and IBM Jaql. This paper focuses on the theta-join implementation strategies proposed by Blanas et al. [11] and Okcan and Riedewald [15]. Theta join algorithms are analysed in the following sections.

Fig. 2. A simple equi-join example (using equi-join on the field Profs): SELECT * FROM L, R WHERE L.Profs = R.Profs

B. All Pair Partition Joins

Given table R having |R| records and table L having |L| records, the cross product of R and L is a set of |R| * |L| records. This traditional join method takes a long time when joining two very large tables. To compute this product in MapReduce, table R and table L are divided into u and v disjoint partitions, respectively. The |R| * |L| records can then be obtained from u*v partial products; each product partition (1, 1), partition (1, 2), ..., partition (u, v) can be processed by a map or a reduce function. This method is called all pairs partition join in the MapReduce model [16].

Fig. 3. All pairs partition join

Each compound partition is assigned to a map task. The output of the map task is a set of <compound key, tagged record> pairs. A compound key is a combination of the partition names from tables R and L, such as (1, 1), (1, 2), and (1, 3). To identify which record comes from which table, each record from table R or L is tagged with its table name and is called a tagged record. Each group of <compound key, tagged record> pairs is passed to the reducers. Before reducing, the input data is split back into records of table R and records of table L, which are then joined in the same way as in the traditional joining method.
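The partitioning, tagging, and reducing steps described above can be sketched as follows. This is a minimal in-memory illustration, assuming round-robin partitioning and an equi-join predicate; the helper names (`split`, `map_compound`, `reduce_compound`) and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import product


def split(table, n):
    """Divide a table into n disjoint partitions (round-robin, an assumption)."""
    return [table[i::n] for i in range(n)]


def map_compound(r_parts, l_parts):
    """Map: emit <compound key, tagged record> for every pair of partitions."""
    out = defaultdict(list)
    for (i, r_part), (j, l_part) in product(enumerate(r_parts, 1), enumerate(l_parts, 1)):
        out[(i, j)] += [("R", rec) for rec in r_part]
        out[(i, j)] += [("L", rec) for rec in l_part]
    return out


def reduce_compound(tagged, key_field=1):
    """Reduce: split tagged records back into R and L, then join them."""
    r_recs = [rec for tag, rec in tagged if tag == "R"]
    l_recs = [rec for tag, rec in tagged if tag == "L"]
    return [(l, r) for l in l_recs for r in r_recs if l[key_field] == r[key_field]]


def all_pairs_partition_join(r_table, l_table, u, v):
    """Join via u*v compound partitions; also report how many were enumerated."""
    groups = map_compound(split(r_table, u), split(l_table, v))
    joined = []
    for key in sorted(groups):
        joined += reduce_compound(groups[key])
    return joined, len(groups)


# Hypothetical (student, professor) records.
R = [("s4", "p1"), ("s5", "p3"), ("s6", "p2")]
L = [("s1", "p1"), ("s2", "p2"), ("s3", "p1")]
joined, n_partitions = all_pairs_partition_join(R, L, u=2, v=2)
```

Because the partitions are disjoint, each pair of records meets in exactly one compound partition, so the union of the partial results equals the ordinary join. Every record of R is nevertheless copied v times and every record of L u times, whether or not it finds a partner.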
Fig. 4. An example of all pairs partition joins (using equi-join on the field Profs)

In Fig. 4, each record from table L and table R is tagged with L and R, respectively. These records are called tagged records. Only compound keys that have records from both table L and table R with the same join key are fed to the reduce functions. In this example, only partition (2, 1) and partition (2, 2) contain records from tables R and L that share a join key, and these are used for joining. The remaining partitions are ignored. The disadvantage of this joining method is that every pair of partitions is enumerated, even though some pairs may never be processed by the reducers.

C. Repartition Join

Repartition join is the most used join strategy in MapReduce. Datasets L and R are dynamically split into parts based on the join key, and pairs of partitions from L and R are joined [15]. It has two versions, called standard repartition join and improved repartition join. The standard version is the same as the partitioned sort-merge join used in parallel Relational Database Management Systems [11].

In the map phase, each map task works on a block of either table L or table R. To identify which table an input record comes from, the map function tags each record with its original table and produces the extracted join key and the tagged record. The output of the map function is a set of <join_key, tagged_record> pairs, where join_key is the attribute used to join the two tables and tagged_record is a compound of the table name and the record. These outputs are then partitioned, sorted, and merged. Then, all records for each join key are grouped together and fed to a reducer. In the reduce phase, for each join key, the reducer first separates and buffers the input records into two sets according to their table tag, and then performs a cross-product between the two sets. The following example uses the hash function mod(code(upper(left(key,1))),2)+1 to distribute the intermediate <key, value> pairs to each reducer (similar to the hash function used earlier).
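The map, shuffle, and reduce phases of the standard repartition join can be sketched as below. It is a single-process illustration rather than a Hadoop job; the function names and the sample (student, professor) records are hypothetical, and the partition function is the hash function quoted in the text with k = 2.

```python
from collections import defaultdict
from itertools import product


def partition(key, k=2):
    """Hash function from the text: mod(code(upper(left(key,1))),k)+1."""
    return ord(key[0].upper()) % k + 1


def map_tag(table_name, block, key_field=1):
    """Map: emit <join_key, tagged_record>; the tag is the source table name."""
    return [(rec[key_field], (table_name, rec)) for rec in block]


def shuffle(pairs, k=2):
    """Partition by join key, then sort and group the records of each reducer."""
    reducers = defaultdict(lambda: defaultdict(list))
    for key, tagged in pairs:
        reducers[partition(key, k)][key].append(tagged)
    return {r: dict(sorted(groups.items())) for r, groups in reducers.items()}


def reduce_join(tagged_records):
    """Reduce: buffer both sides for one join key, then take their cross product."""
    l_buf = [rec for tag, rec in tagged_records if tag == "L"]
    r_buf = [rec for tag, rec in tagged_records if tag == "R"]
    return list(product(l_buf, r_buf))


def standard_repartition_join(l_blocks, r_blocks, k=2):
    pairs = [p for b in l_blocks for p in map_tag("L", b)]
    pairs += [p for b in r_blocks for p in map_tag("R", b)]
    joined = []
    for _, groups in sorted(shuffle(pairs, k).items()):
        for tagged_records in groups.values():
            joined += reduce_join(tagged_records)
    return joined


# Hypothetical input blocks of (student, professor) records.
L_BLOCKS = [[("s1", "Ann")], [("s2", "Bob"), ("s3", "Ann")]]
R_BLOCKS = [[("s4", "Ann")], [("s5", "Cy")]]
joined = standard_repartition_join(L_BLOCKS, R_BLOCKS)
```

Note that `reduce_join` holds both `l_buf` and `r_buf` in memory for a join key before producing any output, which is the buffering cost of the standard version.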
Fig. 5. An example of standard repartition joins (using equi-join on the field Profs)

All records from tables L and R are buffered before joining, which may lead to an insufficient memory problem, as encountered by Yahoo Pig and Facebook Hive [11, 17, 18]. To deal with this, the improved repartition join was proposed. In the improved version, the map function is changed: its output key becomes a composite of the join key and the table tag. The table tags are generated in a way that guarantees that records from table R are stored ahead of those from table L for a given join key. The partition function is also customised so that the hash code is computed from just the join key instead of the composite key. Records are then grouped by just the join key instead of the composite key. The grouping function in the reducer groups records on the join key and ensures that records from table R are stored ahead of those from table L for a given join key. To decrease the buffer size, only records whose composite key contains all table tags are written into the buffer.
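One way to realise the composite key, custom partitioner, and grouping on the join key is sketched below. It is an assumption-laden illustration: the names are hypothetical, the partition function is a simplified first-letter hash, and the reducer buffers only the R records of a group and streams the L records past them, which is one common reading of the buffer-reduction idea rather than a confirmed detail of the cited implementation.

```python
from itertools import groupby

TAG_ORDER = {"R": 0, "L": 1}  # R sorts ahead of L for the same join key


def map_composite(table_name, block, key_field=1):
    """Map: emit <(join_key, tag), record> with a composite output key."""
    return [((rec[key_field], TAG_ORDER[table_name]), rec) for rec in block]


def partition(composite_key, k=2):
    """Custom partitioner: hash only the join key, ignoring the table tag."""
    join_key, _tag = composite_key
    return ord(join_key[0].upper()) % k


def reduce_stream(records):
    """Reduce one join-key group: buffer R records only, stream L records."""
    r_buf, out = [], []
    for (_, tag), rec in records:
        if tag == TAG_ORDER["R"]:
            r_buf.append(rec)
        else:
            out += [(rec, r) for r in r_buf]
    return out


def improved_repartition_join(l_blocks, r_blocks, k=2):
    pairs = [p for b in l_blocks for p in map_composite("L", b)]
    pairs += [p for b in r_blocks for p in map_composite("R", b)]
    joined = []
    for reducer in range(k):
        # Sort on the full composite key, but group on the join key alone.
        mine = sorted((p for p in pairs if partition(p[0], k) == reducer),
                      key=lambda p: p[0])
        for _, group in groupby(mine, key=lambda p: p[0][0]):
            joined += reduce_stream(group)
    return joined


# Hypothetical input blocks of (student, professor) records.
L_BLOCKS = [[("s1", "Ann")], [("s2", "Bob"), ("s3", "Ann")]]
R_BLOCKS = [[("s4", "Ann")], [("s5", "Cy")]]
joined = improved_repartition_join(L_BLOCKS, R_BLOCKS)
```

Sorting on the composite key while grouping on the join key is what lets the reducer see all R records of a group before the first L record, so only one side ever has to be held in memory.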
Fig. 6. Example of improved repartition joins (using equi-join on the field Profs)

D. Broadcasting Join

Broadcast join is used when table R is much smaller than table L. Instead of passing both tables R and L across the network, the smaller table is broadcast to the larger table. This technique reduces sorting time and network traffic. At the beginning of each map function, broadcast join checks whether R is stored on the local file system. If not, it retrieves table R from the distributed file system, splits R into partitions on the join key, and stores these partitions on the local file system. A hash table is built from table L or R, depending on which one is smaller. If R is smaller than a partition of L, all partitions of R are loaded into memory to build the hash table. The map function then extracts the join key from each record of L and uses it to probe the hash table and generate the join output. If R is bigger than a split of L, joining is not done in the map function. Instead, the map function pairs each partition of L with each partition of R using other strategies, and the results from R and L are joined at the end of the map process.

In Fig. 7 and Fig. 8, table R is smaller than a part of table L, so it is broadcast to each node. The map function loads all records from table R to build a hash table. For each record from a partition of table L, the map function looks up its join key in the hash table and outputs only the records for which a match is found. All unreferenced records from table L are ignored.

Fig. 7. Building hash table when R is smaller than any part of L (table R is used to build the hash table; distribution function = StdId mod 2)

Fig. 8. Example of broadcasting joins when R is smaller than any part of L (using equi-join)

In some cases, a large portion of table R may not be referenced by any record from table L. For example, R may be a table of users containing millions of records, while L is a table of the activities that users perform during one hour. In this situation, only a few records from table R are referenced by records from table L. However, when joining by broadcasting, a large number of records of table R are shipped across the network and loaded into the hash table. If these records are not referenced by any join key, the network resources used for shipping them are wasted.

E. Semi Join

The semi-join, proposed to solve the problem mentioned above, comprises three phases. The first phase runs as a full MapReduce job. In the map function, a main-memory hash table is used to determine the set of unique join key values in a part of table L. By sending only the unique key values to the map output, the number of records that need to be sorted is reduced. The reduce function processes the unique join keys. In Fig. 9, all unique keys are consolidated by a single reducer, and the result of this phase is a single file called L.uk.

Fig. 9. Example of the first phase in semi joins (using equi-join)

The second phase, similar to the broadcast join, runs as a map-only job. First, L.uk is loaded into an in-memory hash table; the map function then iterates over each record from table R and outputs it if its join key can be found in L.uk. Each part of table R produces one file called Ri. The output of this phase is a list of files Ri, as shown in Fig. 10.

Fig. 10. Example of the second phase in semi joins (using equi-join)

In the third phase, all files Ri are joined with table L using broadcast join, as shown in Fig. 11. One challenge of semi-join is that not every record in the filtered Ri of R will join with a particular part Li of table L. To solve this issue, per-split semi-join is proposed.
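The broadcast join of Section D and the three semi-join phases of Section E can be sketched together, since the third phase reuses the broadcast join. The sketch is an in-memory illustration only: the function names, the choice of field 0 as join key, and the sample user and activity records are hypothetical, and the case where R is bigger than a split of L is not covered.

```python
from collections import defaultdict


def broadcast_join(l_split, r_table, l_key=0, r_key=0):
    """Map-side join: build a hash table on the small table R, probe it with L."""
    table = defaultdict(list)
    for r in r_table:
        table[r[r_key]].append(r)
    # L records whose key is missing from the hash table produce no output.
    return [(l, r) for l in l_split for r in table.get(l[l_key], [])]


def semi_join(l_splits, r_splits, l_key=0, r_key=0):
    # Phase 1 (full MapReduce job): unique join keys of each L split,
    # consolidated by one reducer into a single set standing in for L.uk.
    l_uk = set()
    for l_split in l_splits:
        l_uk |= {l[l_key] for l in l_split}
    # Phase 2 (map-only): keep the records of each R split whose key is in L.uk.
    r_files = [[r for r in r_split if r[r_key] in l_uk] for r_split in r_splits]
    # Phase 3: broadcast all filtered files Ri to every split of L and join.
    r_small = [r for r_i in r_files for r in r_i]
    joined = [pair for l_split in l_splits
              for pair in broadcast_join(l_split, r_small, l_key, r_key)]
    return joined, r_small


# Hypothetical data: R is a user table, L is an activity log keyed by user id.
R_SPLITS = [[(1, "u1"), (2, "u2")], [(3, "u3"), (4, "u4")]]
L_SPLITS = [[(2, "login"), (4, "view")], [(2, "logout")]]
joined, shipped = semi_join(L_SPLITS, R_SPLITS)
```

In this example only two of the four user records are shipped in the third phase, at the price of two extra passes over the data to build L.uk and the filtered files.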

Fig. 11. Example of the last phase in semi joins (using equi-join)

F. Per-Split Semi Join

Per-split semi-join consists of three phases. The first and the last phases are map-only jobs, and the second phase is a full MapReduce job. The first phase generates the set of unique join keys in a split Li of table L and stores them in the distributed file system, in a file called Li.uk. The second phase loads all records from a split of table R into a main-memory hash table, reads the unique keys from file Li.uk, and probes the hash table for matching records from R. Each matched record is output with a tag RLi, which is used by the reduce function to collect all records from table R that will join with Li. In the last phase, the results of the second phase and Li are joined directly, as shown in Fig. 12 and Fig. 13.

Fig. 12. Example of the first phase and second phase in per-split semi join

Fig. 13. Example of the last phase in per-split semi join (using equi-join)

IV. CONCLUSION

Many big data mining problems can be solved by using MapReduce together with a Key-Value store. Based on the advantages and drawbacks of the join strategies explained above, in terms of time and network resource consumption, we provide a comparison of the join strategies in Table 1.

TABLE 1. COMPARISON OF JOIN STRATEGIES

Strategy | Pros/Cons | Suggestion
All pair partition join | Easy to implement; compound partitions transferred to reducers may not all be processed by the reducers. | Use when the two datasets have more data in common and are sorted by the same fields.
Standard repartition join | Easy to implement; all records from both tables are buffered before joining, which may lead to an insufficient memory problem. | Same as all pair partition join.
Improved repartition join | Reduces buffer size; implementation is more complex than the standard version. | Use when the two joined datasets have few data in common and are sorted by the same fields.
Broadcasting join | Reduces sorting time and network traffic; may waste network resources. | Use when one table is much smaller than the other table.
Semi-join | Some records from parts of a table broadcast to another table may not be joined. | Use when a large portion of a table may not be referenced by any record from the other table.
Per-split semi-join | Complicated implementation; more reading and writing operations. | Same as semi-join.

Which join strategy should be used for a given problem depends on the nature of the data and the available network resources. If the two joined tables have more data in common, or sufficient network resources are available, all pair partition join or repartition join should be used because their implementation is not as complex as the others. If the two joined tables have few data in common, or network resources are inadequate, broadcasting join, semi-join, or per-split semi-join should be used because they may reduce time and resource consumption.

Data in a NoSQL database can be structured, semi-structured, or unstructured, and can be stored in many types of data structures, such as an indexed table of a relational database, B-Tree, queue, or hash table. Therefore, in addition to the considerations presented in this paper, the selection of join strategies is also affected by data structures. MapReduce programmers may also need to consider data access time and data sorting time when selecting a joining strategy. This issue is beyond the scope of this paper and is left for future research.

REFERENCES

[1] Mapanga, I. and P. Kadebu, Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR).
[2] Hecht, R. and S. Jablonski, NoSQL evaluation: A use case oriented survey, in Cloud and Service Computing (CSC), 2011 International Conference on.
[3] Dean, J. and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in OSDI '04: Sixth Symposium on Operating Systems Design and Implementation. 2004, USENIX: San Francisco, California, USA.
[4] Dean, J. and S. Ghemawat, MapReduce: simplified data processing on large clusters, in Communications of the ACM - 50th anniversary issue.
[5] Celko, J., Chapter 6. Key Value Stores, in Joe Celko's complete guide to NoSQL: what every SQL professional needs to know about nonrelational databases, A. Dierna and H. Scherer, Editors. 2014, Morgan Kaufmann, Elsevier: USA.
[6] Oracle, Chapter 1. Introduction to Berkeley DB, in Oracle Berkeley DB: Getting Started with Berkeley DB for C.
[7] Jadhav, V., J. Aghav, and S. Dorwani, Join Algorithms Using MapReduce: A Survey, in International Conference on Electrical Engineering and Computer Science. 2013, IOAJ INDIA: Coimbatore, Tamil Nadu, India.
[8] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Second Edition, M. Loukides, Editor. 2011, O'Reilly Media, Inc.: USA.
[9] White, T., Chapter 8. MapReduce Features, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA.
[10] White, T., Chapter 7. MapReduce Types and Formats, in Hadoop: The Definitive Guide, Third Edition, M. Loukides and M. Blanchette, Editors. 2012, O'Reilly Media, Inc.: USA.
[11] Blanas, S., et al., A comparison of join algorithms for log processing in MapReduce, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010, ACM: Indianapolis, Indiana, USA.
[12] Özsu, M.T. and P. Valduriez, Chapter 3. Distributed Database Design, in Principles of Distributed Database Systems, Third Edition. 2011, Springer New York.
[13] Bernstein, P.A., et al., Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst., (4).
[14] Lee, K.-H., et al., Parallel data processing with MapReduce: a survey. SIGMOD Rec., (4).
[15] Okcan, A. and M. Riedewald, Processing theta-joins using MapReduce, in Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. 2011, ACM: Athens, Greece.
[16] Shim, K., MapReduce algorithms for big data analysis, in Proceedings of the VLDB Endowment. 2012, VLDB Endowment.
[17] Olston, C., et al., Pig latin: a not-so-foreign language for data processing, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, ACM: Vancouver, Canada.
[18] Hive, A., Theta Join


More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Big Data Management. Big Data Management. (BDM) Autumn 2013. Povl Koch September 16, 2013 15-09-2013 1

Big Data Management. Big Data Management. (BDM) Autumn 2013. Povl Koch September 16, 2013 15-09-2013 1 Big Data Management Big Data Management (BDM) Autumn 2013 Povl Koch September 16, 2013 15-09-2013 1 Overview Today s program 1. Little more practical details about this course 2. Chapter 7 in NoSQL Distilled

More information

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information
