A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys
Sue-Chen Hsueh, Ming-Yen Lin*, and Yi-Chun Chiu
Dept. of Information Management, Chaoyang University of Technology, Taichung, Taiwan
Dept. of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
[email protected]; [email protected]; [email protected]

Copyright (c) 2014, Australian Computer Society, Inc. This paper appeared at the Twelfth Australasian Symposium on Parallel and Distributed Computing (AusPDC 2014), Auckland, New Zealand, January 2014. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 152. B. Javadi and S. K. Garg, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

Abstract

Entity resolution (ER), which detects records referring to the same entity across data sources, is a long-lasting challenge in database management research. The sheer volume of today's data collections calls for a blocking-based ER algorithm using the MapReduce framework for cloud computing. Most studies on blocking-based ER assume that only one blocking key is associated with an entity. In reality, an entity may have multiple blocking keys in some applications. When entities have a number of blocking keys, ER can be more efficient, since two entities can form a similar pair only if they share several common keys. Therefore, we propose a MapReduce algorithm to solve the ER problem for a huge collection of entities with multiple keys. The algorithm is characterized by combination-based blocking and load-balanced matching. The combination-based blocking utilizes the multiple keys to sort out the necessary pairs for future matching. The load-balanced matching evenly distributes the required similarity computations to all the reducers in the matching step, so as to remove the bottleneck of skewed matching computations on a single node in a MapReduce framework. Our experiments using the well-known CiteSeerX digital library show that the proposed algorithm is both efficient and scalable.

Keywords: Entity resolution, cloud computing, MapReduce, load-balance.

1 Introduction

Entity resolution (abbreviated as ER), also known as data matching, de-duplication, identity resolution, or record linkage, is to identify all the manifestations referring to the same entity (Hernández and Stolfo 1995; Brizan and Tansel 2006; Chaudhuri, Ganti, and Kaushik 2006; Elsayed et al. 2008; Vernica et al. 2010; Zhang et al. 2010). ER is crucial for data integration and has been applied in many applications including price comparison, citation matching (Pasula et al. 2002), etc. Take a price-comparison website for example: to provide all the prices of a product crawled from different web-pages, the website needs to identify whether two product-pages from two product websites refer to the same product. Given n1 product-pages in website P and n2 product-pages in website Q, n1 × n2 similarity computations are required to identify the pages for the same products by entity resolution. The total number of similarity computations can be massive, considering the scale of web-pages nowadays. In general, blocking-based ER is used to accelerate the similarity computations. Each entity in blocking-based ER has an attribute, or a specific value computed by hashing or other means, called the blocking key.
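To make the quadratic cost concrete before turning to blocking, the following is a minimal single-machine sketch of naive pairwise ER between two sources P and Q; the data, class, and method names are illustrative, not from the paper. Every entity of P is compared with every entity of Q, giving n1 × n2 similarity computations.

```java
import java.util.List;

public class NaivePairwiseER {
    // Placeholder similarity measure; any real measure (e.g. Jaccard) fits here.
    static double sim(String a, String b) {
        return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        List<String> p = List.of("iphone 13 128gb", "galaxy s21");  // website P
        List<String> q = List.of("iPhone 13 128GB", "Pixel 6");     // website Q
        double theta = 0.5;                                          // threshold
        for (String ep : p)            // n1 iterations ...
            for (String eq : q)        // ... times n2 iterations
                if (sim(ep, eq) >= theta)
                    System.out.println("same product: " + ep + " / " + eq);
    }
}
```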
The blocking-based ER comprises a blocking step and a matching step. The blocking step partitions all the entities in the dataset by blocking keys into several groups called blocks, where the entities in a block have the same blocking key. After that, the matching step performs similarity computations for all the pairs in each block and determines the similarities between the entities within each block. The result is the set of all pairs of high similarity. The blocking key thus reduces the total number of similarity computations required in blocking-based ER.

Most studies on blocking-based ER assume that only one blocking key is associated with an entity. In reality, an entity may have several blocking keys. An entity can be a record in a database, a web-page in a web-site, an article in a research repository, etc. A product page may be classified into multiple categories, so as to have several attribute keys; an article in DBLP or CiteSeerX generally contains many keywords or index terms. In addition, the titles and some attributes of the articles, such as authors, can serve as blocking keys, too. When the entities have a number of blocking keys, blocking-based ER can be performed more efficiently, since two entities can be a similar pair only if they share several common keys. Nevertheless, the blocking step of ER becomes more time-consuming if the multiple blocking keys are treated as many individual blocking keys, which is the approach adopted by most ER algorithms today. The traditional solution generates a pair for further matching whenever two entities share a common blocking key. Because more than one key is now associated with an entity, the possibility of sharing common keys with other entities, which also contain many keys, is greatly increased. Consequently, the number of pairs generated for matching is greatly increased. Considering that blocking keys represent specific features of entities, multiple blocking keys can be used in the blocking step to sort out the entities that need to be matched, so that the matching step can be accelerated.

A typical ER problem is pairwise document similarity (Baraglia et al. 2010; Lin 2009; Elsayed et al. 2008), which detects similar documents in a collection like DBLP, CiteSeerX, or Google Scholar. The collection of articles is continuously growing as research articles keep accumulating at a surprising speed.
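For reference, here is a minimal single-machine sketch of the single-key blocking-based ER just described; the data and names are illustrative. Grouping by key first restricts the pair enumeration to block-internal pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleKeyBlocking {
    public static void main(String[] args) {
        // Entity id -> its single blocking key (illustrative values).
        Map<String, String> keyOf = new HashMap<>();
        keyOf.put("E1", "phone"); keyOf.put("E2", "phone");
        keyOf.put("E3", "laptop"); keyOf.put("E4", "phone");

        // Blocking step: group entity ids by blocking key.
        Map<String, List<String>> blocks = new HashMap<>();
        keyOf.forEach((id, k) ->
                blocks.computeIfAbsent(k, x -> new ArrayList<>()).add(id));

        // Matching step: only pairs inside the same block are candidates,
        // instead of all 2-combinations of the whole dataset.
        for (List<String> block : blocks.values())
            for (int i = 0; i < block.size(); i++)
                for (int j = i + 1; j < block.size(); j++)
                    System.out.println("candidate pair: (" + block.get(i)
                            + ", " + block.get(j) + ")");
    }
}
```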
The ER process performed by a single machine suffers from the vast amount of increasing data, and the total number of similarity computations for the potential pairs becomes very large: the millions of publication records in the CiteSeerX collection, for instance, require billions of similarity computations. The problem is exacerbated if the data size accumulates to terabytes or petabytes. Thus, it is necessary to devise a blocking-based ER algorithm using the MapReduce framework (Dean and Ghemawat 2004) on an open architecture like Hadoop (Hadoop).

Although blocking-based ER using the MapReduce framework may solve the problem of huge dataset sizes, the process may still suffer from imbalanced key distributions. Commonly, the map phase and the reduce phase are the two core processes of a MapReduce program. An input record takes the form of a (key, value) pair, and the intermediate keys may come from a different domain. For convenience, the process responsible for the map function is referred to as a mapper, and that for the reduce function as a reducer. A mapper reads one block of (key, value) pairs and produces a list of intermediate (key', value') pairs after processing. Every (key', value') pair having the same key' is sent to the same reducer. A reducer accepts an intermediate key with the corresponding list of values, processes them, and generates the output. A default hash partitioning is used in the MapReduce framework to distribute the intermediate (key', value') pairs to the reducers. Each reducer might be responsible for roughly 100 (key', value') pairs when there are 10000 keys and 100 reducers. If certain keys are more popular than the others, however, a large number of (key', value') pairs will be sent to some reducer, so that the ER task cannot finish until this reducer accomplishes its matching. In fact, the bottleneck of the ER process is the reducer having the largest number of assigned (key', value') pairs. To our knowledge, few studies have taken the load-balance problem among reducers into account, especially for the ER problem. The efficiency of the whole ER process can be remarkably improved when the working loads of the reducers in the matching step are balanced.

Therefore, in this paper, we propose a MapReduce algorithm to solve the ER problem for a huge collection of entities with multiple keys. The algorithm features combination-based blocking and load-balanced matching. The combination-based blocking utilizes the multiple keys to filter out unnecessary pairs. The load-balanced matching evenly distributes the required similarity computations to all the reducers in the matching step. Our experiments using CiteSeerX show that the proposed algorithm is efficient and scales up linearly with respect to the dataset size.

The rest of the paper is organized as follows. Section 2 defines our problem. Related works are briefly reviewed in Section 3. Section 4 presents the proposed algorithm. The experimental results are given in Section 5. Section 6 concludes this study.

2 Problem Definition

A dataset D = {E_1, E_2, …, E_m} contains m entities, where an entity E_i (1 ≤ i ≤ m) has |E_i| blocking keys. The blocking keys of E_i are represented by B_Ei = {k_1, k_2, …, k_|Ei|}, with |E_i| > 1. Let the minimum number of blocking keys of the entities in D be kc, kc > 1. Given a similarity measure and a similarity threshold θ, the ER problem is to find all the similar pairs in D. A pair (E_i, E_j) is considered similar if sim(E_i, E_j) ≥ θ, where sim(E_i, E_j) is the similarity value between E_i and E_j computed using the similarity measure. In particular, the characteristics of blocking keys imply, in our problem, that sim(E_i, E_j) < θ if |B_Ei ∩ B_Ej| < kc. That is, E_i and E_j cannot be similar if the number of common blocking keys between them is less than kc. Accordingly, the similarity computation sim(E_i, E_j) can be eliminated from the ER process if |B_Ei ∩ B_Ej| < kc. A typical similarity measure between E_i and E_j is the Jaccard similarity coefficient, defined as the size of their intersection divided by the size of their union. Assuming the entities are text research articles, the Jaccard similarity coefficient can be calculated as the number of common terms divided by the total number of terms in the two articles. The blocking keys of the entities can be title words or keywords of the articles.
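As a concrete illustration, here is a minimal sketch of the kc pre-filter and the Jaccard coefficient; the method names are ours, and for brevity the blocking-key sets double as the term sets.

```java
import java.util.HashSet;
import java.util.Set;

public class KcFilteredJaccard {
    // Jaccard coefficient: |A ∩ B| / |A ∪ B| over the entities' term sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // sim(Ei, Ej) is evaluated only when |B_Ei ∩ B_Ej| >= kc.
    static boolean similar(Set<String> bEi, Set<String> bEj, int kc, double theta) {
        Set<String> common = new HashSet<>(bEi);
        common.retainAll(bEj);
        if (common.size() < kc)
            return false;                // pruned: the pair cannot be similar
        return jaccard(bEi, bEj) >= theta;
    }

    public static void main(String[] args) {
        Set<String> b1 = Set.of("c", "d", "e");
        Set<String> b2 = Set.of("c", "e");
        Set<String> b3 = Set.of("d", "e");
        System.out.println(similar(b1, b2, 2, 0.5));  // shares {c, e}: evaluated
        System.out.println(similar(b2, b3, 2, 0.5));  // shares only {e}: pruned
    }
}
```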
For example, given D = {E_1, E_2, …, E_100} and kc = 2, we have 100 entities and the minimum number of blocking keys of the entities in D is 2. Assume E_1 has blocking keys B_E1 = {c, d, e}, E_2 has blocking keys B_E2 = {c, e}, and E_3 has blocking keys B_E3 = {d, e}, etc. Then |E_1| = 3, |E_2| = 2, and |E_3| = 2; B_E1 ∩ B_E2 = {c, e}, B_E1 ∩ B_E3 = {d, e}, and B_E2 ∩ B_E3 = {e}. The similarity computations need to be performed on both (E_1, E_2) and (E_1, E_3), but not on (E_2, E_3), since |B_E2 ∩ B_E3| = 1 < kc. If sim(E_1, E_2) ≥ θ and sim(E_1, E_3) < θ for the similarity threshold θ, then (E_1, E_2) is a similar pair and is inserted into the answer set.

Note that this ER problem definition is the same as in other work. Nevertheless, our problem emphasizes the variable number of blocking keys in entities: only entities having a certain number of common keys are potentially similar.

3 Related Works

Although the ER problem is a long-existing problem, most solutions are not designed for extremely large collections of entities. Some algorithms have been presented for document-similarity computations (Baraglia et al. 2010) and for blocking-based ER under the MapReduce framework (Kiefer et al. 2010; Kim and Shim 2012; Lu et al. 2012; Metwally and Faloutsos 2012; Vernica et al. 2010; Zhang et al. 2010). These works assume that there is only one key per entity and use a map/reduce phase to handle the problem: a map function for the blocking step, followed by a reduce function for the matching step. Some of the notable works include sorted neighborhood (Kolb et al. 2011a) and load-balanced ER (Kolb et al. 2012a; Kolb et al. 2012b). Sorted neighborhood is a blocking technique that sorts all entities according to their blocking keys, assigns a window of size w, and compares the entities within the window while sliding it, so that entities with similar but not identical blocking keys may also be compared in the similarity evaluations. In (Kolb et al. 2011a; Kolb et al. 2011b), sorted neighborhood is implemented under the MapReduce framework, where entities crossing the partition boundaries have to be duplicated to avoid missing potential pairs (Kolb et al. 2011b).
Both JobSN, using two MapReduce phases, and RepSN, using one MapReduce phase, are presented. However, the problem of imbalanced loads among the reducers in MapReduce is not addressed. BlockSplit and PairRange (Kolb et al. 2012b) consider the load-balancing problem using the MapReduce framework for single-key ER. A block distribution matrix needs to be distributed using the distributed-cache technique in MapReduce so that the entities can be evenly distributed to all the reducers. PairRange outperforms BlockSplit because PairRange guarantees that all reducers receive the same number of entities (Kolb et al. 2012b).

The study in (Kolb et al. 2011a) indicates that multiple keys may exist in an entity. However, the multiple keys are treated independently, like several single keys, so that duplicate distributions appear in the blocking step. MultiRepSN (Kolb et al. 2011a) is proposed to handle entities with multiple blocking keys. Although the reducer loads are balanced, the algorithm may suffer from a large number of duplicated pairs. Thus, the study in (Kolb et al. 2013) presents an algorithm for ER with redundancy-free matching. The idea is to enumerate all candidate sets of entities with the index of the smallest common candidate set: the mapper emits (blocking keys, entity values) for the reducers, where the entity value includes the entity and its smallest common blocking keys. Redundancy-free matching (Kolb et al. 2013; Kolb and Rahm 2013) may thus decrease the number of duplicate pairs. However, when a reducer receives an overwhelming number of entities, its execution time can be long. Therefore, the issue of load-balancing among reducers must be considered for an ER solution under the MapReduce framework with multiple keys.

4 The Proposed Algorithm

4.1 Overview of the Algorithm

The proposed algorithm for blocking-based ER utilizes the multiple blocking keys in each entity for an improved entity distribution in the blocking step. The distribution is more precise because a pair may exist in a block only when the number of common blocking keys between the pair reaches a certain threshold (i.e., kc). Because an entity may have more than kc keys, all the combinations of kc keys need to be generated for the potential key comparisons. This entity distribution procedure is called combination-based blocking in the proposed algorithm. After the blocking step, we aim to balance the working load among all the reducers for the similarity computations in the matching step. The idea is to first obtain statistics on the total number of computations required for all the blocks, and then evenly partition the computations among all the reducers, to avoid the potential overload of a reducer due to skewed key distributions. This procedure is called load-balanced matching in the proposed algorithm. Both combination-based blocking and load-balanced matching are designed using the MapReduce framework. Therefore, the algorithm comprises two map/reduce phases: a map/reduce phase for combination-based blocking, followed by a map/reduce phase for load-balanced matching. An overview of the proposed algorithm is shown in Fig. 1.
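In Hadoop terms, the two phases can be realized as two chained jobs. The following driver is a hedged sketch: the configuration key "er.kc", the intermediate path, and the mapper/reducer/partitioner class names are our assumptions (the classes themselves are sketched after Figs. 2 and 4 below).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ErDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("er.kc", 2);   // minimum number of common blocking keys

        // Phase 1: combination-based blocking.
        Job blocking = Job.getInstance(conf, "combination-based blocking");
        blocking.setJarByClass(ErDriver.class);
        blocking.setMapperClass(CombinationBlocking.BlockingMapper.class);
        blocking.setReducerClass(CombinationBlocking.BlockingReducer.class);
        blocking.setOutputKeyClass(Text.class);
        blocking.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(blocking, new Path(args[0]));
        FileOutputFormat.setOutputPath(blocking, new Path("tmp/entity-lists"));
        if (!blocking.waitForCompletion(true)) System.exit(1);

        // Phase 2: load-balanced matching over the emitted entity lists.
        Job matching = Job.getInstance(conf, "load-balanced matching");
        matching.setJarByClass(ErDriver.class);
        matching.setMapperClass(LoadBalancedMatching.MatchingMapper.class);
        matching.setPartitionerClass(LoadBalancedMatching.RoundRobinPartitioner.class);
        matching.setMapOutputKeyClass(IntWritable.class);
        matching.setMapOutputValueClass(Text.class);
        matching.setOutputKeyClass(Text.class);
        matching.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(matching, new Path("tmp/entity-lists"));
        FileOutputFormat.setOutputPath(matching, new Path(args[1]));
        System.exit(matching.waitForCompletion(true) ? 0 : 1);
    }
}
```

No reducer class is set for the matching job in this sketch, so the identity reducer merely collects the dealt-out pairs; the similarity test of the reduce function in Fig. 4 would be plugged in as that job's reducer.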
Fig. 1: An overview of the proposed algorithm. The first map/reduce phase (combination-based blocking) generates the potential key combinations and distributes the entity lists having common keys; the second map/reduce phase (load-balanced matching) generates the entity-pair combinations and evenly distributes the similarity computations to produce the matching result.

Map(key = E_i, value = B_Ei)
Input: P = a partition of entities in D; kc = minimum number of common keys.
1. foreach E_i ∈ P do
2.   foreach K_b = (a_1, a_2, …, a_kc), where a_p, a_q ∈ B_Ei and a_p ≠ a_q for 1 ≤ p, q ≤ kc
3.     output <K_b, i>   /* K_b is a kc-combination from B_Ei; i is the id of entity E_i */
4.   end
5. end

Reduce(key = K_b, value = list of entity ids)
1. foreach key K_b do   /* initially entity_list = ∅ */
2.   foreach entity id v in the list of values do
3.     add v to entity_list
4.   end
5.   if entity_list.count ≥ 2
6.     output <K_b, entity_list>
7.   endif
8. end

Fig. 2: Functions for combination-based blocking

Fig. 3: An example for combination-based blocking (nine entities distributed over three mappers and two reducers; the mappers emit the kc-combinations of each entity's blocking keys, and the reducers collect the entity list for each combination)

4.2 Combination-based Blocking

Figure 2 shows the map/reduce functions in the proposed combination-based blocking procedure. A mapper receives its partition of the entities in the dataset. Each input record is an entity and its associated blocking keys. The map function generates all the kc-combinations from the set of blocking keys.
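A hedged Hadoop sketch of the Fig. 2 functions follows; the input line format "id&lt;TAB&gt;key,key,..." and the class names are our assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinationBlocking {

    /** Emits every kc-combination of an entity's blocking keys with its id. */
    public static class BlockingMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private int kc;

        @Override
        protected void setup(Context ctx) {
            kc = ctx.getConfiguration().getInt("er.kc", 2);
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            String id = parts[0];
            String[] keys = parts[1].split(",");
            Arrays.sort(keys);                       // canonical order for K_b
            emit(keys, 0, new ArrayList<>(), id, ctx);
        }

        // Recursive kc-combination enumeration over the sorted key array.
        private void emit(String[] keys, int from, List<String> picked,
                          String id, Context ctx)
                throws IOException, InterruptedException {
            if (picked.size() == kc) {
                ctx.write(new Text(String.join(",", picked)), new Text(id));
                return;
            }
            for (int i = from; i <= keys.length - (kc - picked.size()); i++) {
                picked.add(keys[i]);
                emit(keys, i + 1, picked, id, ctx);
                picked.remove(picked.size() - 1);
            }
        }
    }

    /** Keeps only combinations shared by at least two entities. */
    public static class BlockingReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text kb, Iterable<Text> ids, Context ctx)
                throws IOException, InterruptedException {
            List<String> entityList = new ArrayList<>();
            for (Text id : ids) entityList.add(id.toString());
            if (entityList.size() >= 2)              // singleton lists are dropped
                ctx.write(kb, new Text(String.join(",", entityList)));
        }
    }
}
```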
Each combination is output together with the id of the entity. For example, <(a,c), 1>, <(a,e), 1>, and <(c,e), 1> are output for entity E_1 with B_E1 = {a, c, e} when kc = 2; when kc = 3, <(a,c,e), 2> is output for E_2 with B_E2 = {a, c, e}, and <(a,c,d), 3>, <(a,c,e), 3>, <(a,d,e), 3>, and <(c,d,e), 3> are output for E_3 with B_E3 = {a, c, d, e}. In fact, E_2 and E_3 need to be compared, since they have three common blocking keys when kc = 3. Thus, all the kc-combinations from B_Ei need to be generated for each entity E_i.

The reduce function of the procedure then determines the entities having at least kc common keys, to form a potential list for the matching step. The reducer receives a kc-combination as the key, together with the list of ids of the entities having that combination. If the list contains only one entity for a certain key combination, that entity is obviously eliminated from the matching step. Therefore, we count the number of entities in the entity_list (entity_list.count) while adding the entities of the same K_b to the associated entity_list. Only an entity_list having two or more entities will be sent to load-balanced matching.

An example of the map/reduce functions with nine entities, three mappers, two reducers, and kc = 2 is displayed in Fig. 3. Mapper 1 generates <(a,c), ·>, <(a,d), ·>, <(c,d), ·>, and so on; mapper 2 generates <(a,e), ·>, and so forth. A simple hash partitioning is adopted for the reducers. Assume reducer 1 accepts the keys (a,e), (c,e), and so on. Each entity_list collecting the entities of the same key combination is then sent to the next procedure, load-balanced matching.

4.3 Load-balanced Matching

The matching step has to compute similarities for the pairs of all the 2-combinations of each entity list generated by the blocking step. A common implementation of the matching procedure uses the default hash partitioning to pass the intermediate keys to the reducers. Consequently, a reducer receiving long entity lists suffers from a vast number of similarity computations. A skewed key distribution likewise leads to imbalanced workloads for the reducers. Hence, the matching step cannot finish until the reducer with the heaviest workload completes. Take the entity_lists in Fig. 3 for example: one reducer produces seven pairs for similarity computations while the other produces only a few. The blocking-based ER can be completed earlier, without unnecessary waiting, if the workloads are balanced among the reducers in the matching step.

Figure 4 shows the map/reduce functions in the proposed load-balanced matching procedure. Note that the partition function, which replaces the default hash partitioning, is particularly designed to evenly distribute the total number of comparisons over the total number of reducers (i.e., numReduceTasks). In order to average the workloads, we use a running pair number (denoted pno) to count and mark each input record; the total number of pairs for matching is then obtained from this counter. Note that we prune the duplicate pair that appears under both keys (a,e) and (c,e) here. The mapper generates (pno, pair) as the (key, value) for the reducers, but applies the designated partitioning for load balancing. Entity pairs are sent to all the reducers round robin, as shown in the partition function in Fig. 4. The reducers then output the similar pairs whose similarity value is no less than the similarity threshold. Figure 5 shows an example of load-balanced matching with two mappers and two reducers; the matching receives the entity lists of Fig. 3.
A mapper generates the <pno, pair> records by enumerating all the 2-combinations of an entity list, with the pair number pno increased by 1 for each combination. The entity list of four entities from Fig. 3 produces six pairs, so six <pno, pair> records are generated by mapper 1; three more are generated by mapper 2. The partitioning determines which reducer receives each <pno, pair> record using the designed partition function. Because the total number of reducers is two, a <pno, pair> record with an odd pno is processed by reducer 1, and one with an even pno by reducer 2. Assume that the similarity values of four of the nine pairs are less than the similarity threshold; the matched result then contains the remaining five similar pairs. If we configure the MapReduce job with more reducers, the computations will still be evenly distributed to the reducers.

Map(key = K_b, value = entity list E_i_list)
1. foreach entity_list E_i_list do
2.   generate all 2-combination pairs EPS = { (E_p, E_q) | E_p, E_q ∈ E_i_list and E_p ≠ E_q }
3.   foreach (E_p, E_q) in EPS do   /* initially pno = 0 */
4.     pno++;
5.     output <pno, (E_p, E_q)>
6.   end
7. end

Partition(key = pno, value = (E_p, E_q))
/* numReduceTasks is the number of reducers, defined by the MapReduce configuration */
return (pno.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

Reduce(key = pno, value = (E_p, E_q))
1. foreach key pno do
2.   foreach value (E_p, E_q) in its value list do
3.     output (E_p, E_q) if sim(E_p, E_q) ≥ θ
4.   end
5. end

Fig. 4: Functions for load-balanced matching

Fig. 5: An example for load-balanced matching (the entity lists from Fig. 3 are enumerated into numbered pairs, dealt to the two reducers by the partition function, and the pairs passing the similarity threshold are output)
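A hedged Hadoop sketch of the Fig. 4 map and partition functions follows; the input format and class names are our assumptions, while the partition formula is the one given above. The duplicate-pair pruning described earlier is omitted here for brevity.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

public class LoadBalancedMatching {

    /** Numbers each emitted entity pair with a running counter (pno). */
    public static class MatchingMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private int pno = 0;   // running pair number, kept across input records

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Input line: "K_b<TAB>E1,E2,E3,..." as produced by the blocking job.
            String[] ids = line.toString().split("\t")[1].split(",");
            for (int i = 0; i < ids.length; i++)        // all 2-combinations
                for (int j = i + 1; j < ids.length; j++)
                    ctx.write(new IntWritable(++pno),
                              new Text(ids[i] + "," + ids[j]));
        }
    }

    /**
     * The partition function of Fig. 4: because IntWritable.hashCode() is the
     * wrapped value itself, consecutive pair numbers are dealt to the reducers
     * round robin, instead of by (possibly skewed) block keys.
     */
    public static class RoundRobinPartitioner
            extends Partitioner<IntWritable, Text> {
        @Override
        public int getPartition(IntWritable pno, Text pair, int numReduceTasks) {
            return (pno.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }
}
```

Unlike the default hash partitioning over block keys, this deals the numbered pairs out one by one, so every reducer receives nearly the same number of similarity computations.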
5 Experimental Results

Extensive experiments were conducted to assess the performance of the proposed algorithm. The experiments were performed on a cluster containing three types of nodes, as shown in Table 1. All nodes run Ubuntu, Hadoop, and Java. The real dataset CiteSeerX (CiteSeerX) was used in the experiments. CiteSeerX contains millions of publication records, several gigabytes in total size. Each publication record includes a record id, title, keywords, abstract, and URL information. The record id and the title were extracted and used as input because some records have no keywords. The stop-words in the titles were removed, and the remaining title words were used as keywords, which serve as the blocking keys. A record has several keys on average. Very few records contain only one key; these were skipped in the ER process, since the minimum number of common keys kc is greater than one in our problem definition.

Table 1: Experimental environments (three node configurations, two with Intel Core i-series CPUs and one with an Intel E-series CPU, differing in clock rate, core count, memory size, and SATA disk capacity)

As in the experiments of most studies on ER with MapReduce (Kolb et al. 2011a; Kolb et al. 2012a; Kolb and Rahm 2013), the execution time of comparing the entities is not included in the results. That is, in the following, the execution time accounts for the total time needed to find the set of all the potentially similar pairs. In practice, after discovering these pairs, the complete ER process has to perform the similarity computations, such as computing the Jaccard coefficients, for all these pairs. Clearly, a smaller number of pairs leads to a faster matching result.
When the minimum number of common blocking keys (kc) is larger, the number of resulting pairs is smaller. Nonetheless, a larger kc often means that the total number of generated key combinations is larger. Table 2 shows the numbers of generated combinations and resulting pairs for increasing settings of kc: the number of combinations grows with kc and peaks at one of the larger settings, while the number of pairs that require further matching decreases rapidly as kc increases.

Table 2: Combinations and resulting pairs vs. the minimum number of common blocking keys (kc)

Figure 6 shows the total execution time with respect to the number of common blocking keys (kc), with the numbers of mappers (M) and reducers (R) fixed. The total execution times of both combination-based blocking (legend "blocking") and load-balanced matching (legend "load-balanced") are displayed. Both times increase as kc increases; as mentioned above, however, the number of resulting pairs decreases rapidly as kc increases, so the real matching time in fact decreases sharply.

Fig. 6: Performance on varying the number of blocking keys

Table 3 indicates that the maximum load of the reducers decreases markedly after applying load-balanced matching; the data size handled before applying the proposed load-balanced matching is several times that after applying it. Without proper balancing of the reducers, many reducers would have to wait for the maximum-loaded reducer to complete its job: before applying the balanced matching, some reducers worked on nearly zero-sized data while others were heavily loaded, and the overall process cannot finish until this heavily loaded reducer completes. Figure 7 depicts that the workloads can be effectively balanced, so that the maximum load of the reducers decreases for all the settings of kc.
Table 3: Changes of the maximum load of the reducers, before and after applying load-balanced matching, for each setting of kc

Next, the number of reducers was varied to evaluate its effect on the total execution time. Figure 8 shows that the total execution time decreases as the number of reducers increases; as expected, more reducers speed up the process. Note that this experiment uses an extremely large dataset, built by replicating CiteSeerX five times, so that the differences in execution times can be exposed. When the input data is not very big, a MapReduce configuration with many reducers may actually cause extra communication among the reducers; the dataset was therefore enlarged to highlight the effect of the number of reducers.

Fig. 7: Effects of load-balanced matching

Fig. 8: Performance on varying the number of reducers

The experimental results for the scalability of the proposed algorithm are shown in Fig. 9, where the CiteSeerX dataset is replicated 5 and 10 times, respectively. In Fig. 9, the total execution time increases linearly as the dataset size increases.

Fig. 9: Scalability of the proposed algorithm

6 Conclusion

In this paper, we propose an algorithm to solve the entity resolution problem for big data analytics using the MapReduce framework. The characteristic of multiple keys in entities is presented and utilized for effective key blocking (entity distribution) to improve the entity resolution process. The proposed algorithm features combination-based blocking and load-balanced matching. The combination-based blocking produces the potential key combinations and distributes the entity lists having common keys. The load-balanced matching generates the entity-pair combinations and balances the workloads of the similarity computations among all the reducers in the MapReduce configuration. The experimental results show that the proposed algorithm efficiently solves the entity resolution problem. Future work may extend this research to the entity resolution problem for the R-S join, such as a join between DBLP and CiteSeerX.

Acknowledgement

The authors are grateful for the comments of the reviewers for improving the quality of the paper. This study is supported partly by the National Science Council, Republic of China, under grant no. NSC0-0-H--00.

References

Baraglia, R., De Francisci Morales, G., and Lucchese, C. (2010): Document similarity self-join with MapReduce. Proc. IEEE International Conference on Data Mining (ICDM).

Brizan, D. G. and Tansel, A. U. (2006): A survey of entity resolution and record linkage methodologies. Communications of the International Information Management Association 6(3).

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006): A primitive operator for similarity joins in data cleaning. Proc. International Conference on Data Engineering (ICDE).
Dean, J. and Ghemawat, S. (2004): MapReduce: simplified data processing on large clusters. Proc. Symposium on Operating Systems Design and Implementation (OSDI).

Elsayed, T., Lin, J., and Oard, D. W. (2008): Pairwise document similarity in large collections with MapReduce. Proc. Association for Computational Linguistics on Human Language Technologies (ACL-HLT).

Hernández, M. A. and Stolfo, S. J. (1995): The merge/purge problem for large databases. Proc. ACM SIGMOD International Conference on Management of Data.

Kiefer, T., Volk, P. B., and Lehner, W. (2010): Pairwise element computation with MapReduce. Proc. High Performance Distributed Computing (HPDC).

Kim, Y. and Shim, K. (2012): Parallel top-k similarity join algorithms using MapReduce. Proc. International Conference on Data Engineering (ICDE).

Kolb, L. and Rahm, E. (2013): Parallel entity resolution with Dedoop. Datenbank-Spektrum 13(1).

Kolb, L., Thor, A., and Rahm, E. (2013): Don't match twice: redundancy-free similarity computation with MapReduce. Proc. Workshop on Data Analytics in the Cloud (DanaC).

Kolb, L., Thor, A., and Rahm, E. (2012a): Dedoop: efficient deduplication with Hadoop. Proc. VLDB Endowment 5(12).

Kolb, L., Thor, A., and Rahm, E. (2012b): Load balancing for MapReduce-based entity resolution. Proc. International Conference on Data Engineering (ICDE).

Kolb, L., Thor, A., and Rahm, E. (2011a): Multi-pass sorted neighborhood blocking with MapReduce. Computer Science - Research and Development 27(1).

Kolb, L., Thor, A., and Rahm, E. (2011b): Parallel sorted neighborhood blocking with MapReduce. Proc. Database Systems for Business, Technology, and Web (BTW).

Lin, J. (2009): Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. Proc. ACM SIGIR Conference on Research and Development in Information Retrieval.

Lu, W., Shen, Y., Chen, S., and Ooi, B. C. (2012): Efficient processing of k nearest neighbor joins using MapReduce. Proc. VLDB Endowment 5(10).

Metwally, A. and Faloutsos, C. (2012): V-SMART-Join: a scalable MapReduce framework for all-pair similarity joins of multisets and vectors. Proc. VLDB Endowment 5(8).

Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2002): Identity uncertainty and citation matching. Proc. Neural Information Processing Systems (NIPS).

Vernica, R., Carey, M. J., and Li, C. (2010): Efficient parallel set-similarity joins using MapReduce. Proc. ACM SIGMOD International Conference on Management of Data.

Zhang, Q., Zhang, Y., Yu, H., and Huang, X. (2010): Efficient partial-duplicate detection based on sequence matching. Proc. ACM SIGIR Conference on Research and Development in Information Retrieval.

CiteSeerX dataset URL.

Hadoop URL.
