Record Linkage in an Hadoop Environment
Undergraduate Research Opportunity Program (UROP) Project Report

Record Linkage in an Hadoop Environment

By Huang Yipeng (U096926R)

Department of Computer Science
School of Computing
National University of Singapore
2010/11

Project No: U
Advisor: A/P Min-Yen Kan, Ng Jun Ping

Deliverables:
Report: 1 Volume
Source Code: 1 DVD
Abstract

The recent development of the MapReduce paradigm has lessened the difficulty of distributed computing, providing users with a black box capable of handling many of the difficulties of parallel computing. However, little work has been published on MapReduce-based record linkage. We study how the generic MapReduce framework can be tailored for record linkage problems. In particular, we note that blocking-based parallelism of record linkage problems is hampered by uneven partitioning. We introduce a partitioning solution that dynamically balances the record comparison workload and distributes it using subset replication and match-based parallelism to spread out data skew. Our evaluation shows that our solution consistently outperforms the baseline for sufficiently skewed distributions.

Subject Descriptors:
H.2.4 [Database Management]: Systems - concurrency, parallel databases
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords: Record Linkage, MapReduce, Performance

Implementation Software and Hardware: CentOS, Java, Hadoop
Acknowledgement

I would like to thank my supervisors A/P Kan Min-Yen and Mr Ng Jun Ping, and also acknowledge the rest of the WING group for their kind feedback.
List of Figures

3.1 Relation of Record Linkage to MapReduce stages
3.2 A De-duplication Example
3.3 Balanced partitions / Uneven task workloads
3.4 Improved task partitioning (First Pass)
3.5 Improved task partitioning (Second Pass)
3.6 Grid representation of remaining workload after the first pass
4.1 Reduce task running time after Hash partitioning
4.2 Frequency distribution of first names
List of Tables

4.1 Average runtime to completion in seconds (Dataset A)
4.2 Average runtime to completion in seconds (Dataset B)
4.3 Average runtime to completion in seconds (Dataset C)
Table of Contents
Chapter 1 Introduction

Databases contain a trove of information that helps us to understand specific domains or world knowledge, particularly when information is merged across silos. The best observations are made when analysis is performed on actual data. Data derived from different databases may not be mutually exclusive and must be cleaned to prevent erroneous results. Thus there is a need to link records across databases to identify similar records and duplicates. Record linkage (also known as record matching, duplicate record detection, name disambiguation, data cleaning, merge-purge, entity resolution, object matching, and approximate text join) is the process of making connections between similar records.

Along with multiple databases residing on multiple computers comes the possibility of distributed computation for more efficient record linkage, helping to manage the inefficiencies of processing large datasets. The idea has previously been examined (Kim & Lee, 2007; Kawai, Garcia-Molina, Benjelloun, & Menestrina, 2006; Christen, Churches, & Hegland, 2004) but has drawn little attention from academia in the last couple of years. One possible explanation is that the difficulty of distributed computing has made it, until recently, more trouble than it was worth. These difficulties include efficient data distribution, fault tolerance, dynamic load balancing, portability and scalability (Christen et al., 2004). Thankfully, the MapReduce paradigm has rescued its users from many of the above-mentioned concerns. Additionally, MapReduce has enjoyed popularity since its introduction, leading to the creation of open source implementations such as Hadoop (Borthakur, 2007). This suggests that the time may be ripe to revisit parallel record linkage.
In this paper, we study how the generic MapReduce framework can be tailored for record linkage problems. In particular, we note that effective parallelism of record linkage problems is hampered by uneven partitioning. Suppose, for example, that we are given two machines and the following records for de-duplication: {(Emily, ..., 1), (Emily, ..., 2), (Joshua, ..., 1), ..., (Joshua, ..., n)}. By selecting an appropriate key that is shared by duplicate records, we can reduce the comparison space. For example, if the name field is selected as the key, we avoid comparing Joshua records with Emily records. Consequently, records are split into the following partitions:

machine 1: {(Emily, ..., 1), (Emily, ..., 2)}
machine 2: {(Joshua, ..., 1), ..., (Joshua, ..., n)}

This occurs because MapReduce assigns records with the same key to the same partition (property 1) to be compared, and creates non-overlapping partitions. Skewed workloads follow when one key accounts for far more records than the others, as with the n Joshua records above.
4) We propose an overall load balancing solution for record linkage problems that combines the RLP, subset replication, match-based parallelism, and a merging phase. It reduces the absolute runtime on highly skewed datasets by increasing the utilization of the machines assigned to a problem.

This paper is organized as follows. In Section 2, we survey related work. In Section 3, we describe the RLP and its associated partitioning solution. In Section 4, we evaluate the overall effectiveness of our solution for record linkage problems. Finally, in Section 5, we discuss future directions before concluding in Section 6.
Chapter 2 Related Work

The scope of record linkage work is broad, so we will only highlight a selection of fundamental and current developments before diving into the application of record linkage to large datasets and explaining the weaknesses of existing approaches.

2.1 Identifying Record Linkages

To understand how and when linkages should be made, it is helpful to think of the linkage problem as a duplication problem. Records that match exactly should be linked and one considered a duplicate. When trying to identify duplicates, we perform a pairwise comparison of two records. Unfortunately, simply looking for exact matches may not weed out all duplicates; linkage must also consider inexact matches within a threshold of similarity. A further explanation of this can be found in (Mehrotra, 2003).

The need for inexact matching was first suggested by Newcombe (Newcombe et al., 1959), who linked marriage and birth records to better understand relationships within a human population. He noted two problems: 1) ambiguous matches, such as when different individuals share the same name, and 2) typographical errors preventing records from being associated with the same individual. The two problems can be thought of as the requirement for precision (identifying a linkage as a match) and recall (identifying all matching linkages). Correct and comprehensive identification are at the heart of record linkage problems because they determine the linkage's
usefulness. As a testament to Newcombe's perception, these problems are still being addressed today in the pairwise comparison and clustering stages (Jain et al., 1999) of record linkage.

Newcombe's linkage method was simple. He just subtracted the frequency of a record's agreement from the frequency of its disagreement. He based this on six fields (name, country, age, etc.), and noted that no individual field was reliable in connecting two records. The outcome of his experimentation was very positive: 98.3% were genuine linkages and only 0.7% were false positives. That said, his early success did not mean that the same quality could be expected on more generic datasets across the board. Newcombe did not explain how he selected the six fields used for comparison, and the dataset he used was probably one of high quality, since it related to the medical domain where mistakes and blank fields are understandably less prevalent.

Fellegi and Sunter, following Newcombe's intuitions, showed the optimality of the probabilistic decision rule and developed its formal mathematical model (Fellegi & Sunter, 1969). The next major development came from Winkler's adaptation of the Expectation-Maximization algorithm to estimate parameters, which tightened probabilistic decision rules (Winkler & Thibaudeau, 1991). In the last 50 years, an abundance of string comparators (Elmagarmid et al., 2007), which are used to identify linkages, have been developed, and these metrics have become highly accurate. More recently, approaches to record linkage include unsupervised graphical methods (Ravikumar & Cohen, 2004), supervised vector-space methods (Han et al., 2004), multiple passes (On et al., 2005), and the exploitation of internal (Taskar et al., 2003) and external (Tan et al., 2006) sources of information.

In his own experiments, Newcombe found record linkage to be reliable. Speed was a problem, and Newcombe looked expectantly to faster computers to improve the speed of record linkage. It seems that many of Newcombe's first intuitions still hold today: accuracy is less of a problem than efficiency. However, while our processing power has gone up, so has the size of datasets in general. For example, the Internet Archive website reports that it contains almost 2 petabytes of data and is growing at a rate of 20 terabytes per month. Having identified early and more current developments in record linkage, I will now describe its application to large datasets and explain why it is a problem. Readers looking for a more comprehensive survey of record linkage should read (Elmagarmid et al., 2007; Winkler, 2006).
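To make the probabilistic decision rule concrete, the sketch below shows a minimal, illustrative field-weighting scheme in the spirit of the Fellegi-Sunter model: each compared field adds an agreement weight of log2(m/u) or a disagreement weight of log2((1-m)/(1-u)), where m and u are the probabilities of agreement among true matches and among non-matches respectively. The fields, probabilities and threshold used here are assumptions for illustration, not values from any of the cited studies.

```java
// Minimal, illustrative Fellegi-Sunter style scoring. All field choices,
// probabilities and the decision threshold are assumed values.
public class MatchScorer {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Weight contributed by one compared field: log2(m/u) on agreement,
    // log2((1-m)/(1-u)) on disagreement.
    static double fieldWeight(boolean agrees, double m, double u) {
        return agrees ? log2(m / u) : log2((1 - m) / (1 - u));
    }

    public static void main(String[] args) {
        double score = 0.0;
        score += fieldWeight(true, 0.95, 0.01);   // first names agree
        score += fieldWeight(true, 0.90, 0.30);   // birth years agree
        score += fieldWeight(false, 0.85, 0.05);  // addresses disagree
        boolean match = score > 3.0;              // assumed decision threshold
        System.out.println("total weight = " + score + ", declared match = " + match);
    }
}
```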
2.2 Dealing with Large Data

Consider the de-duplication of a single list. The pairwise comparison in this record linkage task is an O(nC2), i.e. O(n^2), operation, where n is the total number of records. As n grows large, the comparison step becomes increasingly expensive, because each record must be compared against every other record. To alleviate this, less costly operations have been developed to reduce the size of n and bring down the runtime. These include blocking (Jaro, 1989), sorted neighborhoods (Hernandez & Stolfo, 1998), and canopies (McCallum, Nigam, & Ungar, 2000). These methods all work by reducing the problem search space. Blocking uses an approximate distance metric to create subsets and limits a record's comparisons to other records within its subset. Canopies extend blocks to allow overlapping subsets. Sorted neighborhoods first sort records by a key so that similar records have spatial locality, and then slide a scanning window over the sorted list, comparing only records that fall within the window.

These techniques have helped to reduce the algorithmic runtime of record linkage problems, but as datasets grow ever larger, efficiency remains a recurring problem. Other complementary techniques are needed. Since the above-mentioned techniques are not the focus of this paper, we will not compare them (see (Baxter, Christen, & Churches, 2003) for a comparative survey). Instead, we would like to point out some limitations that these techniques share. Firstly, they have been designed to run on a single machine. Consequently, even though extra machines may be available, they cannot be utilized; if such idle computational time were utilized, computational bottlenecks could be prevented elsewhere. Secondly, none of these techniques consider the possibility that data is stored in different locations, a situation that reflects the reality of databases today. Lastly, these techniques are not complementary methods of reducing runtime; they all belong to one category of techniques: blocking. We suggest parallelism as a complementary approach to the approaches described in this section.
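As an illustration of how blocking cuts down the comparison space, the following single-machine sketch groups records by an approximate key (the first name) and compares only within each group. The record layout and the equality-based similarity test are illustrative assumptions, not part of any of the systems cited above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-machine sketch of standard blocking: records are grouped by an
// approximate key (here the first name) and pairwise comparison is restricted
// to records sharing a block, shrinking the O(n^2) comparison space.
public class BlockingSketch {
    public static void main(String[] args) {
        String[][] records = {
            {"Emily", "Tan", "rec1"}, {"Emily", "Tan", "rec2"},
            {"Joshua", "Lim", "rec3"}, {"Joshua", "Lee", "rec4"}
        };

        // Build blocks keyed on the first-name field.
        Map<String, List<String[]>> blocks = new HashMap<>();
        for (String[] r : records) {
            blocks.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r);
        }

        // Compare pairs only within each block.
        for (List<String[]> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    if (block.get(i)[1].equals(block.get(j)[1])) {  // stand-in similarity test
                        System.out.println("candidate duplicates: "
                                + Arrays.toString(block.get(i)) + " ~ " + Arrays.toString(block.get(j)));
                    }
                }
            }
        }
    }
}
```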
2.3 Shift to Parallel Computing

Parallel computing is bounded by Amdahl's law (Amdahl, 1967) and constrained by synchronization and communication overhead (Flatt, 1989). Consequently, simply adding nodes to a task will not lower runtimes beyond a certain point. However, this does not mean that significant improvements cannot be made. The main objectives of parallelism are speedup (the reduction of algorithmic runtime) and scaleup (the increase in the work-capability-to-machine ratio) (DeWitt & Gray, 1992). A case in point is (Kim & Lee, 2007), which reports that their parallelized record linkage algorithms were able to run 6.55 times faster than sequential algorithms with multiple CPUs.

While parallel algorithms have been developed for many problems, very little literature seems to be devoted to parallel record linkage. Peter Christen was a front-runner who began to parallelize Febrl (Christen et al., 2004), a record linkage system. Febrl was parallelized with MPI and able to achieve near-linear improvement on costly pairwise comparison and classification operations (linear is ideal, see (DeWitt & Gray, 1992)), but its memory usage did not scale well as the dataset grew. Christen first introduced block-based parallelism in (Christen, 2002). The problem of scaling blocking-based parallelism (which distributes and processes different blocks in parallel) is essentially the data skew problem. Sadly, Febrl's parallel direction seems to have been discontinued. Retrospectively, this is not surprising, since (Christen et al., 2004) began its section on parallelism with a long list of parallel programming issues yet to be addressed, including efficient data distribution, fault tolerance, dynamic load balancing, portability and scalability.

P-Swoosh (Kawai et al., 2006) is a match-based parallel algorithm which divides comparison pairs over multiple machines. It is worth mentioning because it is a fairly comprehensive attempt to define a suitable architecture for parallelizing record linkage. P-Swoosh had a novel method of matching, dividing tasks between nodes based on matches. Records were first compared serially within a window, and distributed computation was utilized only on records that had already been identified as non-matching, with the hope of achieving a complete rather than approximate result. It showed a 2x improvement in speedup with the use of domain knowledge, but it is not without problems for practical use. P-Swoosh was implemented in an emulated shared-memory environment (machines share global memory), whereas shared-nothing (machines connected
by network) is currently the industry consensus for parallel database architecture. It will be interesting to see whether the experiments replicate well on real-world datasets and systems.

Our intuition is that applying the combination of blocking-based parallelism (to quickly reduce the comparison space) and match-based parallelism (to spread out data skew) will enable us to enjoy the best of both worlds and obtain a further reduction in runtime.

2.4 Why MapReduce?

Although none of (Christen et al., 2004; Kawai et al., 2006; Kim & Lee, 2007) used MapReduce for their experiments, we believe it is a viable solution for parallel record linkage because it abstracts away a significant amount of the complexity of the parallel programming paradigm that has prevented other parallel solutions from moving forward. There is also support from academia for harnessing MapReduce for record linkage. A detailed argument for MapReduce's suitability can be found in (Lin, 2008), and two research groups, from the University of Maryland and the University of California, Irvine, have published papers related to record linkage using the MapReduce programming model (Vernica et al., 2010a; Lin, 2009; Vernica et al., 2010b). (Vernica et al., 2010b) studied the performance of pairwise comparisons and achieved scaleup significantly better than (Christen et al., 2004). Another study found that running time increased linearly with collection size (Vernica et al., 2010a). Thus it appears that MapReduce offers a convenient model for scaling record linkage tasks. Most recently, (Kolb, 2010) parallelized sorted neighborhood blocking in MapReduce.

However, there is still some way to go before a set of record linkage methodologies compatible with MapReduce is comprehensively developed. (Vernica et al., 2010a) and (Lin, 2009) focused on document collections, but experimentation with other types of data is still absent. (Vernica et al., 2010a) and (Vernica et al., 2010b) have made substantial contributions by presenting an indexed approach and different join algorithms respectively, but experiments have yet to cover different types of record matching problems, e.g. clean vs. dirty lists and dirty vs. dirty lists, as in (Kim & Lee, 2007). (Kolb, 2010) also noted three problems of a typical record linkage workflow in MapReduce: 1) disjoint data partitioning, 2) load balancing, and 3) memory bottlenecks. While
(Kolb, 2010) focused on disjoint partitioning and (Vernica et al., 2010b) on memory bottlenecks, work that deals with the problem of balancing data skew in MapReduce-based record linkage is noticeably absent. Our work in this paper tackles problem 2, the load balancing of skewed distributions.
Chapter 3 Methodology

3.1 The MapReduce Programming Model

MapReduce is a shared-nothing framework for distributed computing best suited to large-scale data processing. All data must be processed as (key, value) pairs. MapReduce has two primary user-defined stages, Map and Reduce. Generally speaking, a problem is mapped out to multiple machines, partitioned into smaller units of work, and then reduced into a single solution. Programmatically, this is possible because the Map() and Reduce() functions are stateless and can be executed as individual tasks on different machines. There is also a Partition() function that is subsumed under the Map stage in most literature.

The first challenge for parallel record linkage is to fit record linkage stages to the MapReduce model. Our implementation follows Figure 3.1. We suggest performing data canonicalization (the standardization of data) as a form of preprocessing, simply because the Map stage provides a suitable platform for parallel data manipulation. However, Hadoop's ChainReducer class also provides users the ability to perform canonicalization later. The Partition stage is a suitable place to perform standard blocking because it outputs records as key-value pairs and channels records with the same key to the same Reduce task. To illustrate the synergy between record linkage and MapReduce, consider Figure 3.2. As with the previous and all future examples, blocking occurs on the first name.
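The sketch below shows one way the workflow of Figure 3.1 could be expressed with Hadoop's Java MapReduce API: the Mapper emits the blocking key (the first-name field), the framework's shuffle groups records by that key, and the Reducer performs pairwise comparison within each block. The comma-separated record layout and the exact-match comparison are simplifying assumptions, not our actual comparison logic.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Blocking-based de-duplication sketch: block key = first-name field.
public class DedupJob {

    public static class BlockingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String firstName = record.toString().split(",")[0];  // block key
            context.write(new Text(firstName), record);
        }
    }

    public static class PairwiseReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text blockKey, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            List<String> block = new ArrayList<>();
            for (Text r : records) block.add(r.toString());
            // Pairwise comparison within the block only.
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    if (block.get(i).equals(block.get(j))) {  // stand-in similarity test
                        context.write(blockKey, new Text(block.get(i) + " == " + block.get(j)));
                    }
                }
            }
        }
    }
}
```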
Figure 3.1: Relation of Record Linkage (left) to MapReduce (right) stages

Figure 3.2: A De-duplication Example
Four records are divided across two Map tasks and sorted on their first-name field (key). The output tuples of the Map stage are then shuffled to their respective Reduce tasks. This strategy is essentially blocking-based parallelism, since records with the same key must end up in the same block. The solution set is simply the matching records from pairwise comparisons within the same Reduce. In this case, only two comparisons (one in each Reduce) are needed, and the output is returned de-duplicated.

3.2 Parallelizing Data Skew

Our goal is to ensure that task workloads are fairly even despite data skew. When the distribution of records is uniform, Hadoop's default partitioner assigns a balanced workload to each (Reduce) task. However, when the distribution is skewed, it fails to do so. This occurs because the partitioner hashes on Hash(Key) mod M, where M is the number of machines. It follows that records with the same hash unavoidably end up in the same partition. This means that if there is a significantly larger number of records with, for example, Joshua as the first name than any other first name, the workload handled by the task assigned to process the Joshua partition will be greater than that of other tasks with the same number of records, and can even be greater than the combined workload of many other tasks because of pairwise comparison's nC2 rate of growth. This is illustrated by Figure 3.3. The two columns on the left represent the distribution of records, and the two columns on the right are their corresponding workloads in terms of comparisons. Within a partition, its task categorizes records into blocks based on their block key. Hence 7 Joshua records belong to one block, 2 Emily records to another.
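For reference, the default behaviour described above corresponds to a partitioner of the following form. This is a minimal sketch mirroring Hadoop's built-in HashPartitioner; the Text key and value types are assumptions for illustration.

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash(Key) mod M: every record sharing a block key lands in the same Reduce
// partition, regardless of how many records share that key, which is exactly
// why a heavily repeated key such as "Joshua" produces a skewed task.
public class HashBlockPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```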
Figure 3.3: Balanced partitions but uneven task workloads

Our intuition is that better performance can be obtained if some domain knowledge is applied to the record partitioning strategy. Our system instead partitions based on the number of record comparisons rather than the number of records, and splits any block whose comparison workload is
larger than the average workload per machine.

Figure 3.4: Improved task partitioning (First Pass)

Figure 3.5: Improved task partitioning (Second Pass)

First Pass (Balancing skew in blocking-based parallelism)

Our overall objective is to reduce the effects of data skew on runtime. In the first pass, we do this by balancing the number of comparisons given to each data partition in the RLP. Since each Reduce task is assigned a separate partition, this allows us to balance the computational workload of the tasks, thereby ensuring that tasks finish at approximately the same time. In our algorithm, the comparison workload of each distinct block is computed and assigned in an online fashion to the partition with the fewest comparisons assigned thus far. If the distribution of the dataset is known a priori (as in our case), it can be used to compute the comparison workload; where it is unavailable, it can be calculated by performing counts or sampling during pre-processing. Our system thus enables a more even record distribution among tasks.

When the comparison workload of a block exceeds the average comparison workload per machine, our algorithm divides it up and assigns it to the fewest number of machines required for each cut to fall below the average comparison workload. We compute the average comparison workload by adding up the total comparison workload and dividing it by the number of available nodes. The average comparison workload must be decremented after each cut, because comparisons only occur between records in the same block cut; comparisons between block cuts are left to the next pass, as in Figure 3.5.
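A minimal sketch of this first-pass assignment is given below, assuming block sizes are known up front. Blocks are assigned greedily to the least-loaded partition, and any block whose nC2 comparison workload exceeds the per-machine average is cut into the fewest pieces that each fall below it. The refinements of decrementing the average after each cut and restarting the assignment when a machine overflows are omitted for brevity, and the block sizes are illustrative.

```java
import java.util.Map;

// Greedy, comparison-aware first-pass balancer (simplified).
public class FirstPassBalancer {
    public static void main(String[] args) {
        Map<String, Integer> blockSizes = Map.of("Joshua", 7, "Emily", 2, "Sarah", 3);
        int machines = 2;

        long total = blockSizes.values().stream().mapToLong(FirstPassBalancer::comparisons).sum();
        long average = (long) Math.ceil(total / (double) machines);

        long[] load = new long[machines];
        for (Map.Entry<String, Integer> block : blockSizes.entrySet()) {
            long work = comparisons(block.getValue());
            if (work > average) {
                // Cut the block over the fewest machines needed; comparisons that
                // cross cut boundaries are deferred to the second pass.
                int cuts = (int) Math.ceil(work / (double) average);
                int cutSize = (int) Math.ceil(block.getValue() / (double) cuts);
                for (int c = 0; c < cuts; c++) {
                    assign(load, block.getKey() + "#" + c, comparisons(cutSize));
                }
            } else {
                assign(load, block.getKey(), work);
            }
        }
    }

    // nC2 pairwise comparisons for a block of n records.
    static long comparisons(int n) { return (long) n * (n - 1) / 2; }

    // Online assignment to the partition with the fewest comparisons so far.
    static void assign(long[] load, String name, long work) {
        int target = 0;
        for (int i = 1; i < load.length; i++) if (load[i] < load[target]) target = i;
        load[target] += work;
        System.out.println(name + " -> partition " + target + " (+" + work + " comparisons)");
    }
}
```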
In addition to ensuring that no individual block exceeds the average workload per machine, our system also ensures that the total workload assigned to each machine does not exceed the average workload at any point. The latter can occur because record comparisons are balanced in an online fashion. When this happens, we set the current average as the starting average and recurse from the beginning of the block assignment. Consequently, when larger blocks are broken into more partitions, more comparisons are deferred from the first pass, and an even assignment of record comparisons becomes possible. This method has the following properties: 1) it eventually stabilizes to an average workload that ensures a similar computational load, and thus a similar runtime, for each task; and 2) it restricts the amount of work in the next round, which handles cross-partition linkages, since it assigns block cuts to a minimal number of partitions.

Second Pass (Shifting to match-based parallelism)

The objective of the second pass is to ensure that cross-duplicates from different partitions are detected. However, falling back to the standard MapReduce workflow is not efficient. Our system reduces runtime through a divide-and-conquer approach that increases the utilization of available machines. We illustrate this with an example. Suppose only records labeled Joshua exceed the average workload in the first pass, and the Joshua records are divided into three partitions A, B, and C. Then in the second pass, only comparisons of Joshua records between A, B, and C need to be processed; all other intra-partition linkages were already handled by the first pass. The requisite comparisons can be represented with a grid, as in Figure 3.6. Note that each block that exceeds the average workload should be represented on a separate grid. We mark the diagonal with x to reflect that Joshua records within partitions A, B, and C have already been de-duplicated in the first pass. This is unlike regular match-based parallelism, which treats all squares on the grid equally. Additionally, since pairwise comparisons are symmetric, we can ignore the gray region. Hence, we only need to compare records in the white regions, i.e. {A, B}, {A, C} and {B, C}.
Figure 3.6: Grid representation of remaining workload after the first pass

The standard MapReduce workflow requires all Joshua records to be processed by one Reduce task because they share the same block key. With reference to Figure 3.6, the entire lower triangle would have to be processed by a single machine. The availability of other machines is inconsequential because no work can be assigned to them. Our system instead applies domain knowledge from the first pass to utilize the available machines. From the first pass, we know which keys (e.g. Joshua records) need further de-duplication and also which partitions they reside in. Hence, it is possible for us to supply as input only the records that still need to be compared and to assign a specific set of partition comparisons to each machine.
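The remaining work for one oversized block can therefore be enumerated as the unordered pairs of its first-pass cuts, as in the following sketch; each emitted pair can then be handed to a separate task in the second pass. The cut names are illustrative.

```java
import java.util.List;

// Enumerate the white cells of the grid in Figure 3.6 for one oversized block:
// the diagonal (intra-cut pairs) was handled in the first pass, and (X, Y) is
// the same comparison as (Y, X), so only pairs with i < j are emitted.
public class SecondPassPairs {
    public static void main(String[] args) {
        List<String> cuts = List.of("Joshua-A", "Joshua-B", "Joshua-C");
        for (int i = 0; i < cuts.size(); i++) {
            for (int j = i + 1; j < cuts.size(); j++) {
                System.out.println("compare cuts: {" + cuts.get(i) + ", " + cuts.get(j) + "}");
            }
        }
    }
}
```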
3.4 Theory of Utilization

As shown above, the proposed record linkage workflow can achieve better utilization than the typical hash-based MapReduce workflow when handling data skew. We believe that higher utilization will result in faster runtimes. In this section, we lay out our observations about the theoretical performance comparison between the two workflows. Suppose a very large workload is assigned to machine A and a negligible workload is assigned to machine B. After machine B completes its task, the average machine utilization for the job drops to 50%. If there are three machines, and A is assigned a very large workload while B and C are assigned negligible workloads, then after B and C complete their tasks, utilization drops to roughly 33%.
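The utilization figures above follow from treating average utilization as total busy machine-time divided by the number of machines times the job's makespan, as in this small sketch. The workloads are illustrative; the skewed cases reproduce the 50% and 33% figures.

```java
// Average utilization = (sum of task times) / (number of machines * makespan).
public class UtilizationSketch {
    static double utilization(double... taskTimes) {
        double makespan = 0, busy = 0;
        for (double t : taskTimes) {
            makespan = Math.max(makespan, t);
            busy += t;
        }
        return busy / (taskTimes.length * makespan);
    }

    public static void main(String[] args) {
        System.out.println(utilization(100, 0));      // ~0.50: one large task, one negligible
        System.out.println(utilization(100, 0, 0));   // ~0.33: one large task, two negligible
        System.out.println(utilization(50, 50, 50));  // 1.00: perfectly balanced tasks
    }
}
```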
Chapter 4 Evaluation

4.1 Experimental Evaluation

We now benchmark the performance of our RLP against the hash partitioning baseline that comes standard with Hadoop. Taking reference from earlier work in MapReduce-based record linkage (Lin, 2008; Vernica, Carey, & Li, 2010b), we measure performance in absolute running time.
All work reported in this paper used synthetic personal records from Febrl's data generator (Christen et al., 2004). However, we partially rewrote the generator to write records to disk at regular intervals instead of keeping them in main memory. This enabled us to generate gigabyte-sized datasets. To keep the de-duplication cases simple, we set the modification probabilities of the name fields (which we used to block and cluster duplicates) to zero, and set the number of duplicates at 10% of the originals. More details are described in the individual experiments below.

4.1.1 Hash Baseline

Our first experiment verifies that record linkage performance under data skew is a genuine problem in Hadoop. We generated a dataset of 5000 personal records, increasing the frequency of Joshua in the first-name field to simulate data skew. The generator labeled 3292 of the 5000 records with Joshua. We then ran a de-duplication job in Hadoop over this data.

Figure 4.1: Reduce task running time after Hash partitioning

Figure 4.1 shows the running time of all Reduce tasks in our experiment. As expected, the default partitioner was unable to assign even task workloads because it assigned all records with the same hash to the same block. Task 8 was assigned the Joshua key and processed all 3292 records. In contrast, the largest block that Task 9 was assigned had only 21 records.
4.1.2 Runtime Comparison Analysis

We compare our system against the hash baseline described in the previous section. Table 4.1 shows the results of running the duplicate detection job over 5 runs on a skewed dataset (Dataset A). Unlike the previous experiment, we created the skewed dataset by quadrupling the population size in the distribution (see Figure 4.2) instead of simply inflating a single frequency entry. This is a more realistic distribution because Febrl's frequency distribution resembles a power law.

Figure 4.2: Frequency distribution of first names

Table 4.1: Average runtime to completion in seconds (Dataset A: skewed by quadrupling population size; Hash Baseline vs. Our System on 2, 3 and 4 machines)

Empirically, we see that our RLP system is able to obtain a superior running time by balancing the record comparisons on 2, 3 and 4 machines. Significantly, we notice that our system is able to keep the same runtime for this workload (better, in fact) while using fewer machines. A secondary observation from Table 4.1 is that our performance on 2 and 4 machines was better than our performance on 3 machines. Closer analysis revealed that runs on 2 and 4 machines did not incur the additional cost of merging duplicates, because the partitioner was able to balance the record comparison workload without dividing any block of records into more
than two partitions in the second pass. This lends support to our partitioner's design principle of not dividing blocks into more partitions than necessary (even when more machines are available).

We extrapolate that our system will continue to show superior performance with more machines, particularly as the size of the dataset grows. This is supported by the results in Table 4.2, which shows the results of running the same duplicate detection job over 5 runs on a larger dataset (Dataset B). While our performance on this workload is comparable to our performance on Dataset A, the performance of the hash partitioner worsens significantly.

Table 4.2: Average runtime to completion in seconds (Dataset B: skewed by quadrupling population size; Hash Baseline vs. Our System on 2, 3 and 4 machines)

The RLP's balancing of record comparisons has the additional benefit of reducing the amount of memory required by any one machine. This explains why our system is able to complete jobs that the hash partitioner cannot handle, as exhibited in Table 4.3. However, due to time constraints, we do not have a memory benchmark available.

Table 4.3: Average runtime to completion in seconds (Dataset C: skewed by quadrupling population size; on 2 machines the Hash Baseline fails with a memory error while Our System completes)

To conclude, our experiments show that by improving the utilization of machines, we are able to reduce the runtime of record linkage over skewed distributions.
Chapter 5 Conclusion

In this paper, we introduced a new partitioning algorithm, the RLP, for use in Hadoop for record linkage problems. The RLP partitioner considers the workload in pairwise record comparisons and dynamically balances the workload assigned to parallel tasks. This enables the distribution of records to abide more closely by Hadoop's assumption that tasks are assigned the same amount of work, and is our unique contribution to record linkage. Our system combines the RLP with our two-pass workflow, which uses block-based parallelism followed by match-based parallelism. Our implementation of the subset replication method (for match-based parallelism) does not touch the Hadoop internals and is generalizable to other Hadoop-based problems. Subset replication is a well-known method for handling data skew (DeWitt, Naughton, Schneider, & Seshadri, 1992) but, to the best of our knowledge, has never before been implemented in Hadoop. Our evaluation reveals that our system significantly outperforms Hadoop's default partitioner for record linkage problems involving data skew. As a secondary contribution, our modifications to Febrl's data generator will allow individuals interested in synthetic big data to generate gigabyte-sized datasets.

This paper shows that Hadoop can be tailored for record linkage tasks. The RLP and its associated workflow are just one example of possible improvements amongst many. Because Hadoop-based record linkage has not been very actively explored, there are many possible directions for future work. We plan to make further refinements to our dynamic record linkage partitioner. Our partitioner focuses on evenly balancing record comparisons, and our improvements
in runtime suggest that this is a step in the right direction. However, we note that the runtimes of Reduce tasks in the first pass are not yet perfectly balanced. While less skewed, tasks designated to process large blocks still take longer to finish. This suggests that our partitioner could be further tuned to aggressively assign more record comparisons to tasks without large blocks. Alternatively, adding the cost of clustering records into the partitioning equation promises to be an interesting direction for future work.

We would also like to add sampling as a source of domain knowledge. Our research uses domain knowledge to obtain an improvement in runtime primarily by parsing external sources of information. We also implemented a solution that obtains this information by performing counts. A third methodology, sampling of the input data, is left unexplored. However, we are aware that Hadoop has an InputSampler class, and hence this addition should not be too complicated. Sampling is likely to be a valuable addition to our partitioning implementation because it enables us to dynamically obtain the requisite information at little cost.

Lastly, we recognize that Hadoop-specific customizations need to be made to the open source record linkage toolkit (Tan, 2010) we utilized. Our implementation relies on libraries from a record linkage toolkit provided by the Web Information Retrieval / Natural Language Processing Group at the National University of Singapore. While we were successful at fusing this toolkit with Hadoop, a rewrite is needed to convert its data structures to mutable encodings. This would allow us to perform experiments with larger datasets, which we believe would yield stronger experimental results. However, at this point of writing, we have been unable to do so because of memory problems. (Jiang et al., 2010) observed that immutable decoding introduces high overhead in MapReduce, and indeed this seems to be our situation. Inspecting the heap dumps, we see that our memory restrictions are largely caused by immutable data structures resulting in a large number of object creations. Converting these data structures will also pave the way for the development of auxiliary functions to facilitate data canonicalization in Hadoop.
References

Amdahl, G. (1967). Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. ACM.
Baxter, R., Christen, P., & Churches, T. (2003). A comparison of fast blocking methods for record linkage. ACM SIGKDD, Vol. 3.
Borthakur, D. (2007). The Hadoop Distributed File System: Architecture and design. Retrieved from lucene.apache.org/hadoop.
Christen, P., Churches, T., & Hegland, M. (2004). A parallel open source data linkage system. In Springer Lecture Notes in Artificial Intelligence, Sydney, Australia.
Cole, R. (2008). Parallel merge sort. 27th Annual Symposium on Foundations of Computer Science, 1986.
DeWitt, D., & Gray, J. (1992). Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6).
DeWitt, D., Naughton, J., Schneider, D., & Seshadri, S. (1992). Practical skew handling in parallel joins. In Proceedings of the International Conference on Very Large Data Bases.
Elmagarmid, A., Ipeirotis, P., & Verykios, V. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1).
Elsayed, T., Lin, J., & Oard, D. (2008). Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies.
Fellegi, I., & Sunter, A. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328).
Flatt, H. (1989). Performance of parallel processors. Parallel Computing, 12(1).
Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries. ACM.
Hernández, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3).
Jaro, M. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406).
Jiang, D., Ooi, B., & Shi, L. (2010). The performance of MapReduce: An in-depth study. Proceedings of the VLDB Endowment, Vol. 3.
Kawai, H., Garcia-Molina, H., Benjelloun, O., & Menestrina, D. (2006). P-Swoosh: Parallel algorithm for generic entity resolution. Technical report, Stanford University.
Kim, H.-s., & Lee, D. (2007). Parallel linkage. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07), New York, NY, USA. ACM Press.
Kolb, L. (2010). Parallel sorted neighborhood blocking with MapReduce. arXiv preprint.
Lin, J. (2008). Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Lin, J. (2009). Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
McCallum, A., Nigam, K., & Ungar, L. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
Mehrotra, S. (2003). Efficient record linkage in large data sets. In Proceedings of the Eighth International Conference on Database Systems for Advanced Applications (DASFAA 2003).
Miyazima, S., & Lee, Y. (2000). Power-law distribution of family names in Japanese societies. Physica A: Statistical Mechanics and its Applications.
Newcombe, H., Kennedy, J., Axford, S., & James, A. P. (1959). Automatic linkage of vital records. Science.
On, B., Lee, D., Kang, J., & Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries.
Ravikumar, P., & Cohen, W. (2004). A hierarchical graphical model for record linkage. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence.
Tan, Y., Kan, M., & Lee, D. (2006). Search engine driven author disambiguation. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries.
Tan, Y. F. (2010). Record Matching Package. Retrieved from tanyeefa/downloads/recordmatching/.
Taskar, B., Wong, M., Abbeel, P., & Koller, D. (2003). Link prediction in relational data. In Neural Information Processing Systems, 15.
Vernica, R., Carey, M. J., & Li, C. (2010a). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), New York, NY, USA. ACM Press.
Vernica, R., Carey, M. J., & Li, C. (2010b). Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), New York, NY, USA. ACM Press.
Winkler, W. E. (2006). Overview of record linkage and current research directions. Bureau of the Census.
Winkler, W. E., & Thibaudeau, Y. (1991). An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. decennial census. Statistical Research Report Series RR91/09.
Zaharia, M., Borthakur, D., & Sarma, J. S. (2010). Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. Proceedings of the ...
Zanette, D. H., & Manrubia, S. C. (2001). Vertical transmission of culture and the distribution of family names. Physica A: Statistical Mechanics and its Applications.
Appendix A Hadoop settings

The following configurations were made to Hadoop:

- dfs.replication was pegged to the number of active machines to control for any effect from data non-locality.
- mapred.reduce.tasks.speculative.execution was turned off to facilitate a clear picture of cluster efficiency.
- mapred.reduce.max.attempts was set to 1 and mapred.task.timeout was increased, so that jobs with failed attempts were considered invalid but each task was given more time to complete.
- mapred.child.java.opts: -Xmx6144M -DentityExpansionLimit= -XX:-UseGCOverheadLimit. Memory properties to lessen the memory restrictions from immutable string encoding.
- io.file.buffer.size:
- fs.inmemory.size.mb: 200
- dfs.namenode.handler.count: 40
- io.sort.mb: 200
- io.sort.record.percent: 0.05
- io.sort.spill.percent: 0.80

These values are simply best practices recommended by the Hadoop community.
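For illustration, the same properties could also be applied programmatically through Hadoop's Configuration API rather than the cluster configuration files, as in the hedged sketch below. The old mapred.* property names match the Hadoop generation used for this work; properties whose values are elided above are simply left unset here.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch of applying the appendix settings via the Configuration API.
public class JobSettings {
    public static Configuration tunedConfiguration(int activeMachines) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", activeMachines);                       // pegged to active machines
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);  // no speculative reduces
        conf.setInt("mapred.reduce.max.attempts", 1);                         // failed attempts invalidate the job
        conf.set("mapred.child.java.opts", "-Xmx6144M -XX:-UseGCOverheadLimit");
        conf.setInt("fs.inmemory.size.mb", 200);
        conf.setInt("dfs.namenode.handler.count", 40);
        conf.setInt("io.sort.mb", 200);
        conf.setFloat("io.sort.record.percent", 0.05f);
        conf.setFloat("io.sort.spill.percent", 0.80f);
        return conf;
    }
}
```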
Processing Large Amounts of Images on Hadoop with OpenCV
Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru
Introduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
GraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems
215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
How To Test For Elulla
EQUELLA Whitepaper Performance Testing Carl Hoffmann Senior Technical Consultant Contents 1 EQUELLA Performance Testing 3 1.1 Introduction 3 1.2 Overview of performance testing 3 2 Why do performance testing?
IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA
IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA Jayalatchumy D 1, Thambidurai. P 2 Abstract Clustering is a process of grouping objects that are similar among themselves but dissimilar
Distributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Scalable Parallel Clustering for Data Mining on Multicomputers
Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti and D. Talia ISI-CNR c/o DEIS, UNICAL 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it Abstract. This
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Distributed Apriori in Hadoop MapReduce Framework
Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
PARALLEL PROCESSING AND THE DATA WAREHOUSE
PARALLEL PROCESSING AND THE DATA WAREHOUSE BY W. H. Inmon One of the essences of the data warehouse environment is the accumulation of and the management of large amounts of data. Indeed, it is said that
Load Balancing in MapReduce Based on Scalable Cardinality Estimates
Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching
