Record Linkage in an Hadoop Environment



Undergraduate Research Opportunity Program (UROP) Project Report

Record Linkage in an Hadoop Environment

By Huang Yipeng (U096926R)

Department of Computer Science
School of Computing
National University of Singapore
2010/11

Project No: U079310
Advisor: A/P Min-Yen Kan, Ng Jun Ping
Deliverables:
Report: 1 Volume
Source Code: 1 DVD

Abstract

The recent development of the MapReduce paradigm has lessened the difficulty of distributed computing, providing users with a black box capable of handling many of the difficulties of parallel computing. However, little work has been published on MapReduce-based record linkage. We study how the generic MapReduce framework can be tailored for record linkage problems. In particular, we note that blocking-based parallelism of record linkage problems is hampered by uneven partitioning. We introduce a partitioning solution that dynamically balances the record comparison workload and distributes it using subset replication and match-based parallelism to spread out data skew. Our evaluation shows that our solution consistently outperforms the baseline for sufficiently skewed distributions.

Subject Descriptors:
H.2.4 [Database Management]: Systems (concurrency, parallel databases)
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords: Record Linkage, MapReduce, Performance

Implementation Software and Hardware: CentOS, Java, Hadoop 0.20.2

Acknowledgement

I would like to thank my supervisors A/P Kan Min-Yen and Mr Ng Jun Ping, and also acknowledge the rest of the WING group for their kind feedback.

List of Figures

3.1 Relation of Record Linkage to MapReduce stages
3.2 A De-duplication Example
3.3 Balanced partitions / Uneven task workloads
3.4 Improved task partitioning (First Pass)
3.5 Improved task partitioning (Second Pass)
3.6 Grid representation of remaining workload after the first pass
4.1 Reduce task running time after Hash partitioning
4.2 Frequency distribution of first names


Chapter 1

Introduction

Databases contain a trove of information that helps us to understand specific domains or world knowledge, particularly when information is merged across silos. The best observations are made when analysis is performed on actual data. Data derived from different databases may not be mutually exclusive and must be cleaned to prevent erroneous results. There is thus a need to link records across databases, identifying similar records and duplicates. Record linkage (also known as record matching, duplicate record detection, name disambiguation, data cleaning, merge-purge, entity resolution, object matching, and approximate text join) is the process of making connections between similar records.

With multiple databases residing on multiple computers comes the possibility of distributed computation, which can make record linkage over large datasets more efficient. The idea has previously been examined (Kim & Lee, 2007; Kawai, Garcia-Molina, Benjelloun, & Menestrina, 2006; Christen, Churches, & Hegland, 2004) but has drawn little attention from academia in the last couple of years. One possible explanation is that the difficulty of distributed computing made it, until recently, more trouble than it was worth. These difficulties include efficient data distribution, fault tolerance, dynamic load balancing, portability and scalability (Christen et al., 2004). Thankfully, the MapReduce paradigm has rescued its users from many of the above-mentioned concerns. Additionally, MapReduce has enjoyed popularity since its introduction, leading to the creation of open source implementations such as Hadoop (Borthakur, 2007). This suggests that the time may be ripe to revisit parallel record linkage.

In this paper, we study how the generic MapReduce framework can be tailored for record linkage problems. In particular, we note that effective parallelism of record linkage problems is hampered by uneven partitioning. Suppose, for example, that we are given two machines and the following records for de-duplication: {(Emily,..., 1), (Emily,..., 2), (Joshua,..., 1),..., (Joshua,..., n)}. By selecting an appropriate key that is shared by duplicate records, we can reduce the comparison-space. For example, if the name field is selected as the key, we avoid comparing Joshua records with Emily records. Consequently, records are split into the following partitions:

machine 1: {(Emily,..., 1), (Emily,..., 2)}
machine 2: {(Joshua,..., 1),..., (Joshua,..., n)}

This occurs because MapReduce assigns records with the same key to the same partition (property 1) to be compared, and creates non-overlapping partitions. Skewed workloads follow when

4) We propose an overall load balancing solution for record linkage problems that combines the RLP, subset replication, match-based parallelism, and a merging phase. It reduces the absolute runtime on highly skewed datasets by increasing the utilization of machines assigned to a problem.

This paper is organized as follows. In Section 2, we survey related work. In Section 3, we describe the RLP and its associated partitioning solution. In Section 4, we evaluate the overall effectiveness of our solution for record linkage problems. Finally, in Section 5, we discuss future directions before concluding in Section 6.

Chapter 2

Related Work

The scope of record linkage work is broad, so we highlight only a selection of fundamental and current developments before turning to the application of record linkage to large datasets and the weaknesses of existing approaches.

2.1 Identifying Record Linkages

To understand how and when linkages should be made, it is helpful to think of the linkage problem as a duplication problem. Records that match exactly should be linked, and one of them considered a duplicate. To identify duplicates, we perform pairwise comparisons of records. Unfortunately, simply looking for exact matches may not weed out all duplicates; linkage must also consider inexact matches within a threshold of similarity. A further explanation of this can be found in (Mehrotra).

The need for inexact matching was first suggested by Newcombe (Newcombe1959), who linked marriage and birth records to better understand relationships within a human population. He noted two problems: 1) ambiguous matches, such as when different individuals share the same name, and 2) typographical errors preventing records from being associated with the same individual. The two problems can be thought of as the requirements for precision (identifying a linkage as a match) and recall (identifying all matching linkages). Correct and comprehensive identification are at the heart of record linkage problems because they determine the linkage's

usefulness. As a testament to Newcombe's perception, these problems are still being addressed today in the pairwise comparison and clustering stages (Jain1999) of record linkage.

Newcombe's linkage method was simple: he subtracted the frequency of a record's agreement from the frequency of its disagreement. He based this on six fields (name, country, age, etc.), and noted that no individual field was reliable in connecting two records. The outcome of his experimentation was very positive: 98.3% were genuine linkages and only 0.7% were false positives. That said, his early success did not mean that the same quality could be expected on more generic datasets across the board. Newcombe did not explain how he selected the six fields used for comparison, and the dataset he used was probably one of high quality, since it related to the medical domain where mistakes and blank fields are understandably less prevalent.

Fellegi and Sunter, following Newcombe's intuitions, showed the optimality of the probabilistic decision rule and developed its formal mathematical model (Fellegi1969). The next major development came from Winkler's adaptation of the Expectation-Maximization algorithm to estimate parameters which tightened probabilistic decision rules (Winkler1991). In the last 50 years, an abundance of string comparators (Elmagarmid2007), which are used to identify linkages, have been developed, and these metrics have become highly accurate. More recently, approaches to record linkage include unsupervised graphical methods (Ravikumar2004), supervised vector-space methods (Han2004), multiple passes (On2005), and the exploitation of internal (Taskar2003) and external (Tan2006) sources of information.

In his own experiments, Newcombe found record linkage to be reliable. Speed was a problem, and Newcombe looked expectantly to faster computer speeds to improve the speed of record linkage. It seems that many of Newcombe's first intuitions still hold today.
Accuracy is less of a problem than efficiency. However, while our processing power has gone up, so has the size of datasets in general. For example, the Internet Archive website reports that it contains almost 2 petabytes of data and is growing at a rate of 20 terabytes per month. Having identified early and more current developments in record linkage, we now describe its application to large datasets and explain why it is a problem. Readers looking for a more comprehensive survey of record linkage should read (Elmagarmid2007; Winkler2006).

2.2 Dealing with Large Data

Consider the de-duplication of a single list. The pairwise comparison in this record linkage task is an O(nC2) operation, where n is the total number of records. As n grows large, the comparison step becomes increasingly expensive, because each record must be compared against every other record. To alleviate this, less costly operations have been developed to reduce the size of n and bring down the runtime. These include blocking (Jaro1989), sorted neighborhoods (Hernandez1998), and canopies (McCallum2000b).

These methods all work by reducing the problem search-space. Blocking uses an approximate distance metric to create subsets and limits a record's comparisons to other records within its subset. Canopies extend blocks to allow overlapping subsets. Sorted neighborhoods first sort records by a key so that similar records have spatial locality, then slide a scanning window over the list, comparing only records that fall within the window. These techniques have helped to reduce the algorithmic runtime of record linkage problems, but as datasets grow even larger, efficiency remains a recurring problem, and other complementary techniques are needed.

Since the above-mentioned techniques are not the focus of this paper, we will not compare them (see (Baxter2003) for a comparative survey). Instead, we would like to point out some limitations that these techniques share. Firstly, they have been designed to run on a single machine; consequently, even though extra machines may be available, they cannot be utilized. If such idle computational time were utilized, computational bottlenecks could be prevented elsewhere. Secondly, none of these techniques consider the possibility that data is stored in different locations, a situation that reflects the reality of databases today. Lastly, these techniques are not complementary methods of reducing runtime; they all belong to one category of techniques: blocking.
We suggest parallelism as an approach complementary to the techniques described in this section.
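The blocking idea above can be made concrete with a short sketch (illustrative Java, not code from this report; the comma-separated "firstName,lastName,id" record format and the first-name key are assumptions): records sharing a blocking key are grouped, and pairwise comparison is confined to each group.

```java
import java.util.*;

public class BlockingSketch {
    // Group records by a blocking key -- here, the first field of a
    // hypothetical "firstName,lastName,id" record.
    static Map<String, List<String>> block(List<String> records) {
        Map<String, List<String>> blocks = new HashMap<>();
        for (String r : records) {
            String key = r.split(",")[0];
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
        }
        return blocks;
    }

    // Comparisons needed within one block: nC2 for a block of n records,
    // instead of NC2 over the entire list.
    static long pairsWithin(int n) {
        return (long) n * (n - 1) / 2;
    }
}
```

For four records split into two blocks of two, this reduces the comparison count from 6 to 2, mirroring the search-space reduction described above.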

2.3 Shift to Parallel Computing

Parallel computing is bounded by Amdahl's law (Amdahl1967) and constrained by synchronization and communication overhead (Flatt1989). Consequently, simply adding nodes to a task will not lower runtimes beyond a certain point. However, this does not mean that significant improvements cannot be made. The main objectives of parallelism are speedup (the reduction of algorithmic runtime) and scaleup (the increase in the work-to-machine ratio) (DeWitt1992). A case in point is (Kim2007), which reports parallelized record linkage algorithms that ran 6.55 times faster than sequential algorithms by using multiple CPUs.

While parallel algorithms have been developed for many problems, very little literature seems to be devoted to parallel record linkage. Peter Christen was a front-runner who began to parallelize Febrl (Christen2004), a record linkage system, in 2004. It was parallelized with MPI and able to achieve near-linear improvement on costly pairwise comparison and classification operations (linear is ideal; see (DeWitt1992)), but its memory usage did not scale well as the dataset grew. Christen first introduced blocking-based parallelism in (Christen2002a). The problem of scaling blocking-based parallelism (which distributes and processes different blocks in parallel) is essentially the data skew problem. Sadly, Febrl's parallel direction seems to have been discontinued. In retrospect, this is not surprising, since (Christen2004) began its section on parallelism with a long list of parallel programming issues yet to be addressed, including efficient data distribution, fault tolerance, dynamic load balancing, portability and scalability.

P-Swoosh (Kawai2006) is a match-based parallel algorithm which divides comparison pairs over multiple machines. It is worth mentioning because it is a fairly comprehensive attempt to define a suitable architecture for parallelizing record linkage.
P-Swoosh had a novel method of matching, dividing tasks between nodes based on matches. Records were first compared serially within a window, and distributed computation was used only on records that had already been identified as non-matching, with the hope of achieving a complete rather than approximate result. It showed a 2x improvement in speedup with the use of domain knowledge, but it is not without problems for practical use. P-Swoosh was implemented in an emulated, shared-memory environment (machines share global memory), whereas shared-nothing (machines connected

by network) is currently the industry consensus for parallel database architecture. It will be interesting to see whether the experiments replicate well on real-world datasets and systems.

Our intuition is that combining blocking-based parallelism (to quickly reduce the comparison-space) with matching-based parallelism (to spread out data skew) will enable us to enjoy the best of both worlds and obtain a further reduction in runtime.

2.4 Why MapReduce?

Although none of (Christen2004; Kawai2006; Kim2007) used MapReduce for their experiments, we believe it is a viable solution for parallel record linkage because it abstracts away a significant amount of the complexity of the parallel programming paradigm that has prevented other parallel solutions from moving forward. There is also support from academia for harnessing MapReduce for record linkage. A detailed argument for MapReduce's suitability can be found in (Lin2008), and two research groups, from the University of Maryland and the University of California, Irvine, have published papers related to record linkage using the MapReduce programming model (Vernica2010a; Lin2009; Vernica2010). (Vernica2010) studied the performance of pairwise comparisons and achieved scaleup significantly better than (Christen2004). Another study found that running time increased linearly with collection size (Vernica2010a). Thus it appears that MapReduce offers a convenient model for scaling record linkage tasks.

Most recently, (Kolb2010) parallelized sorted neighborhood blocking in MapReduce. However, there is still some way to go before a set of record linkage methodologies compatible with MapReduce is comprehensively developed. (Vernica2010a) and (Lin2009) focused on document collections, but experimentation with other types of data is still absent.
(Vernica2010a) and (Vernica2010) have made substantial contributions by presenting an indexed approach and different join algorithms respectively, but experiments have yet to cover different types of record matching problems, e.g. clean vs dirty lists and dirty vs dirty lists, as in (Kim2007). (Kolb2010) also noted three problems of typical record linkage workflows in MapReduce: 1) disjoint data partitioning, 2) load balancing, and 3) memory bottlenecks. While

(Kolb2010) focused on disjoint partitioning and (Vernica2010) on memory bottlenecks, but work dealing with the balancing of data skew in MapReduce-based record linkage is noticeably absent. Our work in this paper tackles problem 2: the load balancing of skewed distributions.

Chapter 3

Methodology

3.1 The MapReduce Programming Model

MapReduce is a shared-nothing framework for distributed computing best suited to large-scale data processing. All data must be processed as (key, value) pairs. MapReduce has two primary user-defined stages, Map and Reduce. Generally speaking, a problem is mapped out to multiple machines, partitioned into smaller units of work, and then reduced into a single solution. Programmatically, this is possible because the Map() and Reduce() functions are stable and can be executed as individual tasks on different machines. There is also a Partition() function that is subsumed under the Map stage in most literature.

The first challenge for parallel record linkage is to fit the record linkage stages to the MapReduce model. Our implementation follows Figure 3.1. We suggest performing data canonicalization (the standardization of data) as a form of preprocessing, simply because the Map stage provides a suitable platform for parallel data manipulation. However, Hadoop's ChainReducer class also provides users the ability to perform canonicalization later. The Partition stage is a suitable place to perform standard blocking because it outputs records as key-value pairs and channels records with the same key to the same Reduce task. To illustrate the synergy between record linkage and MapReduce, consider Figure 3.2. As with the previous and all subsequent examples, blocking occurs on the first name.
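The Map/Partition/Reduce decomposition just described can be simulated without a cluster. Below is a minimal, Hadoop-free sketch (the record format, the first-name blocking key, and the comparator that matches records on identical name fields are illustrative assumptions): mapKey() keys each record, shuffle() groups by key as Hadoop's Partition stage would, and the reduce-side comparison runs pairwise within one block only.

```java
import java.util.*;

public class DedupSketch {
    // Map stage: emit (blocking key, record), blocking on the first name.
    static String mapKey(String record) {
        return record.split(",")[0];
    }

    // Shuffle: group mapped records by key (Hadoop does this between stages,
    // sending each key's group to one Reduce task).
    static Map<String, List<String>> shuffle(List<String> records) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String r : records)
            groups.computeIfAbsent(mapKey(r), k -> new ArrayList<>()).add(r);
        return groups;
    }

    // Reduce stage: pairwise comparison within a single block; here two
    // records "match" when both name fields are identical.
    static int countMatches(List<String> group) {
        int matches = 0;
        for (int i = 0; i < group.size(); i++)
            for (int j = i + 1; j < group.size(); j++) {
                String[] a = group.get(i).split(",");
                String[] b = group.get(j).split(",");
                if (a[0].equals(b[0]) && a[1].equals(b[1])) matches++;
            }
        return matches;
    }
}
```

With four records in two blocks, only one comparison runs per block, as in the de-duplication example of Figure 3.2.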

Figure 3.1: Relation of Record Linkage (left) to MapReduce (right) stages

Figure 3.2: A De-duplication Example

Four records are divided into two maps and sorted on their first name field (the key). The output tuples of the Map stage are then shuffled to their respective Reduce tasks. This strategy is essentially blocking-based parallelism, since records with the same key must end up in the same block. The solution set is simply the matching records from pairwise comparisons within the same Reduce task. In this case, only two comparisons (one in each Reduce) are needed, and the output is de-duplicated.

3.2 Parallelizing Data Skew

Our goal is to ensure that task workloads are fairly even despite data skewness. When the distribution of records is uniform, Hadoop's default partitioner assigns a balanced workload to each (Reduce) task. When the distribution is skewed, however, it fails to do so. This occurs because the partitioner hashes on Hash(Key) mod M, where M is the number of machines. It follows that records with the same hash unavoidably end up in the same partition. This means that if there is a significantly larger number of records with Joshua (for example) in the first name field than any other first name, the workload handled by the task assigned to the Joshua partition will be greater than that of other tasks with the same number of records, and can even be greater than the combined workload of many other tasks, because of pairwise comparison's nC2 rate of growth. This is illustrated by Figure 3.3. The two columns on the left represent the distribution of records, and the two columns on the right their corresponding workloads in terms of comparisons. Within a partition, the task categorizes records into blocks based on their block key. Hence 7 Joshua records belong to one block, 2 Emily records to

Figure 3.3: Balanced partitions but uneven task workloads

Our intuition is that better performance can be obtained if some domain knowledge is applied to the record partitioning strategy. Our system partitions based on the number of

larger.

Figure 3.4: Improved task partitioning (First Pass)

Figure 3.5: Improved task partitioning (Second Pass)

3.3.1 First Pass (Balancing skew in blocking-based parallelism)

Our overall objective is to reduce the effects of data skew on runtime. In the first pass, we do this by balancing the number of comparisons given to each data partition in the RLP. Since each Reduce task is assigned a separate partition, this allows us to balance the computational workload of the tasks, thereby ensuring tasks finish at approximately the same time. In our algorithm, the comparison workloads of distinct blocks are computed and assigned, in an online fashion, to the partition with the fewest comparisons assigned thus far. If the distributions of the datasets are known a priori (as in our case), they can be used to compute the comparison workload; where they are unavailable, they can be calculated by performing counts or sampling during pre-processing. Our system thus enables a more even record distribution among tasks.

When the comparison workload of a block exceeds the average comparison workload per machine, our algorithm divides it up and assigns it to the fewest number of machines required for each cut to fall below the average comparison workload. We compute the average comparison workload by adding up the total comparison workload and dividing it by the number of available nodes. The average comparison workload must be decremented after each cut, because comparisons only occur between records in the same block cut, and comparisons between block cuts are left to the

next pass, as in Figure 3.5. In addition to ensuring that no individual block exceeds the average workload per machine, our system also ensures that the total workload assigned to each machine does not exceed the average workload at any point. An overshoot can occur because record comparisons are balanced in an online fashion. When this fault occurs, we set the current average as the starting average and recurse from the beginning of the block assignment. Consequently, when larger blocks are broken into more partitions, more comparisons are deferred from the first pass, and an even assignment of record comparisons becomes possible. This method has the following properties: 1) it eventually stabilizes to an average workload that ensures a similar computational load, and thus a similar runtime, for each task; 2) it restricts the amount of work in the next round, which handles cross-partition linkages, since it assigns block cuts to a minimal number of partitions.

3.3.2 Second Pass (Shifting to match-based parallelism)

The objective of the second pass is to ensure that cross-duplicates from different partitions are detected. However, falling back to the standard MapReduce workflow is not efficient. Our system reduces runtime through a divide-and-conquer approach that increases the utilization of available machines. We illustrate this with an example. Suppose only records labeled Joshua exceed the average workload in the first pass, and the Joshua records are divided into 3 partitions A, B, and C. Then in the second pass, only comparisons of Joshua records between A, B, and C need to be processed; all other intra-partition linkages were already considered in the first pass. The requisite comparisons can be represented as a grid, as in Figure 3.6. Note that each block that exceeds the average workload is represented on a separate grid.
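The first-pass assignment of Section 3.3.1 can be sketched as a greedy procedure (illustrative Java, not the report's implementation; the even-split cutting rule and the fixed average are simplifying assumptions): each block's comparison workload is computed, oversized blocks are cut into the fewest pieces whose intra-cut workload falls below the per-machine average, and every piece goes to the least-loaded partition.

```java
import java.util.*;

public class FirstPassBalancer {
    // nC2 pairwise comparisons for a block of n records.
    static long comparisons(long n) { return n * (n - 1) / 2; }

    // blockSizes: record count per block key. Returns the comparison
    // workload assigned to each of `machines` partitions after the first
    // pass. Cross-cut comparisons of divided blocks are deferred to the
    // second pass, so the loads sum to at most the total workload.
    static long[] assign(Map<String, Long> blockSizes, int machines) {
        long total = 0;
        for (long n : blockSizes.values()) total += comparisons(n);
        double avg = (double) total / machines;

        long[] load = new long[machines];
        for (long n : blockSizes.values()) {
            // Fewest cuts such that each cut's intra-cut workload fits
            // under the per-machine average.
            int cuts = 1;
            while (cuts < machines && comparisons((n + cuts - 1) / cuts) > avg)
                cuts++;
            for (int c = 0; c < cuts; c++) {
                long size = n / cuts + (c < n % cuts ? 1 : 0); // even split
                int target = 0;               // least-loaded partition so far
                for (int m = 1; m < machines; m++)
                    if (load[m] < load[target]) target = m;
                load[target] += comparisons(size);
            }
        }
        return load;
    }
}
```

With the running example of 7 Joshua and 2 Emily records on two machines, the Joshua block (21 comparisons) exceeds the average of 11 and is cut into pieces of 4 and 3 records, giving loads of 6 and 4 comparisons; the 21-vs-1 split the hash partitioner would produce is avoided, with cross-cut comparisons deferred to the second pass.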
We mark the diagonal with x to reflect that Joshua records within partitions A, B, and C have already been de-duplicated in the first pass. This is unlike regular match-based parallelism, which treats all squares on the grid equally. Additionally, since pairwise comparisons are symmetric, we can ignore the gray region. Hence, we only need to compare records in the white regions, i.e. {A, B},

{A, C} and {B, C}.

Figure 3.6: Grid representation of remaining workload after the first pass

The standard MapReduce workflow requires all Joshua records to be processed by one Reduce task because they share the same block key. With reference to Figure 3.6, the entire lower triangle must be processed by a single machine. The availability of other machines is inconsequential because no work can be assigned to them. Our system applies domain knowledge from the first pass to utilize the available machines. From the first pass, we know which keys (e.g. Joshua records) need further de-duplication, and also which partitions they reside in. Hence, it is possible for us to supply as input only the records that still need to be compared and to assign a specific set of partition-comparisons to each machine as
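The white cells of the grid just described can be enumerated directly (a sketch, assuming partitions are identified by simple labels): for a block cut across partitions A, B, and C, only the unordered cross-partition pairs need comparing, since the diagonal was handled in the first pass and the symmetric gray half is redundant.

```java
import java.util.*;

public class SecondPassPairs {
    // Enumerate the cross-partition comparison sets for one divided block:
    // the white region of the grid -- no diagonal (intra-partition work was
    // done in the first pass) and no symmetric gray half.
    static List<String> crossPairs(List<String> partitions) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i++)
            for (int j = i + 1; j < partitions.size(); j++)
                pairs.add("{" + partitions.get(i) + "," + partitions.get(j) + "}");
        return pairs;
    }
}
```

For k partitions this yields kC2 comparison sets, each of which can be handed to a separate machine; this is how the second pass raises the utilization of otherwise idle machines.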

3.4 Theory of Utilization

As shown above, the proposed record linkage workflow yields better utilization than the typical hash-based MapReduce workflow when handling data skew. We believe that higher utilization results in faster runtimes. In this section, we lay out our observations on the theoretical performance comparison between the two workflows. Suppose a very large workload is assigned to machine A and a negligible workload to machine B. After machine B completes its task, the average machine utilization for the job drops to 50%. If there are 3 machines, with A assigned a very large workload while B and C are assigned negligible workloads, then after B and C complete their tasks, utilization

Chapter 4

Evaluation

4.1 Experimental Evaluation

We now benchmark the performance of our RLP against the hash partitioning baseline that comes standard with Hadoop. Taking reference from earlier work in MapReduce-based record linkage (Lin, 2008; Vernica, Carey, & Li, 2010b), we measure performance in absolute running

All work reported in this paper used synthetic personal records from Febrl's data generator (Christen et al., 2004). However, we partially rewrote the generator to write records to disk at regular intervals instead of keeping them in main memory. This enabled us to generate gigabyte-sized datasets. To keep the de-duplication cases simple, the modification probabilities of the first and given name (which we used to block and cluster duplicates) were set to zero, and the number of duplicates was set at 10% of the originals. More details are given in the individual experiments below.

4.1.1 Hash Baseline

Our first experiment verifies that record linkage performance under data skew is a genuine problem in Hadoop. We generated a dataset of 5000 personal records, increasing the frequency of Joshua in the first name field to simulate data skew. The generator labeled 3292 of the 5000 records with Joshua. We then ran a de-duplication job in Hadoop over this data.

Figure 4.1: Reduce task running time after Hash partitioning

Figure 4.1 shows the running time of all Reduce tasks in our experiment. As expected, the default partitioner was unable to assign even task workloads because it assigned all records with the same hash to the same block. Task 8 was assigned the Joshua key and processed all 3292 records. In contrast, the largest block that Task 9 was assigned had only 21 records.
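A back-of-envelope check makes the imbalance concrete (a sketch; the 3292 and 21 record counts come from the experiment above): since a block of n records costs nC2 pairwise comparisons, the Joshua block dwarfs every other task's largest block.

```java
public class SkewArithmetic {
    // nC2 pairwise comparisons for a block of n records.
    static long comparisons(long n) { return n * (n - 1) / 2; }

    public static void main(String[] args) {
        long joshua = comparisons(3292);      // Task 8's dominant block
        long largestOther = comparisons(21);  // Task 9's largest block
        // The skewed block costs over 25,000 times as many comparisons.
        System.out.println(joshua + " vs " + largestOther);
    }
}
```

This quadratic growth is why a partition holding 66% of the records ends up with far more than 66% of the work.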

4.1.2 Runtime Comparison Analysis

We compare our system against the hash baseline described in the previous section. Table 4.1 shows the results of running the duplicate detection job over 5 runs on a skewed dataset (Dataset A). Unlike the previous experiment, we created the skewed dataset by quadrupling the population size in the distribution (see Figure 4.2) instead of simply inflating a single frequency entry. This is a more realistic distribution because Febrl's frequency distribution resembles a power law.

Figure 4.2: Frequency distribution of first names

Dataset A: 18000 records, skewed by quadrupling population size

             Hash Baseline   Our System
2 machines   193             111
3 machines   149.6           134.6
4 machines   143.4           94.2

Table 4.1: Average runtime to completion in seconds

Empirically, we see that our RLP system obtains a superior running time by balancing the record comparisons on 2, 3 and 4 machines. Significantly, our system is able to keep the same runtime for this workload (better, in fact) while using fewer machines. A secondary observation from Table 4.1 is that our performance on 2 and 4 machines was better than our performance on 3 machines. Closer analysis revealed that runs on 2 and 4 machines did not incur the additional cost of merging duplicates, because the partitioner was able to balance the record comparison workload without dividing any block of records into more

than two partitions in the 2nd pass. This lends support to our partitioner's design principle of not dividing blocks into more partitions than necessary (even when more machines are available). We extrapolate that our system will continue to show superior performance with more machines, particularly as the size of the dataset grows. This is supported by the result in Table 4.2, which shows the results of running the same duplicate detection job over 5 runs on a larger dataset of 20000 records. While our performance for this workload is comparable with that for Dataset A, the performance of the hash partitioner worsens significantly.

Dataset B: 20000 records, skewed by quadrupling population size

             Hash Baseline   Our System
2 machines   652             110.8
3 machines   601.2           177.2
4 machines   577.8           100

Table 4.2: Average runtime to completion in seconds

The RLP's balancing of record comparisons has the additional benefit of reducing the amount of memory required by any one machine. This explains why our system is able to complete jobs that the hash partitioner cannot handle, as exhibited in Table 4.3. Due to time constraints, however, we do not have memory benchmarks available.

Dataset C: 25000 records, skewed by quadrupling population size

             Hash Baseline   Our System
2 machines   Memory error    127.6

Table 4.3: Average runtime to completion in seconds

To conclude, our experiments show that by improving the utilization of machines, we are able to reduce the runtime of record linkage over skewed distributions.
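The quadratic growth of pairwise comparisons is what makes balancing by comparisons, rather than by record counts, so important. A short sketch (our own illustration, reusing the block sizes from the hash baseline experiment in Section 4.1.1) shows how a single large block dominates the comparison workload:

```python
def pairwise_comparisons(block_size):
    """Number of record pairs a reduce task must compare within one block:
    n * (n - 1) / 2 for a block of n records."""
    return block_size * (block_size - 1) // 2

# The skewed "Joshua" block versus the largest remaining block.
print(pairwise_comparisons(3292))  # 5416986
print(pairwise_comparisons(21))    # 210
```

A block 157 times larger thus costs over 25000 times as many comparisons, which is why a partitioner that only equalizes record counts still leaves one reducer with almost all of the work.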

Chapter 5 Conclusion

In this paper, we introduced a new partitioning algorithm, RLP, for use in Hadoop for record linkage problems. The RLP partitioner considers the workload in pairwise record comparisons and dynamically balances the workload assigned to parallel tasks. This enables the distribution of records to abide more closely by Hadoop's assumption that tasks are assigned the same amount of work, and is our unique contribution to record linkage. Our system combines the RLP with our two-pass workflow, which uses block-based parallelism followed by match-based parallelism. Our implementation of the subset replication method (for match-based parallelism) does not touch the Hadoop internals and is generalizable to other Hadoop-based problems. Subset replication is a well-known method for handling data skew (DeWitt, Naughton, Schneider, & Seshadri, 1992) but, to the best of our knowledge, has never been implemented in Hadoop. Our evaluation reveals that our system significantly outperforms Hadoop's default partitioner for record linkage problems involving data skew. As a secondary contribution, our modifications to Febrl's data generator will allow individuals interested in synthetic big data to generate gigabyte-sized datasets.

This paper shows that Hadoop can be tailored for record linkage tasks. The RLP and its associated workflow are just one example of possible improvements amongst many. Because Hadoop-based record linkage has not been very actively explored, there are many possible directions for future work. We plan to make further refinements to our dynamic record linkage partitioner. Our partitioner focuses on evenly balancing record comparisons, and our improvements in runtime suggest that this is a step in the right direction. However, we note that the runtimes of reduce tasks in the first pass are not yet perfectly balanced. While less skewed, tasks designated to process large blocks still take longer to finish. This suggests that our partitioner could be further tuned to aggressively assign more record comparisons to tasks without large blocks. Alternatively, adding the cost of clustering records into the partitioning equation promises to be an interesting direction for future work.

We would also like to add sampling as a source of domain knowledge. Our research uses domain knowledge to obtain an improvement in runtime primarily by parsing external sources of information. We also implemented a solution that obtains this information by performing counts. A third methodology, sampling of the input data, is left unexplored. However, we are aware that Hadoop has an InputSampler class, and hence this addition should not be too complicated. Sampling is likely to be a valuable addition to our partitioning implementation because it enables us to dynamically obtain the requisite information at little cost.

Lastly, we recognize that Hadoop-specific customizations need to be made to the open source record linkage toolkit (Tan, 2010) we utilized. Our implementation relies on libraries from a record linkage toolkit provided by the Web Information Retrieval / Natural Language Processing Group at the National University of Singapore. While we were successful at fusing this toolkit with Hadoop, a rewrite is needed to convert its data structures to mutable encodings. This would allow us to experiment with larger datasets, which we believe would yield stronger experimental results. At the time of writing, however, we have been unable to do so because of memory problems. Jiang et al. (2010) observed that immutable decoding introduces high overhead in MapReduce, and indeed this appears to be our situation. Inspecting the heap dumps, we see that our memory restrictions are largely caused by immutable data structures resulting in a large number of object creations. Converting these data structures will also pave the way for the development of auxiliary functions to facilitate data canonicalization in Hadoop.

References

Amdahl, G. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (pp. 483-485). ACM.

Baxter, R., Christen, P., & Churches, T. (2003). A comparison of fast blocking methods for record linkage. ACM SIGKDD, Vol. 3 (pp. 25-27). Citeseer.

Borthakur, D. (2007). The Hadoop distributed file system: Architecture and design. Retrieved from lucene.apache.org/hadoop.

Christen, P., Churches, T., & Hegland, M. (2004). A parallel open source data linkage system. In Springer Lecture Notes in Artificial Intelligence, Sydney, Australia.

Cole, R. (2008). Parallel merge sort. 27th Annual Symposium on Foundations of Computer Science, 1986, 511-516.

DeWitt, D., & Gray, J. (1992). Parallel database systems: The future of high performance database systems. Communications of the ACM, 35(6), 85-98.

DeWitt, D., Naughton, J., Schneider, D., & Seshadri, S. (1992). Practical skew handling in parallel joins. Proceedings of the International Conference on Very Large Data Bases (pp. 27-27). Citeseer.

Elmagarmid, A., Ipeirotis, P., & Verykios, V. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 1-16.

Elsayed, T., Lin, J., & Oard, D. (2008). Pairwise document similarity in large collections with MapReduce. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies (pp. 265-268).

Fellegi, I., & Sunter, A. (1969). A theory for record linkage. Journal of the American Statistical Association, 64(328), 1183-1210.

Flatt, H. (1989). Performance of parallel processors. Parallel Computing, 12(1), 1-20.

Han, H., Giles, L., Zha, H., Li, C., & Tsioutsiouliklis, K. (2004). Two supervised learning approaches for name disambiguation in author citations. Proceedings of the 4th ACM/IEEE Joint Conference on Digital Libraries (pp. 296-305). ACM.

Hernández, M. A., & Stolfo, S. J. (1998). Real-world data is dirty: Data cleansing and the merge/purge problem.

Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264-323.

Jaro, M. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414-420.

Jiang, D., Ooi, B., & Shi, L. (2010). The performance of MapReduce: An in-depth study. Proceedings of the VLDB, Vol. 3.

Kawai, H., Garcia-Molina, H., Benjelloun, O., & Menestrina, D. (2006). P-Swoosh: Parallel algorithm for generic entity resolution. Technical report, Stanford University.

Kim, H.-s., & Lee, D. (2007). Parallel linkage. Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07) (pp. 283-292). New York: ACM Press.

Kolb, L. (2010). Parallel sorted neighborhood blocking with MapReduce. arXiv preprint arXiv:1010.3053.

Lin, J. (2008). Scalable language processing algorithms for the masses: A case study in computing word co-occurrence matrices with MapReduce. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 419-428). Association for Computational Linguistics.

Lin, J. (2009). Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 155-162).

McCallum, A., Nigam, K., & Ungar, L. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 169-178). ACM.

Mehrotra, S. (2003). Efficient record linkage in large data sets. Eighth International Conference on Database Systems for Advanced Applications (DASFAA 2003), 137-146.

Miyazima, S., & Lee, Y. (2000). Power-law distribution of family names in Japanese societies. Physica A: Statistical...

Newcombe, H., Kennedy, J., Axford, S., & James, A. P. (1959). Automatic linkage of vital records. Science.

On, B., Lee, D., Kang, J., & Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 344-353).

Ravikumar, P., & Cohen, W. (2004). A hierarchical graphical model for record linkage. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (pp. 454-461).

Tan, Y., Kan, M., & Lee, D. (2006). Search engine driven author disambiguation. Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 314-315).

Tan, Y. F. (2010). Record Matching Package. Retrieved from http://wing.comp.nus.edu.sg/tanyeefa/downloads/recordmatching/.

Taskar, B., Wong, M., Abbeel, P., & Koller, D. (2003). Link prediction in relational data. In Neural Information Processing Systems, 15.

Vernica, R., Carey, M. J., & Li, C. (2010a). Efficient parallel set-similarity joins using MapReduce. Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10) (pp. 495-506). New York: ACM Press.

Vernica, R., Carey, M. J., & Li, C. (2010b). Efficient parallel set-similarity joins using MapReduce. Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10) (pp. 495-506). New York: ACM Press.

Winkler, W. E. (2006). Overview of record linkage and current research directions. Bureau of the Census.

Winkler, W. E., & Thibaudeau, Y. (1991). An application of the Fellegi-Sunter model of record linkage to the 1990 US decennial census. Statistical Research Report Series RR91/09.

Zaharia, M., Borthakur, D., & Sarma, J. S. (2010). Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. Proceedings of the...

Zanette, D. H., & Manrubia, S. C. (2001). Vertical transmission of culture and the distribution of family names. Physica A: Statistical Mechanics and its...

Appendix A Hadoop settings

The following configurations were made to Hadoop:

- dfs.replication was pegged to the number of active machines, to control for any effect from data non-locality.
- mapred.reduce.tasks.speculative.execution was turned off to give a clear picture of cluster efficiency.
- mapred.reduce.max.attempts was set to 1 and mapred.task.timeout was increased to 600000, so that jobs with failed attempts were considered invalid but each task was given more time to complete.
- mapred.child.java.opts was set to -Xmx6144M -DentityExpansionLimit=2500000 -XX:-UseGCOverheadLimit. These memory properties lessen the memory restrictions from immutable string encoding.

The following values are simply best practices recommended by the Hadoop community:

- io.file.buffer.size: 131072
- fs.inmemory.size.mb: 200
- dfs.namenode.handler.count: 40
- io.sort.mb: 200
- io.sort.record.percent: 0.05
- io.sort.spill.percent: 0.80
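For reference, properties such as these are typically applied in a Hadoop (0.20-era) configuration file as name/value pairs. The fragment below is a sketch only; the split of properties across core-site.xml, hdfs-site.xml and mapred-site.xml depends on the deployment, and only a subset of the settings above is shown:

```xml
<!-- mapred-site.xml (sketch; property placement may differ per deployment) -->
<configuration>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>
  <property>
    <name>mapred.reduce.max.attempts</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>600000</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx6144M -DentityExpansionLimit=2500000 -XX:-UseGCOverheadLimit</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
</configuration>
```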