Large-Scale Data Cleaning Using Hadoop. UC Irvine

Size: px

Start display at page:

Download "Large-Scale Data Cleaning Using Hadoop. UC Irvine"

Anis Atkins
10 years ago
Views:

1 Chen Li UC Irvine Joint work with Michael Carey, Alexander Behm, Shengyue Ji, Rares Vernica 1

2 Overview Importance of information Importance of information quality Data cleaning Large scale Hadoop 2

3 Data Quality Example 1 Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarrzeneger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime 3

4 Data Quality Case 2 4

5 Outline Three data cleaning applications Document Cleaning Fuzzy string search Record linkage (with preliminary results) Conclusion 5

6 Data Cleaning Application 1 Document Cleaning 6

7 Wikipedia: Wikify project 7

8 Wiki Linking Policies Do not link terms that most readers are familiar with Neither too many nor too few links How to detect violations automatically? 8

9 An example wiki page No links at all! 9

10 Another wiki Too many trivial links! 10

11 Challenge Corpus-wide analysis of links and entities Implicit it join operations on page entities (even fuzzy) Scale: millions of pages (> 3M for English) # of entities/links could be even larger (even more) Plan: Hadoop 11

fuzzy) Scale: millions of pages (> 3M for English) # of

12 Data Cleaning Application 2 Fuzzy string search 12

13 Example: Web Search Actual queries gathered by Google Errors in queries Errors in data Bring query and meaningful results closer together 13

14 Fuzzy String Search: Formulation Find strings similar to a given string Functions: edit distance, jaccard, cosine, etc. Performance is important! -10 ms: 100 queries per second (QPS) - 5 ms: 200 QPS 14

15 q-grams of strings u n i v e r s a l 2-grams 15

16 Similar strings have many common grams universal u m i v e r s s l 16

17 q-gram inverted lists id strings 0 rich 1 stick 2 stich 3 stuck 4 static 2-grams at ch ck ic ri st ta ti tu uc

18 T-occurrence Problem Merge Find elements whose occurrences T 18

19 Optimizing i i inverted index Compressing inverted lists Using variable-length grams All these techniques need a cost-based analysis on the effects of grams on query performance Computationally expensive 19

20 Challenge: Scale! PubMed publications:18 million records GeneBank: 108 million sequences Google frequent Web word tokens: 1 trillion Parallel computing (Hadoop) to the rescue! 20

21 Data Cleaning Application 3 Record Linkage with preliminary results 21

22 Record Linkage Phone Age Name Brad Pitt Arnold Schwarzeneger George Bush Angelina Jolie Forrest Whittaker No exact match! Name Hobbies Address Brad Pitt Forest Whittacker George Bush Angelina Jolie Arnold Schwarzenegger 22

23 Fuzzy Join RID A B RID A B dist(t1.a,t2.a) k T1 T2 Self-Join : T1 = T2 23

24 Two Critical Steps Step 1: Sorting grams based on frequencies Step 2: Computing record ID pairs with common grams 24

25 Stage 1 Phase 1 25

26 Stage 1 Phase 2 26

27 Limitations: Step1 Analysis Second phase is just to sort the tokens Using one Reduce (not parallelizable) A possible optimization: Eliminate the second MR phase Sort in the Reduce of Phase 1 Result: 10M records, 10 nodes, 2-MR algorithm: 81 seconds 1-MR: 82 seconds 27

28 Step1: Lesson Learned Extra steps are needed to get right key-value pairs 1-MR solution might not be faster than 2-MR solution 28

29 Stage 2 Phase 1 29

30 Stage 2 Phase 2 30

31 Experiments 31

32 Outline Three data cleaning applications Document Cleaning Fuzzy string search Record linkage (with preliminary results) Conclusion 32

33 UCI ASTERIX Project Managing large amounts of semi-structured data using gparallel computing $2.7M from NSF (thanks, Jim!) Other campuses: UCR and UCSD 33

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad