Large-Scale Data Cleaning Using Hadoop. UC Irvine



Similar documents
CSCI 5417 Information Retrieval Systems Jim Martin!

Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

Introduction to Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Introduction to Parallel Programming and MapReduce

Introduction to Hadoop

Infrastructures for big data

Databases 2 (VU) ( )

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

Distributed Computing and Big Data: Hadoop and MapReduce

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Performance and Energy Efficiency of. Hadoop deployment models

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Hadoop and Map-reduce computing

Scalable Machine Learning - or what to do with all that Big Data infrastructure

CS 378 Big Data Programming

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Inverted Indexes: Trading Precision for Efficiency

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Teaching Scheme Credits Assigned Course Code Course Hrs./Week. BEITC802 Big Data Analytics. Theory Marks

Developing MapReduce Programs

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Big Data Storage: Should We Pop the (Software) Stack? Michael Carey Information Systems Group CS Department UC Irvine. #AsterixDB

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

Implementing Heuristic Miner for Different Types of Event Logs

Data Mining in the Swamp

Distributed Apriori in Hadoop MapReduce Framework

Big Data and Scripting map/reduce in Hadoop

Image Compression through DCT and Huffman Coding Technique

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Cloud Computing Summary and Preparation for Examination

Cloud Computing at Google. Architecture

Data Intensive Computing Handout 6 Hadoop

Answering Approximate String Queries on Large Data Sets Using External Memory

How To Improve Performance In A Database

Introduction to Hadoop

What s next for the Berkeley Data Analytics Stack?

Index support for regular expression search. Alexander Korotkov PGCon 2012, Ottawa

Why is Internal Audit so Hard?

DATA CLUSTERING USING MAPREDUCE

Data Intensive Computing Handout 5 Hadoop

Efficient Processing of XML Documents in Hadoop Map Reduce

Big Data Platforms: What s Next?

Management & Analysis of Big Data in Zenith Team

CIEL A universal execution engine for distributed data-flow computing

International Journal of Innovative Research in Computer and Communication Engineering

Comparison of Different Implementation of Inverted Indexes in Hadoop

Integrating NLTK with the Hadoop Map Reduce Framework Human Language Technology Project

Processing Joins over Big Data in MapReduce

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Turning Big Data into Big Decisions Delivering on the High Demand for Data

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

THE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM

How To Scale Out Of A Nosql Database

Big Data and Apache Hadoop s MapReduce

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

MapReduce and Hadoop Distributed File System V I J A Y R A O

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Similarity Search in a Very Large Scale Using Hadoop and HBase

ProM 6 Exercises. J.C.A.M. (Joos) Buijs and J.J.C.L. (Jan) Vogelaar {j.c.a.m.buijs,j.j.c.l.vogelaar}@tue.nl. August 2010

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Lecture Data Warehouse Systems

COSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring Mahout

Cloudera Certified Developer for Apache Hadoop

Big Data with Rough Set Using Map- Reduce

Map-like Wikipedia Visualization. Pang Cheong Iao. Master of Science in Software Engineering

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Project Report BIG-DATA CONTENT RETRIEVAL, STORAGE AND ANALYSIS FOUNDATIONS OF DATA-INTENSIVE COMPUTING. Masters in Computer Science

SEO AND CONTENT MANAGEMENT SYSTEM

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams

Computing Issues for Big Data Theory, Systems, and Applications

A Comparison of Approaches to Large-Scale Data Analysis

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Dynamical Clustering of Personalized Web Search Results

Large-Scale Test Mining

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop WordCount Explained! IT332 Distributed Systems

Can the Elephants Handle the NoSQL Onslaught?

Movie Classification Using k-means and Hierarchical Clustering

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Load-Balancing the Distance Computations in Record Linkage

Fuzzy Multi-Join and Top-K Query Model for Search-As-You-Type in Multiple Tables

Transcription:

Chen Li UC Irvine Joint work with Michael Carey, Alexander Behm, Shengyue Ji, Rares Vernica 1

Overview Importance of information Importance of information quality Data cleaning Large scale Hadoop 2

Data Quality Example 1 Star Title Year Genre Keanu Reeves The Matrix 1999 Sci-Fi Samuel Jackson Iron man 2008 Sci-Fi Schwarrzeneger The Terminator 1984 Sci-Fi Samuel Jackson The man 2006 Crime 3

Data Quality Case 2 4

Outline Three data cleaning applications Document Cleaning Fuzzy string search Record linkage (with preliminary results) Conclusion 5

Data Cleaning Application 1 Document Cleaning 6

Wikipedia: Wikify project 7

Wiki Linking Policies Do not link terms that most readers are familiar with Neither too many nor too few links How to detect violations automatically? 8

An example wiki page No links at all! http://en.wikipedia.org/wiki/st_antony%27s_forane_church_,_kurumpanadam 9

Another wiki Too many trivial links! http://en.wikipedia.org/wiki/lacamas_lake 10

Challenge Corpus-wide analysis of links and entities Implicit it join operations on page entities (even fuzzy) Scale: millions of pages (> 3M for English) # of entities/links could be even larger (even more) Plan: Hadoop 11

Data Cleaning Application 2 Fuzzy string search 12

Example: Web Search Actual queries gathered by Google Errors in queries Errors in data Bring query and meaningful results closer together http://www.google.com/jobs/britney.html 13

Fuzzy String Search: Formulation Find strings similar to a given string Functions: edit distance, jaccard, cosine, etc. Performance is important! -10 ms: 100 queries per second (QPS) - 5 ms: 200 QPS 14

q-grams of strings u n i v e r s a l 2-grams 15

Similar strings have many common grams universal u m i v e r s s l 16

q-gram inverted lists id strings 0 rich 1 stick 2 stich 3 stuck 4 static 2-grams at ch ck ic ri st ta ti tu uc 3 4 0 2 1 3 0 1 2 4 0 1 2 3 4 4 1 2 4 3 17

T-occurrence Problem Merge Find elements whose occurrences T 18

Optimizing i i inverted index Compressing inverted lists Using variable-length grams All these techniques need a cost-based analysis on the effects of grams on query performance Computationally expensive 19

Challenge: Scale! PubMed publications:18 million records GeneBank: 108 million sequences Google frequent Web word tokens: 1 trillion Parallel computing (Hadoop) to the rescue! 20

Data Cleaning Application 3 Record Linkage with preliminary results 21

Record Linkage Phone Age Name Brad Pitt Arnold Schwarzeneger George Bush Angelina Jolie Forrest Whittaker No exact match! Name Hobbies Address Brad Pitt Forest Whittacker George Bush Angelina Jolie Arnold Schwarzenegger 22

Fuzzy Join RID A B RID A B dist(t1.a,t2.a) k T1 T2 Self-Join : T1 = T2 23

Two Critical Steps Step 1: Sorting grams based on frequencies Step 2: Computing record ID pairs with common grams 24

Stage 1 Phase 1 25

Stage 1 Phase 2 26

Limitations: Step1 Analysis Second phase is just to sort the tokens Using one Reduce (not parallelizable) A possible optimization: Eliminate the second MR phase Sort in the Reduce of Phase 1 Result: 10M records, 10 nodes, 2-MR algorithm: 81 seconds 1-MR: 82 seconds 27

Step1: Lesson Learned Extra steps are needed to get right key-value pairs 1-MR solution might not be faster than 2-MR solution 28

Stage 2 Phase 1 29

Stage 2 Phase 2 30

Experiments 31

Outline Three data cleaning applications Document Cleaning Fuzzy string search Record linkage (with preliminary results) Conclusion 32

UCI ASTERIX Project Managing large amounts of semi-structured data using gparallel computing $2.7M from NSF (thanks, Jim!) Other campuses: UCR and UCSD http://asterix.ics.uci.edu/ 33