BIG DATA INTEGRATION AT SCADS DRESDEN/LEIPZIG

Similar documents

BIG DATA INTEGRATION RESEARCH AT THE UNIVERSITY OF LEIPZIG

Learning-based Entity Resolution with MapReduce

SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG

Private Record Linkage with Bloom Filters

SEMI AUTOMATIC DATA CLEANING FROM MULTISOURCES BASED ON SEMANTIC HETEROGENOUS

High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Privacy Aspects in Big Data Integration: Challenges and Opportunities

Distributed Computing and Big Data: Hadoop and MapReduce

DduP Towards a Deduplication Framework utilising Apache Spark

Load-Balancing the Distance Computations in Record Linkage

Fast Analytics on Big Data with H20

Big Data Mining Services and Knowledge Discovery Applications on Clouds

High Performance CUDA Accelerated Local Optimization in Traveling Salesman Problem

Advanced record linkage methods: scalability, classification, and privacy

Scalable Developments for Big Data Analytics in Remote Sensing

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

German Record Linkage Center

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Big Data Analytics and Healthcare

A Load-Balanced MapReduce Algorithm for Blocking-based Entity-resolution with Multiple Keys

Industry 4.0 and Big Data

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Introduction to Data Mining

Chapter 7. Using Hadoop Cluster and MapReduce

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Classification On The Clouds Using MapReduce

LDIF - Linked Data Integration Framework

Hadoop Parallel Data Processing

Analysis Tools and Libraries for BigData

Bringing Big Data Modelling into the Hands of Domain Experts

Big Data Analytics. Prof. Dr. Lars Schmidt-Thieme

A Service for Data-Intensive Computations on Virtual Clusters

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

An Overview of Knowledge Discovery Database and Data mining Techniques

Stream Processing on GPUs Using Distributed Multimedia Middleware

Big Data and Analytics: Challenges and Opportunities

COMP9321 Web Application Engineering

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research

Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro

The basic data mining algorithms introduced may be enhanced in a number of ways.

How To Balance In Cloud Computing

CS455 - Lab 10. Thilina Buddhika. April 6, 2015

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Mammoth Scale Machine Learning!

1 st Symposium on Colossal Data and Networking (CDAN-2016) March 18-19, 2016 Medicaps Group of Institutions, Indore, India

2015 The MathWorks, Inc. 1

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Journée Thématique Big Data 13/03/2015

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

Study of Data Cleaning & Comparison of Data Cleaning Tools

What is Analytic Infrastructure and Why Should You Care?

Parallel Computing. Benson Muite. benson.

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

16.1 MAPREDUCE. For personal use only, not for distribution. 333

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

Packet-based Network Traffic Monitoring and Analysis with GPUs

Recent and Future Activities in HPC and Scientific Data Management Siegfried Benkner

Big Data Challenges in Bioinformatics

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

NSEC3 Hash Breaking. GPU-based. Matthäus Wander, Lorenz Schwittmann, Christopher Boelmann, Torben Weis IEEE NCA

MapGraph. A High Level API for Fast Development of High Performance Graphic Analytics on GPUs.

Secure Computation Martin Beck

CSE-E5430 Scalable Cloud Computing Lecture 2

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Data Centric Systems (DCS)

Lecture 10 - Functional programming: Hadoop and MapReduce

Luncheon Webinar Series May 13, 2013

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

SQL Server 2005 Features Comparison

HiBench Introduction. Carson Wang Software & Services Group

Delivering Smart Answers!

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Data Warehousing und Data Mining

Spam Detection Using Customized SimHash Function

Accelerating variant calling

Protein Protein Interaction Networks

High Productivity Data Processing Analytics Methods with Applications

Exploiting Cloud Heterogeneity for Optimized Cost/Performance MapReduce Processing

Analytics on Big Data

Transcription:

BIG DATA INTEGRATION AT SCADS DRESDEN/LEIPZIG ERHARD RAHM, UNIV. LEIPZIG www.scads.de

GERMAN CENTERS FOR BIG DATA Two Centers of Excellence for Big Data in Germany ScaDS Dresden/Leipzig Berlin Big Data Center (BBDC) ScaDS Dresden/Leipzig (Competence Center for Scalable Data Services and Solutions Dresden/Leipzig) scientific coordinators: Nagel (TUD), Rahm (UL) start: Oct. 2014 duration: 4 years (option for 3 more years) initial funding: ca. 5.6 Mio. Euro 2

GOALS Bundling and advancement of existing expertise on Big Data Development of Big Data Services and Solutions Big Data Innovations Leipzig Dresden 3

FUNDED INSTITUTES Univ. Leipzig TU Dresden Leibniz Institute of Ecological Urban and Regional Development Max-Planck Institute for Molecular Cell Biology and Genetics 4

ASSOCIATED PARTNERS Avantgarde-Labs GmbH Data Virtuality GmbH E-Commerce Genossenschaft e. G. European Centre for Emerging Materials and Processes Dresden Fraunhofer-Institut für Verkehrs- und Infrastruktursysteme Fraunhofer-Institut für Werkstoff- und Strahltechnik GISA GmbH Helmholtz-Zentrum Dresden - Rossendorf Hochschule für Telekommunikation Leipzig Institut für Angewandte Informatik e. V. Landesamt für Umwelt, Landwirtschaft und Geologie Netzwerk Logistik Leipzig-Halle e. V. Sächsische Landesbibliothek Staatsund Universitätsbibliothek Dresden Scionics Computer Innovation GmbH Technische Universität Chemnitz Universitätsklinikum Carl Gustav Carus 5

GROBSTRUKTUR DES ZENTRUMS Life sciences Material and Engineering sciences Environmental / Geo sciences Digital Humanities Business Data Service center Big Data Life Cycle Management and Workflows Data Quality / Data Integration Knowledge Extraktion Visual Analytics Efficient Big Data Architectures 6

RESEARCH PARTNERS Data-intensive computing W.E. Nagel Data quality / Data integration E. Rahm Databases W. Lehner, E. Rahm Knowledge extraction/data mining C. Rother, P. Stadler, G. Heyer Visualization S. Gumhold, G. Scheuermann Service Engineering, Infrastructure K.-P. Fähnrich, W.E. Nagel, M. Bogdan 7

APPLICATION CORRDINATORS Life sciences G. Myers Material / Engineering sciences M. Gude Environmental / Geo sciences J. Schanze Digital Humanities G. Heyer Business Data B. Franczyk 8

AGENDA ScaDS Dresden/Leipzig Big Data Integration Introduction Matching product offers from web shops DeDoop: Deduplication with Hadoop Privacy-preserving record linkage with PP-Join Cryptographic bloom filters Privacy-Preserving PP-Join (P4Join) GPU-based implementation Summary and outlook References 9

Heterogeneity Volume Velocity Privacy Human collaboration BIG DATA ANALYSIS PIPELINE Data aquisition Data extraction / cleaning Data integration/ annotation Data analysis and visualization Interpretation 10

OBJECT MATCHING (DEDUPLICATION) Identification of semantically equivalent objects within one data source or between different sources Original focus on structured (relational) data, e.g. customer data Cno LastName FirstName Gender Address Phone/Fax 24 Smith Christoph M 23 Harley St, Chicago IL, 60633-2394 493 Smith Kris L. F 2 Hurley Place, South Fork MN, 48503-5998 333-222-6542 / 333-222-6599 444-555-6666 CID Name Street City Sex 11 Kristen Smith 2 Hurley Pl South Fork, MN 48503 0 24 Christian Smith Hurley St 2 S Fork MN 1 11

BIG DATA INTEGRATION USE CASE INTEGRATION OF PRODUCT OFFERS IN COMPARISON PORTAL Thousands of data sources (shops/merchants) Millions of products and product offers Continous changes Many similar, but different products Low data quality 12

USE OF PRODUCT CODES Frequent existence of specific product codes for certain products Product code = manufacturer-specific identifier any sequence consisting of alphabetic, special, and numeric characters split by an arbitrary number of white spaces. Utilize to differentiate similar but different products. Canon VIXIA HF S100 Camcorder - 1080p - 8.59 MP Hahnel HL-XF51 7.2V 680mAh for Sony NP-FF51 13

Web Verification PRODUCT CODE EXTRACTION Hahnel HL-XF51 7.2V 680mAh for Sony NP-FF51 7.2V 680mAh Features Hahnel HL-XF51 for Sony NP-FF51 Tokens Hahnel HL-XF51 Sony NP-FF51 Filtered Tokens HL-XF51 NP-FF51 Candidates [A-Z]{2}\-[A-Z]{2}[0-9]{2} 14

LEARNING-BASED MATCH APPROACH Pre-processing Training Product Code Extraction Training Data Selection Matcher Application Classifier Learning Product Offers Manufacturer Cleaning Application Classifier Automatic Classification Blocking (Manufacturer + Category) Matcher Application Classification Product Match Result 15

HOW TO SPEED UP OBJECT MATCHING? Blocking to reduce search space group similar objects within blocks based on blocking key restrict object matching to objects from the same block Parallelization split match computation in sub-tasks to be executed in parallel exploitation of Big Data infrastructures such as Hadoop (Map/Reduce or variations) 16

Re-Partitioning Grouping Grouping Grouping GENERAL OBJECT MATCHING WORKFLOW R S Blocking Similarity Computation Match Classification M R S Map Phase: Blocking Reduce Phase: Matching 17

LOAD BALANCING Data skew leads to unbalanced workload Large blocks prevent utilization of more than a few nodes Deteriorates scalability and efficiency Unnecessary costs (you also pay for underutilized machines!) Key ideas for load balancing Additional MR job to determine blocking key distribution, i.e., number and size of blocks (per input partition) Global load balancing that assigns (nearly) the same number of pairs to reduce tasks Simplest approach : BlockSplit (ICDE2012) split large blocks into sub-blocks with multiple match tasks distribute the match tasks among multiple reduce tasks 18

BLOCK SPLIT: 1 SLIDE ILLUSTRATION Example: 3 MP3 players + 6 cell phones 18 pairs (1 time unit) Parallel matching on 2 (reduce) nodes naiive approach BlockSplit 3 pairs (16%) 3 4 3 pairs 6 pairs 9 pairs (50%) 15 pairs (84%) 2 2 4 1 pair 8 pairs 9 pairs (50%) Speedup: 18/15=1.2 Speedup: 2 19

BLOCK SPLIT EVALUATION: SCALABILITY Evaluation on Amazon EC infrastructure using Hadoop Matching of 114.000 product records 20

DEDOOP: EFFICIENT DEDUPLICATION WITH HADOOP Parallel execution of data integration/ match workflows with Hadoop Powerful library of match and blocking techniques Learning-based configuration GUI-based workflow specification Automatic generation and execution of Map/Reduce jobs on different clusters This tool by far shows the most mature use of MapReduce for data deduplication www.hadoopsphere.com Automatic load balancing for optimal scalability Iterative computation of transitive closure (extension of MR-CC) 21

Core DEDOOP OVERVIEW General ER workflow T R S [0,1] Machine Learning R S Blocking Similarity Computation Match Classification M R S Classifier Training Job Data Analysis Job Dedoop s general MapReduce workflow Blocking-based Matching Job Decision Tree Logistic Regression SVM Standard Blocking Sorted Neighborhood PPJoin+ Blocking Key Generators Prefix Token-based Edit Distance n-gram TFIDF Threshold Match rules ML model 22

AGENDA ScaDS Dresden/Leipzig Big Data Integration Introduction Matching product offers from web shops DeDoop: Deduplication with Hadoop Privacy-preserving record linkage with PP-Join Cryptographic bloom filters Privacy-Preserving PP-Join (P4Join) GPU-based implementation Summary and outlook References 23

PRIVACY-PRESERVING RECORD LINKAGE Object matching with encrypted data to preserve privacy data exchange / integration for person-related data many use cases: medicine (e.g., cancer registries), census, numerous PPRL approaches (Vatsalan et al., 2013), some requiring trustee or secure multi-party protocol scalability problem for large datasets (e.g., for census purposes) 24

PPRL WITH BLOOM FILTERS effective and simple approach uses cryptographic bloom filters (Schnell et al, 2009) tokenize all match-relevant attribute values, e.g. using bigrams or trigrams typical attributes: first name, last name (at birth), sex, date of birth, country of birth, place of birth map each token with a family of hash functions to fixed-size bit vector (fingerprint) original data cannot be reconstructed match of bit vectors (Jaccard similarity) is good approximation of true match result 25

SIMILARITY COMPUTATION - EXAMPLE thomas thoman tho hom oma mas tho hom oma man tho hom oma 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 tho hom oma 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 mas h1(mas)= 3 h2(mas)= 7 h3(mas)= 11 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 man h1(man)= 2 h2(man)= 0 h3(man)= 13 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 0 1 1 1 0 1 0 0 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 0 1 Sim Jaccard (r1, r2) = (r1 ᴧ r2) / (r1 ᴠ r2) Sim Jaccard (r1, r2) = 7/11 26

PP-JOIN: POSITION PREFIX JOIN (XIAO ET AL, 2008) one of the most efficient similarity join algorithms determine all pairs of records with sim Jaccard (x,y) t use of filter techniques to reduce search space length, prefix, and position filter relatively easy to run in parallel good candidate to improve scalability for PPRL 27

LENGTH FILTER matching records pairs must have similar lengths Sim Jaccard (x, y) t x y t length / cardinality: number of tokens for strings, number of bits for bit vectors Example for minimal similarity t = 0,8: records tokens (bigrams) length r1=tom [_t, to, om, m_] 4 r2=thomas [_t, th, ho, om, ma, as, s_] 7 r4=tommy [_t, to, om, mm, my, y_] 6 r1 r2 t? 4 5,6? no r4 r2 t? 6 5,6? yes 28

Similar records must have a minimal overlap α in their sets of tokens (or set bit positions) Prefix filter approximates this test order all tokens (bit positions) for all records according to their overall frequency from infrequent to frequent exclude pairs of records without any overlap in their prefixes with Example (t = 0.8) r2=thomas r3=tomas r4=tommy PREFIX FILTER Sim Jaccard (x, y) t Overlap(x, y) α = é( t ( x ) + y ) ) 1+t prefix_length(x) = é ( (1-t) x ) + 1 records sorted tokens prefix length Prefix [ho, th, ma, as, s_, _t, om] [ma, as, s_, to, _t, om] [mm, my, y_, to, -t, om] 3 3 3 [ho, th, ma] [ma, as, s_] [mm, my, y_] prefix(r2) prefix(r3) ={ma} {} prefix(r2) prefix(r4) = {} prefix(r3) prefix(r4) = {} 29

PRIVACY-PRESERVING PP-JOIN (P4JOIN) evaluate overlap of set positions in bit vectors Preprocessing phase determine frequency per bit positions and reorder all bit vectors according to the overall frequency of bit positions determine length and prefix per bit vector sort all bit vectors in ascending order of their length (number of set positions) Match phase (sequential scan) for each record apply length filter to determine window of relevant records to match with apply prefix filter (AND operation on prefix) to exclude record pairs without prefix overlap apply position filter for further savings 30

P4JOIN: PREPROCESSING (1) records (id, bit vector/ fingerprint) ID fingerprint card. tokens (set positions) A 1 0 1 1 0 0 1 0 0 1 0 1 0 1 1 0 8 0 2 3 6 9 11 13 14 B C 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 0 4 7 0 1 5 9 0 2 6 9 11 13 14 determine frequency ordering O f frequency 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 3 1 2 1 0 1 2 0 0 3 0 2 0 2 2 0 count #occurences per index position O f 1 3 5 2 6 11 13 14 0 9 1 1 1 2 2 2 2 2 3 3 sort positions in ascending frequency order (ignore unused positions) 31

PPPP-JOIN: PREPROCESSING (2) reorder fingerprints according to O f ID fingerprint card. tokens (set position) sorted tokens A 1 0 1 1 0 0 1 0 0 1 0 1 0 1 1 0 8 0 2 3 6 9 11 13 14 3 2 6 11 13 14 0 9 B 1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 4 0 1 5 9 1 5 0 9 C 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 0 7 0 2 6 9 11 13 14 2 6 11 13 14 0 9 1 3 5 2 6 11 13 14 0 9 O f 1 1 1 2 2 2 2 2 3 3 ID reordered fingerprint card. continue with reordered fingerprints A B 0 1 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 8 4 C 0 0 0 1 1 1 1 1 1 1 0 0 0 7 1 3 5 2 6 11 13 14 0 9 32

P4JOIN: PREPROCESSING (3) sort records by length (cardinality) and determine prefixes prefix_length(x) = é ( (1-t) x ) + 1 ID reordered fingerprint card. prefix length prefix fingerprint B 1 0 1 0 0 0 0 0 1 1 0 0 0 0 4 2 1 0 1 C A 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 7 8 3 3 0 0 0 1 1 1 0 1 0 1 1 33

P4JOIN: APPLY LENGTH FILTER compare records ordered by length ID reordered fingerprint card. B C A 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 4 7 8 length filter 7 * 0.8 = 5.6 when reading record C it is observed that it does not meet the length filter w.r.t. B -> record B ( B = 4) can be excluded from all further comparisons ID reordered fingerprint card. length filter B 1 0 1 0 0 0 0 0 1 1 0 0 0 4 C A 0 0 0 1 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 7 8 8 * 0.8 = 6.4 record A still needs to be considered w.r.t. C due to similar length 34

P4JOIN: PREFIX FILTER only records with overlapping prefix need to be matched AND operation on prefix fingerprints ID reordered fingerprint card. prefix 0 0 0 fingerprint 1 1 1 B 1 0 1 0 0 0 0 0 1 1 0 0 0 0 4 1 0 1 C 0 0 0 1 1 1 1 1 1 1 0 0 0 0 7 0 0 0 1 1 1 A 0 1 0 1 1 1 1 1 1 1 0 0 0 0 8 0 1 0 1 1 AND operation on prefixes shows non-zero result for C and C so that these records still need to be considered for matching 35

P4JOIN: POSITION FILTER improvement of prefix filter to avoid matches even for overlapping prefixes estimate maximally possible overlap and checking whether it is below the minimal overlap α to meet threshold t original position filter considers the position of the last common prefix token revised position filter record x, prefix 1 1 0 1 length 9 record y, prefix 1 1 1 length 8 highest prefix position (here fourth pos. in x) limits possible overlap with other record: the third position in y prefix cannot have an overlap with x maximal possible overlap = #shared prefix tokens (2) + min (9-3, 8-3)= 7 < minimal overlap α = 8 36

EVALUATION comparison between NestedLoop, P4Join, MultiBitTree MultiBitTree: best filter approach in previous work by Schnell applies length filter and organizes fingerprints within a binary tree so that fingerprints with the same set bits are grouped within sub-trees can be used to filter out many fingerprints from comparison two input datasets R, S determined with FEBRL data generator N=[100.000, 200.000,, 500.000]. R =1/5 N, S =4/5 N bit vector length: 1000 similarity threshold 0.8 37

EVALUATION RESULTS runtime in minutes on standard PC Approach Dataset size N 100.000 200.000 300.000 400.000 500.000 NestedLoop 6,10 27,68 66,07 122,02 194,77 MultiBitTree 4,68 18,95 40,63 78,23 119,73 P4 Length filter only 3,38 20,53 46,48 88,33 140,73 P4 Length+Prefix 3,77 22,98 52,95 99,72 159,22 P4 Length+Prefix+Position 2,25 15,50 40,05 77,80 125,52 similar results for P4Join and Multibit Tree relatively small improvements compared to NestedLoop 38

GPU-BASED PPRL Operations on bit vectors easy to compute on GPUs Length and prefix filters Jaccard similarity Frameworks CUDA und OpenCL support data-parallel execution of general computations on GPUs program ( kernel ) written in C dialect limited to base data types (float, long, int, short, arrays) no dynamic memory allocation (programmer controls memory management) important to minimize data transfer between main memory and GPU memory 39

card prefix card prefix card prefix card prefix bits matches bits bits bits EXECUTION SCHEME partition inputs R and S (fingerprints sorted by length) into equallysized partitions that fit into GPU memory generate match tasks per pair of partition only transfer to GPU if length intervals per partition meet length filter optional use of CPU thread to additionally match on CPU S Match task GPU change S 3 with S 4 R 0 S 0 S 1 S 2 S 3 S 4 GPU thread R 0 S 4 Replace S 3 with S 4 R 0 S 3 M R 0,S3 Kernel 0 r 0 -S 4 R R 1 R 2 R 3 Read M R 0,S4 r R 0-1-S 4 Kernel R 0-1 CPU thread(s) Main memory (host) GPU memory 40

GPU-BASED EVALUATION RESULTS GeForce GT 610 48 Cuda Cores@810MHz 1GB 35 GeForce GT 540M 96 Cuda Cores@672MHz 1GB 100.000 200.000 300.000 400.000 500.000 GForce GT 610 0,33 1,32 2,95 5,23 8,15 GeForce GT 540M 0,28 1,08 2,41 4,28 6,67 improvements by up to a factor of 20, despite low-profile graphic cards still non-linear increase in execution time with growing data volume 41

AGENDA ScaDS Dresden/Leipzig Big Data Integration Introduction Matching product offers from web shops DeDoop: Deduplication with Hadoop Privacy-preserving record linkage with PP-Join Cryptographic bloom filters Privacy-Preserving PP-Join (P4Join) GPU-based implementation Summary and outlook References 42

SUMMARY ScaDS Dresden/Leipzig Research focus on data integration, knowledge extraction, visual analytics broad application areas (scientific + business-related) solution classes for applications with similar requirements Big Data Integration Big data poses new requirements for data integration (variety, volume, velocity, veracity) comprehensive data preprocessinga and cleaning Hadoop-based approaches for improved scalability, e.g. Dedoop Usabilty: machine-learning approaches, GUI, monitoring 43

SUMMARY (2) Privacy-Preserving Record Linkage increasingly important tp protect personal information Scalability issues for Big Data Bloom filters allow simple, effective and relatively efficient match approach still scalability issues for Big Data -> reduce search space and apply parallel processing Privacy-preserving PP-Join (P4JOIN) relatively easy adoption for bit vectors with improved position filter comparable performance to Multibit trees but easier to parallelize GPU version achieves significant speedup further improvements needed to reduce quadratic complexity 44

OUTLOOK Parallel execution of more diverse data integration workflows for text data, image data, sensor data, etc. Learning-based configuration to minimize manual effort (active learning, crowd-sourcing) Holistic integration of many data sources (data + metadata) Clustering across many sources N-way merging of related ontologies (e.g. product taxonomies) Realtime data enrichment and integration for sensor data Improved privacy-preserving record linkage 45

REFERENCES H. Köpcke, A. Thor, S. Thomas, E. Rahm: Tailoring entity resolution for matching product offers. Proc. EDBT 2012: 545-550 L. Kolb, E. Rahm: Parallel Entity Resolution with Dedoop. Datenbank-Spektrum 13(1): 23-32 (2013) L. Kolb, A. Thor, E. Rahm: Dedoop: Efficient Deduplication with Hadoop. PVLDB 5(12), 2012 L. Kolb, A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012: 618-629 L. Kolb, Z. Sehili, E. Rahm: Iterative Computation of Connected Graph Components with MapReduce. Datenbank-Spektrum 14(2): 107-117 (2014) E. Rahm, W.E. Nagel: ScaDS Dresden/Leipzig: Ein serviceorientiertes Kompetenzzentrum für Big Data. Proc. GI- Jahrestagung 2014: 717 R.Schnell, T. Bachteler, J. Reiher: Privacy-preserving record linkage using Bloom filters. BMC Med. Inf. & Decision Making 9: 41 (2009) Z. Sehili, L. Kolb, C. Borgs, R. Schnell, E. Rahm: Privacy Preserving Record Linkage with PPJoin. Proc. BTW Conf. 2015 (to appear) D. Vatsalan, P. Christen, V. S. Verykios: A taxonomy of privacy-preserving record linkage techniques. Information Syst. 38(6): 946-969 (2013) C. Xiao, W. Wang, X. Lin, J.X. Yu: Efficient Similarity Joins for Near Duplicate Detection. Proc. WWW 2008 46