BIG DATA INTEGRATION RESEARCH AT THE UNIVERSITY OF LEIPZIG ERHARD RAHM, UNIV. LEIPZIG www.scads.de
UNIVERSITY OF LEIPZIG Founded in 1409 Now about 30,000 students in 14 faculties Computer science: 13 professorships and 2 junior professorships, 150 PhD students and postdocs (120 funded by third parties)
GERMAN CENTERS FOR BIG DATA Two Centers of Excellence for Big Data in Germany: ScaDS Dresden/Leipzig and the Berlin Big Data Center (BBDC) ScaDS Dresden/Leipzig (Competence Center for Scalable Data Services and Solutions Dresden/Leipzig) scientific coordinators: Nagel (TUD), Rahm (UL) start: Oct. 2014 duration: 4 years (option for 3 more years) initial funding: ca. 5.6 million Euro
GOALS Bundling and advancement of existing expertise on Big Data Development of Big Data Services and Solutions Big Data Innovations
FUNDED INSTITUTES Univ. Leipzig TU Dresden Leibniz Institute of Ecological Urban and Regional Development Max Planck Institute of Molecular Cell Biology and Genetics
ASSOCIATED PARTNERS Avantgarde-Labs GmbH Data Virtuality GmbH E-Commerce Genossenschaft e. G. European Centre for Emerging Materials and Processes Dresden Fraunhofer-Institut für Verkehrs- und Infrastruktursysteme Fraunhofer-Institut für Werkstoff- und Strahltechnik GISA GmbH Helmholtz-Zentrum Dresden-Rossendorf Hochschule für Telekommunikation Leipzig Institut für Angewandte Informatik e. V. Landesamt für Umwelt, Landwirtschaft und Geologie Netzwerk Logistik Leipzig-Halle e. V. Sächsische Landesbibliothek – Staats- und Universitätsbibliothek Dresden Scionics Computer Innovation GmbH Technische Universität Chemnitz Universitätsklinikum Carl Gustav Carus
STRUCTURE OF THE CENTER Application areas: Life sciences, Material and Engineering sciences, Environmental / Geo sciences, Digital Humanities, Business Data Data Service center Research areas: Big Data Life Cycle Management and Workflows, Data Quality / Data Integration, Knowledge Extraction, Visual Analytics, Efficient Big Data Architectures
RESEARCH PARTNERS Data-intensive computing W.E. Nagel Data quality / Data integration E. Rahm Databases W. Lehner, E. Rahm Knowledge extraction/data mining C. Rother, P. Stadler, G. Heyer Visualization S. Gumhold, G. Scheuermann Service Engineering, Infrastructure K.-P. Fähnrich, W.E. Nagel, M. Bogdan
APPLICATION COORDINATORS Life sciences G. Myers Material / Engineering sciences M. Gude Environmental / Geo sciences J. Schanze Digital Humanities G. Heyer Business Data B. Franczyk
AGENDA ScaDS Dresden/Leipzig Big Data Integration Introduction Matching product offers from web shops DeDoop: Deduplication with Hadoop Privacy-preserving record linkage with PP-Join Cryptographic bloom filters Privacy-Preserving PP-Join (P4Join) GPU-based implementation Big Graph Data Graph-based Business Intelligence with BIIIG GraDoop: Hadoop-based data management and analysis Summary and outlook
BIG DATA ANALYSIS PIPELINE Data acquisition Data extraction / cleaning Data integration / annotation Data analysis and visualization Interpretation Cross-cutting challenges: Heterogeneity, Volume, Velocity, Privacy, Human collaboration
BIG DATA INTEGRATION USE CASE INTEGRATION OF PRODUCT OFFERS IN COMPARISON PORTAL Thousands of data sources (shops/merchants) Millions of products and product offers Continuous changes Many similar, but different products Low data quality
LEARNING-BASED MATCH APPROACH
- Pre-processing: product code extraction, manufacturer cleaning
- Training: training data selection, classifier learning
- Application: blocking (manufacturer + category), matcher application, automatic classification
- Output: product match result
HOW TO SPEED UP OBJECT MATCHING? Blocking to reduce the search space group similar objects within blocks based on a blocking key restrict object matching to objects from the same block Parallelization split match computation into sub-tasks to be executed in parallel exploitation of Big Data infrastructures such as Hadoop (MapReduce or variations)
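The blocking step can be sketched in a few lines of Python. This is a hypothetical illustration, not Dedoop's API; the names `block_records`, `candidate_pairs` and the sample offers are invented for the example.

```python
# Sketch of blocking: group records by a blocking key (e.g. the
# manufacturer) so that only records within the same block are
# compared pairwise.
from collections import defaultdict
from itertools import combinations

def block_records(records, key_fn):
    """Group records into blocks by their blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return blocks

def candidate_pairs(records, key_fn):
    """Yield only pairs that share a block, shrinking the search space
    from O(n^2) to the sum of squared block sizes."""
    for block in block_records(records, key_fn).values():
        yield from combinations(block, 2)

# Illustrative product offers; only the two "acme" offers form a pair.
offers = [
    {"id": 1, "manufacturer": "acme", "title": "Phone X"},
    {"id": 2, "manufacturer": "acme", "title": "Phone X 64GB"},
    {"id": 3, "manufacturer": "globex", "title": "Tablet Y"},
]
pairs = list(candidate_pairs(offers, lambda r: r["manufacturer"]))
```

In a MapReduce setting, the map phase emits the blocking key as the partitioning key, so that each reduce task receives one block and performs the pairwise matching for it.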
GENERAL OBJECT MATCHING WORKFLOW Inputs R and S → Blocking → Similarity Computation → Match Classification → match result M In MapReduce: the Map phase performs blocking (re-partitioning and grouping), the Reduce phase performs matching
DEDOOP: EFFICIENT DEDUPLICATION WITH HADOOP Parallel execution of data integration / match workflows with Hadoop Powerful library of match and blocking techniques Learning-based configuration GUI-based workflow specification Automatic generation and execution of MapReduce jobs on different clusters "This tool by far shows the most mature use of MapReduce for data deduplication" (www.hadoopsphere.com) Automatic load balancing for optimal scalability Iterative computation of transitive closure (extension of MR-CC)
DEDOOP OVERVIEW General ER workflow: inputs R and S → Blocking → Similarity Computation → Match Classification → match result M, optionally with a training set T and a machine-learned classifier
- Dedoop's general MapReduce workflow comprises three jobs: Data Analysis, Classifier Training, Blocking-based Matching
- Machine learning: Decision Tree, Logistic Regression, SVM
- Blocking: Standard Blocking, Sorted Neighborhood, PPJoin+; blocking key generators: Prefix, Token-based
- Similarity measures: Edit Distance, n-gram, TFIDF
- Match classification: similarity threshold, match rules, or ML model
AGENDA ScaDS Dresden/Leipzig Big Data Integration Introduction Matching product offers from web shops DeDoop: Deduplication with Hadoop Privacy-preserving record linkage with PP-Join Cryptographic bloom filters Privacy-Preserving PP-Join (P4Join) GPU-based implementation Big Graph Data Graph-based Business Intelligence with BIIIG GraDoop: Hadoop-based data management and analysis Summary and outlook
PRIVACY FOR BIG DATA Need for comprehensive privacy support ("privacy by design") Privacy-preserving publishing of datasets Privacy-preserving record linkage Privacy-preserving data mining Privacy-preserving record linkage: object matching with encrypted data to preserve privacy conflicting requirements: high privacy, scalability and match effectiveness use of a central linking unit (trusted third party) vs. symmetric approaches (secure multi-party computation)
PPRL WITH BLOOM FILTERS Effective and simple encryption using cryptographic Bloom filters (Schnell et al., 2009) tokenize all match-relevant attribute values, e.g. using bigrams or trigrams typical attributes: first name, last name (at birth), sex, date of birth, country of birth, place of birth map each token with a family of one-way hash functions to a fixed-size bit vector (fingerprint) the original data cannot be reconstructed matching of bit vectors (Jaccard similarity) is a good approximation of the true match result
SIMILARITY COMPUTATION - EXAMPLE
- "thomas" → trigrams tho, hom, oma, mas; "thoman" → trigrams tho, hom, oma, man
- each trigram is hashed with three hash functions h1, h2, h3 into a 16-bit vector, e.g. h1(mas) = 3, h2(mas) = 7, h3(mas) = 11 and h1(man) = 2, h2(man) = 0, h3(man) = 13
- resulting fingerprints (set bit positions): r1 = {1, 3, 4, 5, 7, 11, 12, 13, 15}, r2 = {0, 1, 2, 4, 5, 7, 12, 13, 15}
- sim_Jaccard(r1, r2) = |r1 ∧ r2| / |r1 ∨ r2| = 7/11
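The slide's computation can be reproduced in Python. The hash family below (SHA-1 with double hashing) is an illustrative assumption — the slide does not fix a concrete hash family, so the generated bit positions differ from the slide's — but the Jaccard step is checked against the slide's vectors for "thomas" and "thoman".

```python
# Minimal sketch of Bloom-filter fingerprints for PPRL.
import hashlib

def tokens(s, q=3):
    """Split a string into overlapping q-grams (trigrams by default)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def fingerprint(s, m=16, k=3):
    """Map each q-gram with k hash functions into an m-bit position set."""
    bits = set()
    for tok in tokens(s):
        h = int(hashlib.sha1(tok.encode()).hexdigest(), 16)
        for i in range(k):
            bits.add((h + i * (h >> 64)) % m)  # double-hashing scheme
    return bits

def jaccard(r1, r2):
    """Jaccard similarity of two fingerprints (sets of set bits)."""
    return len(r1 & r2) / len(r1 | r2)

# The slide's fingerprints: 7 shared bits, 11 bits in the union -> 7/11
r1 = {1, 3, 4, 5, 7, 11, 12, 13, 15}   # "thomas"
r2 = {0, 1, 2, 4, 5, 7, 12, 13, 15}    # "thoman"
```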
PP-JOIN: POSITION PREFIX JOIN (XIAO ET AL., 2008) one of the most efficient similarity join algorithms determines all pairs of records with sim_Jaccard(x, y) ≥ t use of filter techniques to reduce the search space: length, prefix, and position filter relatively easy to run in parallel good candidate to improve scalability for PPRL: evaluate set bit positions instead of (string) tokens
LENGTH FILTER Matching record pairs must have similar lengths: sim_Jaccard(x, y) ≥ t ⇒ |x| ≥ t · |y| (for |x| ≤ |y|), where the length/cardinality |x| is the number of set bits in the bit vector. Example for minimal similarity t = 0.8, with cardinalities B: 4, C: 7, A: 8: record B of length 4 cannot match C, since 7 · 0.8 = 5.6 > 4, nor any record with greater length (number of set positions), e.g. A.
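The length-filter check is a one-liner; the sketch below (function name invented for illustration) reproduces the slide's t = 0.8 example with cardinalities 4, 7 and 8.

```python
# Length filter: a pair can only reach Jaccard similarity t if the
# smaller cardinality is at least t times the larger one.
def length_filter_pass(card_x, card_y, t):
    """Return True if the pair survives the length filter."""
    lo, hi = sorted((card_x, card_y))
    return lo >= hi * t

# Slide example, t = 0.8: B has 4 set bits, C has 7, A has 8.
# 7 * 0.8 = 5.6 > 4, so B/C (and B/A) are pruned, while C/A survives.
```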
PREFIX FILTER Similar records must have a minimal overlap α in their sets of tokens (or set bit positions): sim_Jaccard(x, y) ≥ t ⇒ Overlap(x, y) ≥ α = ⌈ t/(1+t) · (|x| + |y|) ⌉. The prefix filter approximates this test: reorder the bit positions of all fingerprints by their overall frequency, from infrequent to frequent, and exclude pairs of records without any overlap in their prefixes, with prefix_length(x) = ⌈ (1−t) · |x| ⌉ + 1. In the example (t = 0.8), the AND operation on the prefixes yields a non-zero result for C and A, so these records still need to be considered for matching.
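The prefix filter can be sketched as follows. The global position ordering `order` (infrequent positions first) is assumed to be given; function names are illustrative, and the slide's formula prefix_length(x) = ⌈(1−t)·|x|⌉ + 1 is applied directly.

```python
# Prefix-filter sketch over fingerprints given as sets of bit positions.
import math

def prefix(positions, order, t):
    """Reorder a fingerprint's set positions by global frequency
    (infrequent first) and return its prefix."""
    ranked = sorted(positions, key=order.index)
    plen = math.ceil((1 - t) * len(positions)) + 1
    return set(ranked[:plen])

def prefix_filter_pass(x, y, order, t):
    """Pairs with disjoint prefixes cannot reach similarity t."""
    return bool(prefix(x, order, t) & prefix(y, order, t))

# Illustrative ordering: identity over 16 positions.
order = list(range(16))
```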
P4JOIN: POSITION FILTER An improvement of the prefix filter that prunes pairs even when their prefixes overlap: estimate the maximally possible overlap and check whether it stays below the minimal overlap α needed to meet threshold t. The original position filter considers the position of the last common prefix token. Revised position filter example: record x with prefix 1 1 0 1 and length 9, record y with prefix 1 1 1 and length 8. The highest prefix position examined (here the fourth position in x) limits the possible overlap with the other record: maximal possible overlap = #shared prefix bits (2) + min(9−3, 8−3) = 2 + 5 = 7 < minimal overlap α = 8, so the pair can be excluded.
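The revised position-filter bound can be written down directly from the slide's formulas; the function names are illustrative, and the test reproduces the slide's numbers (2 shared prefix bits, positions 3 examined, cardinalities 9 and 8, t = 0.8).

```python
# Position-filter sketch: upper-bound the achievable overlap of a pair
# and prune it if the bound is below the minimal overlap alpha.
import math

def min_overlap(card_x, card_y, t):
    """alpha = ceil(t / (1 + t) * (|x| + |y|))"""
    return math.ceil(t / (1 + t) * (card_x + card_y))

def position_filter_pass(shared, pos_x, pos_y, card_x, card_y, t):
    """Maximal possible overlap = shared prefix bits plus the bits that
    can still follow the examined positions in either fingerprint."""
    max_overlap = shared + min(card_x - pos_x, card_y - pos_y)
    return max_overlap >= min_overlap(card_x, card_y, t)

# Slide example: 2 + min(9 - 3, 8 - 3) = 7 < alpha = 8 -> pair pruned.
```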
EVALUATION Comparison between NestedLoop, P4Join and MultiBitTree MultiBitTree: the best filter approach in previous work by Schnell; applies the length filter and organizes fingerprints within a binary tree so that fingerprints with the same set bits are grouped within sub-trees, which filters many fingerprints out of the comparison Two input datasets R, S generated with the FEBRL data generator, N = 100,000, 200,000, …, 500,000, |R| = N/5, |S| = 4N/5 bit vector length: 1000 similarity threshold: 0.8
EVALUATION RESULTS Runtime in minutes on a standard PC:
Approach / dataset size N: 100,000 | 200,000 | 300,000 | 400,000 | 500,000
NestedLoop: 6.10 | 27.68 | 66.07 | 122.02 | 194.77
MultiBitTree: 4.68 | 18.95 | 40.63 | 78.23 | 119.73
P4Join, length filter only: 3.38 | 20.53 | 46.48 | 88.33 | 140.73
P4Join, length + prefix: 3.77 | 22.98 | 52.95 | 99.72 | 159.22
P4Join, length + prefix + position: 2.25 | 15.50 | 40.05 | 77.80 | 125.52
Similar results for P4Join and MultiBitTree; relatively small improvements compared to NestedLoop.
GPU-BASED PPRL Operations on bit vectors are easy to compute on GPUs: length and prefix filters, Jaccard similarity The frameworks CUDA and OpenCL support data-parallel execution of general computations on GPUs: a program ("kernel") is written in a C dialect limited to base data types (float, long, int, short, arrays) no dynamic memory allocation (the programmer controls memory management) important to minimize data transfer between main memory and GPU memory
EXECUTION SCHEME
- partition inputs R and S (fingerprints sorted by length) into equally-sized partitions that fit into GPU memory
- generate one match task per pair of partitions
- only transfer a pair of partitions to the GPU if their length intervals satisfy the length filter
- optionally use CPU threads to additionally match on the CPU
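The CPU-side task generation of this scheme can be sketched in Python (the CUDA kernels themselves are out of scope here). Names and the partition representation are illustrative assumptions: partitions are contiguous slices of the sorted cardinality lists, and a partition pair is kept only if the most favorable pair across their cardinality intervals could pass the length filter.

```python
# Sketch of match-task generation with length-interval pruning.
def partitions(cards, size):
    """Split a sorted list of cardinalities into fixed-size slices."""
    return [cards[i:i + size] for i in range(0, len(cards), size)]

def match_tasks(r_cards, s_cards, size, t):
    """Return (i, j) index pairs of partitions worth shipping to the GPU.
    For intervals [a1, b1] and [a2, b2] the best pair passes the length
    filter iff min(b1, b2) >= max(a1, a2) * t."""
    tasks = []
    for i, rp in enumerate(partitions(sorted(r_cards), size)):
        for j, sp in enumerate(partitions(sorted(s_cards), size)):
            if min(rp[-1], sp[-1]) >= max(rp[0], sp[0]) * t:
                tasks.append((i, j))
    return tasks
```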
GPU-BASED EVALUATION RESULTS Runtime in minutes for two low-profile graphics cards (GeForce GT 610: 48 CUDA cores @ 810 MHz, 1 GB; GeForce GT 540M: 96 CUDA cores @ 672 MHz, 1 GB):
Dataset size N: 100,000 | 200,000 | 300,000 | 400,000 | 500,000
GeForce GT 610: 0.33 | 1.32 | 2.95 | 5.23 | 8.15
GeForce GT 540M: 0.28 | 1.08 | 2.41 | 4.28 | 6.67
Improvements by up to a factor of 20, despite low-profile graphics cards; still a non-linear increase in execution time with growing data volume.
AGENDA ScaDS Dresden/Leipzig Big Data Integration Introduction Matching product offers from web shops DeDoop: Deduplication with Hadoop Privacy-preserving record linkage with PP-Join Cryptographic bloom filters Privacy-Preserving PP-Join (P4Join) GPU-based implementation Big Graph Data Graph-based Business Intelligence with BIIIG GraDoop: Hadoop-based data management and analysis Summary and outlook
GRAPHS ARE EVERYWHERE Social science, engineering, life science, information science
- Facebook: ca. 1.3 billion users, ca. 340 friends per user; Twitter: ca. 300 million users, ca. 500 million tweets per day
- Internet: ca. 2.9 billion users; World Wide Web: ca. 1 billion websites; LOD cloud: ca. 31 billion triples
- Genes (human): 20,000-25,000; ca. 4 million individuals; patients: > 18 million (Germany); illnesses: > 30,000
USE CASE: GRAPH-BASED BUSINESS INTELLIGENCE Business intelligence is usually based on relational data warehouses: enterprise data is integrated within a dimensional schema, analysis is limited to predefined relationships, and there is no support for relationship-oriented data mining. Graph-based approach (BIIIG): integrate data sources within an instance graph, preserving the original relationships between data objects (transactional and master data); determine subgraphs (business transaction graphs) related to business activities; analyze subgraphs or entire graphs with aggregation queries, mining of relationship patterns, etc.
SAMPLE GRAPH
BIIIG DATA INTEGRATION AND ANALYSIS WORKFLOW Business Intelligence on Integrated Instance Graphs
SCREENSHOT FOR NEO4J IMPLEMENTATION
GRAPH DATA MANAGEMENT Relational database systems store vertices and edges in tables and utilize indexes, column stores, etc. Graph database systems, e.g. Neo4j: use of the property graph data model (vertices and edges have an arbitrary set of properties represented as key-value pairs), focus on simple transactions and queries. Distributed graph processing systems, e.g. Google Pregel, Apache Giraph, GraphX: in-memory storage of graphs in a shared-nothing cluster, parallel processing of general graph algorithms, e.g. PageRank, connected components
WHAT'S MISSING? A comprehensive framework and research platform for efficient, distributed and domain-independent graph analytics.
GRADOOP CHARACTERISTICS Hadoop-based framework for graph data management and analysis Graph storage in a scalable distributed store, e.g., HBase Extended property graph data model: operators on graphs and sets of (sub)graphs, support for semantic graph queries and mining Leverages powerful components of the Hadoop ecosystem (MapReduce, Giraph, Spark, Pig, Drill) New functionality for graph-based processing workflows and graph mining
GRADOOP HIGH-LEVEL ARCHITECTURE Frontend: graph visualization, visual workflow definition. Backend: graph operators, pipeline execution, control data; bulk load and bulk write to the graph repository on distributed storage (shared-nothing cluster).
EXTENDED PROPERTY GRAPH MODEL A partitioned, directed, labeled, attributed multigraph: vertices and edges carry type labels (e.g. A, B, C for vertices; a, b, c, d for edges) and key-value properties (k1 : v1, k2 : v2, …); logical graphs such as G0 and G1 are themselves labeled and attributed and may share vertices and edges.
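The data model can be illustrated with a small sketch. This is not Gradoop's actual API; the class names and sample elements are invented, following the slide's idea that vertices and edges carry a label, key-value properties, and a set of logical-graph memberships.

```python
# Illustrative sketch of the extended property graph model.
from dataclasses import dataclass, field

@dataclass
class Element:
    id: int
    label: str
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)   # logical graph membership

@dataclass
class Edge(Element):
    source: int = 0
    target: int = 0

# Two vertices and an edge; vertex v_a belongs to logical graphs G0 and G1.
v_a = Element(0, "A", {"k1": "v1"}, {0, 1})
v_b = Element(1, "B", {"k1": "v1", "k2": "v2"}, {1})
e_ab = Edge(0, "a", {"k1": "v1"}, {1}, source=0, target=1)
```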
GRADOOP OPERATORS
Single graph operations:
- Aggregation γ: G → ℝ — input: graph G, aggregate function; output: number/string g
- Subgraph Discovery θ(υ,ε): G → 𝒢 — input: graph G, vertex map υ and edge map ε; output: graph set 𝒢
Graph set operations:
- Selection σφ: 𝒢 → 𝒢 — input: graph set 𝒢, predicate φ: G → {0, 1}; output: graph set 𝒢'
Binary graph comparison:
- Similarity ~: G × G → ℝ — input: graphs G1, G2; output: similarity s
n-ary graph comparison:
- Frequent Subgraphs φt: 𝒢 → 𝒢 — input: graph set 𝒢, threshold 0 ≤ t ≤ 1; output: graph set 𝒢'
Further operators: Summarization, Pattern Match, Projection, Map, Union, Intersect, Difference, Edit Steps, Equivalence, Equality, Inner Join, Outer Join
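Two of the listed operators can be sketched generically; this is a hypothetical illustration of the operator semantics, not Gradoop's API, with graphs represented as plain dictionaries.

```python
# Sketch of selection and aggregation over a set of logical graphs.
def select(graphs, predicate):
    """Selection: keep the graphs that satisfy the predicate."""
    return [g for g in graphs if predicate(g)]

def aggregate(graph, fn):
    """Aggregation: map a single graph to a number (or string)."""
    return fn(graph)

# Illustrative graphs; aggregation counts vertices, selection keeps
# graphs with more than one vertex.
g0 = {"label": "A", "vertices": [0, 1, 2]}
g1 = {"label": "B", "vertices": [3]}
kept = select([g0, g1], lambda g: len(g["vertices"]) > 1)
n = aggregate(g0, lambda g: len(g["vertices"]))
```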
IMPLEMENTATION STATUS Gradoop-BIIIG: BTG analysis pipeline (data import, selection, aggregation) Gradoop-MapReduce and Gradoop-Giraph (Giraph 1.1.0): subgraph discovery, I/O formats Gradoop core: EPG model, HBaseGraphStore, bulk load, I/O formats (Hadoop 1.2.1, HBase 0.98.7)
BIIIG WITH GRADOOP Pipeline on the Foodbroker-generated integrated instance graph: Bulk Load → Subgraph Discovery → Selection → Aggregation
AGENDA ScaDS Dresden/Leipzig Big Data Integration Introduction Matching product offers from web shops DeDoop: Deduplication with Hadoop Privacy-preserving record linkage with PP-Join Cryptographic bloom filters Privacy-Preserving PP-Join (P4Join) GPU-based implementation Big Graph Data Graph-based Business Intelligence with BIIIG GraDoop: Hadoop-based data management and analysis Summary and outlook
SUMMARY ScaDS Dresden/Leipzig Research focus on data integration, knowledge extraction, visual analytics broad application areas (scientific + business-related) solution classes for applications with similar requirements Big Data Integration Big Data poses new requirements for data integration (variety, volume, velocity, veracity) comprehensive data preprocessing and cleaning Hadoop-based approaches for improved scalability, e.g. Dedoop usability: machine-learning approaches, GUI, …
SUMMARY (2) Scalable Privacy-Preserving Record Linkage Bloom filters allow a simple, effective and relatively efficient match approach Privacy-preserving PP-Join (P4Join) achieves performance comparable to MultiBitTrees but is easier to parallelize the GPU version achieves significant speedups further improvements are needed to reduce the quadratic complexity Big Graph Data high potential of graph analytics, even for business data (BIIIG) GraDoop: infrastructure for the entire processing pipeline: graph acquisition, storage, integration, transformation, analysis (queries + graph mining), visualization leverages the Hadoop ecosystem including graph processing systems extended property graph model with powerful operators
OUTLOOK Parallel execution of more diverse data integration workflows for text data, image data, sensor data, etc. learning-based configuration to minimize manual effort (active learning, crowd-sourcing) Holistic integration of many data sources (data + metadata) clustering across many sources n-way merging of related ontologies (e.g. product taxonomies) Improved privacy-preserving record linkage better scalability, also for n-way (multi-party) PPRL Big Graph data management complete processing framework improved usability
REFERENCES
H. Köpcke, A. Thor, S. Thomas, E. Rahm: Tailoring entity resolution for matching product offers. Proc. EDBT 2012: 545-550
L. Kolb, E. Rahm: Parallel Entity Resolution with Dedoop. Datenbank-Spektrum 13(1): 23-32 (2013)
L. Kolb, A. Thor, E. Rahm: Dedoop: Efficient Deduplication with Hadoop. PVLDB 5(12), 2012
L. Kolb, A. Thor, E. Rahm: Load Balancing for MapReduce-based Entity Resolution. Proc. ICDE 2012: 618-629
L. Kolb, Z. Sehili, E. Rahm: Iterative Computation of Connected Graph Components with MapReduce. Datenbank-Spektrum 14(2): 107-117 (2014)
A. Petermann, M. Junghanns, R. Müller, E. Rahm: BIIIG: Enabling Business Intelligence with Integrated Instance Graphs. Proc. 5th Int. Workshop on Graph Data Management (GDM 2014)
A. Petermann, M. Junghanns, R. Müller, E. Rahm: Graph-based Data Integration and Business Intelligence with BIIIG. Proc. VLDB Conf., 2014
E. Rahm, W.E. Nagel: ScaDS Dresden/Leipzig: Ein serviceorientiertes Kompetenzzentrum für Big Data. Proc. GI-Jahrestagung 2014: 717
R. Schnell, T. Bachteler, J. Reiher: Privacy-preserving record linkage using Bloom filters. BMC Med. Inf. & Decision Making 9: 41 (2009)
Z. Sehili, L. Kolb, C. Borgs, R. Schnell, E. Rahm: Privacy Preserving Record Linkage with PPJoin. Proc. BTW Conf. 2015
C. Xiao, W. Wang, X. Lin, J.X. Yu: Efficient Similarity Joins for Near Duplicate Detection. Proc. WWW 2008