Return on Experience with Data Quality Tools at Smals




Return on Experience with Data Quality Tools at Smals Dries Van Dromme @ULB, 13/03/2013

Contents
- Introduction: Smals
- Introduction: The Reality of Data Quality
- Functionalities of Data Quality Tools
- Organization for Data Governance
- Conclusions

Full IT service, with large development components for new applications
- Infrastructure
- Social Security & Healthcare
- Application Development
- Employee Secondment
ICT for work, health & family life

Smals: one of the largest IT organizations in Belgium
- >1700 employees, of which >1200 ICT
- HQ at South Station
- Turnover > 200 million
[Chart: number of employees per year, 2000 to 2011]

Facts & Figures
- Datacenters: 2.393 m²
- Batch: 6.100
- Applications: 571
- Service Management: 2.389
- Network: 30.000.000
- Servers: 1.739
- Storage: 1.340 TB
- Databases: 585
- Middleware & Web Services: 2.753.908

Some of our members

Some of our projects: Orgadon


The Reality of Data Quality (Problems)
The same person and address, as recorded in different places:
- Bontemps Yves, 102, Rue Prince Royal, 1050 Bruxelles
- Yves Beautemps, Koninklijke prinsstr 102, 1020 Elsene
- Yves Bontemps, Rue du Prince Royal 102, 1050 Ixelles


Functionalities of Data Quality Tools
- Data Profiling
- Data Standardisation
- Data Matching / fuzzy matching / record linkage
- Real-world examples
  - Name and address cleansing, address validation
  - Incoherence detection, fuzzy matching between DBs

Data Profiling: Definition
Formal audit of data and documentation (Jack E. Olson, "Data Quality: The Accuracy Dimension")
- input: actual data + documentation, metadata (quality unknown)
- output: Data Quality Issues (quality mapped out) + support for correcting the documentation
Data Profiling with Data Quality Tools
- Field by field: Column Property Analysis
- Structure: referential constraints
- Consistency: business rules

Data Profiling: Column Property Analysis, drill-down (1)
- Columns (1)-(16) are analysed during load, which produces generated metadata
- Min, Max, Lengths, Null Values, Unique Values
- Patterns, Distributions, Business Rules
- Example: Apostemp(12), the employer's postal code, with patterns derived by means of masks
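The column property analysis listed above can be sketched in a few lines. This is only an illustration, not the behaviour of any particular data quality tool: the function names `mask` and `profile_column`, the mask alphabet (9 for a digit, A for a letter) and the sample postal codes are all assumptions.

```python
from collections import Counter


def mask(value: str) -> str:
    """Replace every digit by '9' and every letter by 'A'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Compute basic column properties; empty strings and None count as nulls."""
    non_null = [v for v in values if v not in (None, "")]
    lengths = [len(v) for v in non_null]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "unique": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "min_len": min(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
        "patterns": Counter(mask(v) for v in non_null),
    }


# Hypothetical postal-code column: one empty value, one letter O typed for a zero.
postcodes = ["1050", "1020", "B-1050", "", "105O", "1050"]
report = profile_column(postcodes)
for pattern, count in report["patterns"].most_common():
    print(pattern, count)
```

Rare patterns in the resulting distribution are the natural starting point for a drill-down.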

Data Profiling: Column Property Analysis, drill-down (2)

Data Profiling: Referential integrity, foreign keys (1)
- Source 1: Authentieke bron(44) (the authentic source), key: Ind_nr
- Source 2: Source_secondaire(60), whose Identificatienummer refers to Ind_nr values
- Sizes shown on the slide: 2 million and 250,000
- Confronting source 1 with source 2: 86 values are not in the authentic source, spread over 669 records (so there are duplicates)
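A referential integrity check of this kind boils down to looking for foreign key values that have no counterpart in the authentic source. The sketch below is a minimal illustration; the function name `orphan_foreign_keys` and the sample keys are hypothetical.

```python
from collections import Counter


def orphan_foreign_keys(primary_keys, foreign_keys):
    """Count, per value, the foreign keys that do not occur among the primary keys."""
    known = set(primary_keys)
    return Counter(fk for fk in foreign_keys if fk not in known)


# Hypothetical data: keys of the authentic source and references in a secondary source.
authentic = ["A1", "A2", "A3"]
secondary = ["A1", "A9", "A9", "A2", "A7"]

orphans = orphan_foreign_keys(authentic, secondary)
distinct_missing = len(orphans)           # distinct values absent from the authentic source
affected_records = sum(orphans.values())  # records carrying such a value
print(distinct_missing, affected_records)
```

When the number of affected records exceeds the number of distinct missing values, as with the 86 values in 669 records on the slide, some orphan values occur more than once.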

Data Profiling: Referential integrity, foreign keys (2)

Data Profiling: Functional Dependencies
- Detection, during load: quality and conflicts computed on a sample (10,000)
- Verification, on demand: exhaustive (2.9 million)
- Drill-down to an overview of the conflicts if Quality < 100%
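Verifying a functional dependency A → B means checking that each value of A is always paired with the same value of B. The sketch below assumes a simple definition of "quality" (the share of rows whose determinant value is conflict-free); that definition, the function name `check_fd` and the sample rows are assumptions made for illustration, not the tool's actual metric.

```python
from collections import defaultdict


def check_fd(rows, determinant, dependent):
    """Return (quality, conflicts) for the functional dependency determinant -> dependent."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[determinant]].add(row[dependent])
    # A conflict is a determinant value associated with more than one dependent value.
    conflicts = {k: sorted(v) for k, v in seen.items() if len(v) > 1}
    conflicting_rows = sum(1 for row in rows if row[determinant] in conflicts)
    quality = 1.0 - conflicting_rows / len(rows) if rows else 1.0
    return quality, conflicts


# Hypothetical rows: does the postal code determine the municipality name?
rows = [
    {"postcode": "1050", "city": "Ixelles"},
    {"postcode": "1050", "city": "Elsene"},
    {"postcode": "1020", "city": "Laeken"},
    {"postcode": "1020", "city": "Laeken"},
]
quality, conflicts = check_fd(rows, "postcode", "city")
print(quality, conflicts)
```

The `conflicts` dictionary plays the role of the drill-down overview: it is only non-empty when quality is below 100%.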

Data Profiling: Business Rules (1)

Data Profiling: Business Rules (2)

Data Profiling: summary
What's the most difficult part? A large % of the effort goes into:
- getting access
- getting the metadata
- getting the data
- getting the data in the right (normal) form
Inputs to a Data Profiling project:
- DB schema, constraints, business rules, documentation, metadata (inaccurate, incomplete)
- Real data (complete, quality = ?)
Actors and means: DQ Tools, business people, DQ Analyst team
Outputs:
- Metadata (corrected)
- Facts about data (quality = !)
- Data Quality Issues
  - lack of standardisation
  - schema not respected
  - rules not respected

Data Standardisation: Definition
Resolving a lack of standardisation
- within a single database
- across different databases: transversal data management, breaking down silos, resolving inconsistencies in the (re-)use of data concepts
- across different companies or institutions
Internal processing step for fuzzy matching
- best practice: matching results become more reliable

Data Standardisation: simple domains
Example: country codes (columns shown: original; added by means of the data quality tool)
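For a simple domain such as country codes, standardisation can be approximated by a lookup table applied after light normalisation, with the standardised value added next to the original. The table `COUNTRY_CODES` and the function names below are hypothetical, and a real tool would ship a far larger knowledge base.

```python
import unicodedata

# Hypothetical lookup table: normalised spelling -> two-letter country code.
COUNTRY_CODES = {
    "be": "BE", "belgie": "BE", "belgique": "BE", "belgium": "BE",
    "fr": "FR", "france": "FR", "frankrijk": "FR",
}


def normalise(text: str) -> str:
    """Strip accents and surrounding whitespace, and lower-case the value."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.strip().lower()


def standardise_country(raw: str):
    """Return (original, standardised code or None); the original value is kept."""
    return raw, COUNTRY_CODES.get(normalise(raw))


for value in [" België ", "BELGIQUE", "Frankrijk", "Atlantis"]:
    print(standardise_country(value))
```

Values that cannot be standardised come back with `None`, so they can be routed to manual review rather than silently changed.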

Data Standardisation: complex domains
Example: names and addresses
- Input: 781114-269.56 Yves Bontemps Rue Prince Royal 102 Bruxelles
- Parsing: 781114-269.56 | Yves | Bontemps | Rue Prince Royal | 102 | Bruxelles
- Enrichment: 781114-269.56 | Yves | Bontemps | Male | Rue du Prince Royal / Koninklijke Prinsstraat | 102 | 1050 | Ixelles / Elsene
Data Quality Tools
- thousands of rules for Parsing
- knowledge bases for Enrichment, often regional

Data Matching: Definition
Dealing with spelling errors, inaccuracies and lack of standardisation, in order to link records
- within one source (duplicate detection)
- across several sources (duplicate or incoherence detection)
- even where the data models of the sources being compared do not correspond (data integration)
- with high performance (>= 1 million/hour) and high flexibility
The definition is often unclear at the outset: what may be considered a 'duplicate' or 'incoherent'?
Link with anomaly management: only once the definitions have been validated is there a formal anomaly, whose detection then becomes possible.

Data Matching algorithms
Matching on two levels
- Field-per-field
- Per record (aggregation)
Performance [!]: comparing N records with each other takes N(N-1)/2 comparisons
[Figure: Record A from DB A (nom, prenom, rue, numero, cp, date_naiss) is compared with Record B from DB B (name, 1st_name, street, housenbr, postcode, gender); the field-level outcomes (Y, N, Y, N, Y) are aggregated into match / non-match]
Reference: WINKLER W. E., Overview of Record Linkage and Current Research Directions, 2006. http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf

Data Matching algorithms: Field-per-field
Character strings (names, street names, numbers, wherever there is fuzziness)
Two fields are "the same":
- AFSCA "=" Agence Fédérale pour la Sécurité de la Chaîne Alimentaire
- Yves "=" Yves
- Yves "=" yves
- Yves "=" Yvse
- Bontemps "=" Bontan
- Bontemps, Yves "=" Yves Bontemps

Data Matching algorithms: Field per field
Classes of "comparison routines"
- Booleans (classifiers)
  - Rules & predicates: equal to, suffix of, stems from (Boulanger ~ Boulangerie, S.A. ~ Société Anonyme)
  - Phonetics: Soundex, Metaphone (Bontant ~ Bontemps)
- Similarity-based
  - Word-based: edit distances, common letters (Bontemps ~ Bnotmps)
  - Token-based: cosine, recursive (Office National des matières fissiles ~ Office des matières fissiles)

Data Matching algorithms: Field per field
Classes of "comparison routines": Booleans, Rules & predicates
Boolean relations, with examples:
- is-equal-to: Bontemps, Bontemps
- is-a-prefix-of: Bont, Bontemps
- is-a-suffix-of: Emps ~ Bontemps
- is-an-abbreviation-of: S.A. ~ Soc. Anon.
- stems-from: Bakkerij, Bakker
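The Boolean relations above are easy to express as small predicates. The sketch below is a toy version: the function names follow the relation names on the slide, but the abbreviation rule (initials of the long form) and the suffix list used for stemming are simplifying assumptions, not rules taken from any tool.

```python
def is_equal_to(a: str, b: str) -> bool:
    """Case-insensitive equality."""
    return a.casefold() == b.casefold()


def is_a_prefix_of(a: str, b: str) -> bool:
    return b.casefold().startswith(a.casefold())


def is_a_suffix_of(a: str, b: str) -> bool:
    return b.casefold().endswith(a.casefold())


def is_an_abbreviation_of(short: str, long: str) -> bool:
    """Assumption: an abbreviation consists of the initials of the long form's words."""
    letters = "".join(c for c in short if c.isalpha()).casefold()
    initials = "".join(word[0] for word in long.split()).casefold()
    return bool(letters) and letters == initials


def stems_from(word: str, stem: str, suffixes=("ij", "ie")) -> bool:
    """Assumption: a word stems from `stem` if it is the stem plus a known suffix."""
    w, s = word.casefold(), stem.casefold()
    return w.startswith(s) and w[len(s):] in suffixes


print(is_a_prefix_of("Bont", "Bontemps"))
print(is_a_suffix_of("Emps", "Bontemps"))
print(is_an_abbreviation_of("S.A.", "Soc. Anon."))
print(stems_from("Bakkerij", "Bakker"))
```

Each predicate returns only true or false, which is what distinguishes this class from the similarity-based routines.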

Data Matching algorithms: Field per field
Classes of "comparison routines": Booleans, Rules & Predicates
- Hand-coded rules
  - Hard-wired (Java, C, COBOL, ...)
  - Rule engine (declarative rules: pre-defined, user-defined)
- May be packaged in commercial software (black box?)
- Making the best use of domain knowledge is possible
- Tedious
- Costly (in development and in maintenance)

Data Matching algorithms: Field per field
Classes of "comparison routines": Booleans, Phonetics
- Origin: transcription errors (oral → written)
- Principle: a word m is mapped to a representation of its pronunciation p(m); if p(m) = p(m') then m ~ m'
- Example, Bontemps: Montand, Bontant, Bonton, Bontand, Bautemps

Data Matching algorithms: Field per field
Classes of "comparison routines": Booleans, Phonetics
Russell Soundex Algorithm (1918)
- Keep the first letter
- Remove a, e, h, i, o, u, w, y
- Coding: 1: B, F, P, V; 2: C, G, J, K, Q, S, X; 3: D, T; 4: L; 5: M, N; 6: R
- Keep the first 4 symbols (padding with 0)
Examples
- Bontemps → BNTMPS → B53512 → B535
- Bondant → BNDNT → B5353 → B535
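The steps listed on the slide translate almost directly into code. The sketch below implements only that simplified variant; standard Soundex additionally collapses adjacent letters that share a code and places Z in group 2, neither of which appears on the slide. The function name `slide_soundex` is an assumption.

```python
# Letter groups exactly as listed on the slide.
GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSX",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
CODES = {letter: digit for digit, letters in GROUPS.items() for letter in letters}


def slide_soundex(word: str) -> str:
    """Simplified Soundex: first letter, coded consonants, truncated or padded to 4 symbols."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    # Letters without a code (a, e, h, i, o, u, w, y and anything unlisted) are dropped.
    digits = "".join(CODES[c] for c in rest if c in CODES)
    return (first + digits + "000")[:4]


for name in ["Bontemps", "Bondant"]:
    print(name, slide_soundex(name))
```

Both names from the slide come out as B535, so a Boolean phonetic comparison treats them as equal.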

Data Matching algorithms: Field per field
Classes of "comparison routines": Booleans, Phonetics
Alternatives:
- Metaphone and Double Metaphone
- NYSIIS and NameSearch
- Daitch-Mokotoff: Slavic & German languages, 54 entries
- Fonem: French-oriented, 64 rules
- Phonex, etc.


Data Matching algorithms: Field per field
Classes of "comparison routines": Similarity-based
- Replace the Boolean approach (0, 1) with a finer-grained one
- Idea: varying degrees of similarity between Bontemps, Smith, Potnamp, B0tenps
[Figure: Potnamp, B0tenps and Smith placed at varying distances from Bontemps]

Data Matching algorithms: Field per field
Classes of "comparison routines": Similarity, word-based
Edit Distance
- Distance between two strings = number of "operations" needed to transform one into the other
  - Insertion (I)
  - Deletion (D)
  - Substitution (S)
- Cost per "error" (operation): OCR, keyboard, etc.
- Example: Bontemps → Bnontemps (I) → Bnotemps (D) → Bnotmps (D): 3 operations
- Most popular: Levenshtein, http://en.wikipedia.org/wiki/levenshtein_distance
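Levenshtein distance with unit costs can be computed with the classic dynamic-programming recurrence. The sketch below is a textbook implementation, not the routine of any particular tool; it is applied to the pair Bontemps ~ Bnotmps used earlier in the deck for edit distances.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free when characters are equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Bontemps", "Bnotmps"))
```

A per-operation cost table (for example cheaper substitutions between keys that are adjacent on a keyboard) could replace the unit costs, in line with the "cost per error" remark on the slide.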

Data Matching algorithms: Field per field
Classes of "comparison routines": Similarity, word-based
Jaro-Winkler Distance
- The longest string counts: s
- Window of (s/2) - 1
- Example (SLUITEN compared with STRUDEL): s = 7, window = 3
  - m = 2 characters are in matching positions
  - t = 1 further transposition within the window
  - $d = \frac{1}{3}\left(\frac{m}{s_1} + \frac{m}{s_2} + \frac{m - t}{m}\right)$
  - $d = \frac{1}{3}\left(\frac{2}{7} + \frac{2}{7} + \frac{2 - 1}{2}\right) \approx 0.357$
[Figure: matrix marking the matching characters between SLUITEN and STRUDEL]
REF: http://en.wikipedia.org/wiki/jaro%e2%80%93winkler_distance

Data Matching algorithms: Field per field
Classes of "comparison routines": Similarity, token-based
Token-based metrics
- When? Multiple elements in the compared fields
- Example: ignoring word order, {Bontemps, Yves} = {Yves, Bontemps}
- Jaccard metric: $J(S, T) = \frac{|S \cap T|}{|S \cup T|}$
  - S = {Bontemps; Yves}, T = {Yves; Bontemps}: 2/2 = 1
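The Jaccard metric is the size of the intersection of the two token sets divided by the size of their union. A minimal sketch, assuming whitespace and comma tokenisation and case-insensitive comparison (both tokenisation choices are assumptions):

```python
def token_set(text: str) -> set:
    """Split on whitespace and commas; compare case-insensitively."""
    return set(text.casefold().replace(",", " ").split())


def jaccard(a: str, b: str) -> float:
    """|S ∩ T| / |S ∪ T| over the token sets of a and b."""
    s, t = token_set(a), token_set(b)
    if not s and not t:
        return 1.0  # convention chosen here: two empty fields are identical
    return len(s & t) / len(s | t)


print(jaccard("Bontemps, Yves", "Yves Bontemps"))
print(jaccard("Yves Bontemps", "Yves Dupont"))
```

Because sets ignore order, swapping first name and last name has no effect on the score.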

Data Matching algorithms: Field per field
Classes of "comparison routines": Similarity, token-based
- Often, weighting schemes are used
- E.g. Term Frequency / Inverse Document Frequency (TF-IDF): uses the rarity of a word as a measure of its importance
- Examples:
  - Organisme National des Déchets Radioactifs et des Matières Fissiles
  - Office Nationale des Matières Fissiles et Déchets Radioactifs
  - Office Nationale de la Sécurité Sociale
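The effect of TF-IDF weighting can be illustrated on the three names from the slide. The sketch below uses one common smoothed IDF variant, log(N/df) + 1, which is an assumption (the deck does not say which formula is used), together with plain cosine similarity; the helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(names):
    """Build one sparse TF-IDF vector (dict token -> weight) per name."""
    docs = [Counter(name.casefold().split()) for name in names]
    n = len(docs)
    df = Counter(token for doc in docs for token in doc)          # document frequency
    idf = {token: math.log(n / df[token]) + 1.0 for token in df}  # assumed IDF variant
    return [{token: tf * idf[token] for token, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


names = [
    "Organisme National des Déchets Radioactifs et des Matières Fissiles",
    "Office Nationale des Matières Fissiles et Déchets Radioactifs",
    "Office Nationale de la Sécurité Sociale",
]
v0, v1, v2 = tfidf_vectors(names)
print(round(cosine(v0, v1), 3), round(cosine(v1, v2), 3), round(cosine(v0, v2), 3))
```

With these weights the first two names score much higher against each other than the second and third do, even though the latter pair shares the leading words "Office Nationale".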

Data Matching algorithms: Field per field
Classes of "comparison routines": Similarity, token-based
Recursive approaches:
- Use a "word-based" routine (Levenshtein, ...) to determine the similarity between words (Office ~ Ofice)
- Use a "token-based" technique to combine the results
- Ref: Soft TF-IDF, etc.
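The two-step idea can be sketched as follows: a word-based routine scores pairs of tokens, and a token-based step combines the best scores. This is a deliberately simple combination, not Soft TF-IDF itself; the threshold of 0.8, the averaging rule and the function names are assumptions, and the resulting measure is not symmetric.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def word_similarity(a: str, b: str) -> float:
    """Word-based step: edit distance rescaled to the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def soft_token_similarity(s: str, t: str, threshold: float = 0.8) -> float:
    """Token-based step: average the best word match of each token of s, if close enough."""
    tokens_s, tokens_t = s.casefold().split(), t.casefold().split()
    if not tokens_s or not tokens_t:
        return 0.0
    total = 0.0
    for a in tokens_s:
        best = max(word_similarity(a, b) for b in tokens_t)
        if best >= threshold:
            total += best
    return total / max(len(tokens_s), len(tokens_t))


print(soft_token_similarity("Office National", "Ofice National"))
```

An exact token comparison would only match "National" here; the word-based step lets "Office" and "Ofice" contribute as well.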

Data Matching: in practice
Phonetic comparison (KBO):
- Decomposition into words
- Words converted to phonetic form (a variant of Soundex)
- Common terms are ignored (S.A., ...)
- Stemming
- Comparison (taking word order into account)

Data Matching algorithms: Aggregate level, "per record"
Approaches:
- Deterministic: it is always clear why the algorithm decided match or non-match
- Probabilistic (global scoring): interpretation is difficult, if not impossible
  - Fellegi & Sunter, 1969; cf. Machine Learning
  - For each field comparison, two probabilities:
    - m = probability of Y (agreement) if A and B are a true match
    - u = probability of Y (agreement) if A and B are a true non-match
  - The score is computed from these probabilities (m/u)
  - Classification according to the score
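A common way to turn the m and u probabilities into a record-level score is to sum log-likelihood ratios: log2(m/u) for each field that agrees and log2((1-m)/(1-u)) for each field that disagrees. The sketch below follows that scheme; the m/u values, the two thresholds and the function names are invented for illustration and would normally be estimated from data.

```python
import math

# Hypothetical (m, u) probabilities per field.
FIELD_PROBABILITIES = {
    "name": (0.95, 0.01),
    "street": (0.90, 0.05),
    "postcode": (0.98, 0.10),
}


def match_score(agreements: dict) -> float:
    """Sum of per-field weights; `agreements` maps field name -> True/False."""
    score = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBABILITIES[field]
        score += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return score


def classify(score: float, lower: float = 0.0, upper: float = 6.0) -> str:
    """Two thresholds (assumed values) split scores into three classes."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"


for pattern in [
    {"name": True, "street": True, "postcode": True},
    {"name": False, "street": True, "postcode": True},
    {"name": False, "street": False, "postcode": False},
]:
    score = match_score(pattern)
    print(pattern, round(score, 2), classify(score))
```

The middle class is typically the one sent to clerical review, which is where the interpretation difficulty mentioned on the slide shows up.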

Data Matching algorithms: Aggregate level, "per record"

Data Matching: Performance
- Matching = comparing all records of DB A with all records of DB B (or comparing all records of a DB with each other)
- E.g. A: 100 000 records, B: 100 000 records → 10 000 000 000 comparisons to calculate!
- De-duplication or duplicate detection: N records → N(N-1)/2 comparisons to calculate!
- Optimisations are necessary

Data Matching: Performance
Blocking technique:
- A critical dimension is chosen to form blocks or windows
- Records are only compared within windows
- E.g. postal code
- Trade-off between precision and performance
- Therefore, the dimension chosen must be highly trustworthy (of good data quality)
- Requires indexing and sorting
- Multiple passes, or the combination of multiple windowing techniques, are possible
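Blocking can be sketched by grouping records on the chosen key and generating candidate pairs only inside each group. The records, the field names and the function name `candidate_pairs` below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, key):
    """Yield pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


# Hypothetical records, blocked on postal code.
records = [
    {"id": 1, "name": "Yves Bontemps", "postcode": "1050"},
    {"id": 2, "name": "Bontemps Yves", "postcode": "1050"},
    {"id": 3, "name": "Yves Beautemps", "postcode": "1020"},
    {"id": 4, "name": "Jan Peeters", "postcode": "1020"},
    {"id": 5, "name": "Marie Dubois", "postcode": "1050"},
]

all_pairs = len(records) * (len(records) - 1) // 2  # N(N-1)/2 without blocking
blocked = list(candidate_pairs(records, key=lambda r: r["postcode"]))
print(all_pairs, len(blocked))
print([(a["id"], b["id"]) for a, b in blocked])
```

Records 1 and 3 are never compared because their postal codes differ, which illustrates why the blocking dimension has to be trustworthy, or why several passes with different keys may be needed.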

Data Matching: Performance
Illustration: Window Key creation

Data Matching: Real-world example, confrontation of 2 DBs
- 2 databases, L and R
- There is no 'foreign key' relation between L and R
- The DQ tool detects duplicates and organises them in clusters
- The DQ tool establishes the link between the two databases
- Possible fraud?

Data Matching: Real-world example, importance of the internal standardisation step
[Comparison shown: without / with standardisation]
Result (address validation, address cleansing)
- corrected postal code
- standardised street name
- correctly classified address elements (parsing)
- corrected municipality name
- duplicates detected and organised in clusters

Data standardisation & matching: not only names & addresses

Data Quality Tools: supporting iterative consultation with the business
Example from a real Address Cleansing project
- Each match pattern helps the analyst fine-tune the matching routines (Discovery)
- Iterative consultation with the business on the definition and handling of each match pattern (Validation)
  - 110: reliable, correction scenarios
  - 402: deviation in the name + problem with the postal code
  - 135: larger deviation in the name
- Each validated match pattern → anomaly (anomaly_type_id)

DQ tools support data quality problem resolution
E.g. integrating incoherent sources: 'commonize' the telephone number for all matched records (908-845-1234)


Organization for Data Governance: Smals project approach
[Diagram; elements recoverable from the slide:]
- Business: logic (definitions (anomaly, ...), business rules, correction or standardisation procedures, algorithms); changes (Δ) to business processes, applications and DB constraints
- Batch DQ project: extract from DB A, B, C; Discovery produces a result; Validation by the business produces a validated result (logic, rules)
- DQ Team with DQ Tools: documents and enriches the Documentation
- Management of the anomaly history (batch / online extract)
- Online DQ: DQ Tools, J2EE, SaaS, ...
- Applications: updated with the result

Data Quality Improvement: continuous cycle
Discovery → Validation → Deployment → Monitoring → Discovery ...

Organization for Data Governance
Data Governance Projects: phases
- Discovery
- Validation (in close consultation with the business)
- Deployment (what, how (solution architecture))
- Governance (monitoring, lifecycle (incl. retirement))


Conclusions: When to act (on DQ)
- Whenever the quality of the data has strategic importance
- Whenever the (low) quality of data entails recurrent costs and limits business capabilities (consider fitness for use)
- Whenever the use of the data changes
Typical context: projects & governance programs, involving
- Data Integration, Data Migration, Reengineering
- Fuzzy matching (e.g. names, addresses)
- Standardisation, Master Data Management
- Incoherence detection (even fraud) & resolution
- Anomaly management

Conclusions: How to act
Because of the typical nature of a DQ project, work in three phases
- Discovery
- Validation
- Deployment
An iterative process with business involvement is a critical success factor.
Data Quality Tools are better at this, thanks to: Performance, Productivity, Flexibility
- configure instead of developing code
- data matching >= 1 million records/hr
- GUI / visualize / drill-down / export format = ideal for consultation with the business
In our experience, data quality tools enable users to efficiently tackle a wide range of data quality problems. Without such tools, it would be costly, time-consuming, or even impossible to remediate such problems.

Conclusions: Where to act
- Act at the source (of problems), cf. Data Tracking
- BPR > Data Firewall
BPR: Business Process Reengineering

Annex: Window Key creation, character choice routines

Annex: Window Key
Size distribution is important: matching complexity = O([window size]²)

Annex: Field by field comparison routines
Commercial tool example (deterministic)

Annex: Dimensions of Data Quality
- Relevant
- Accurate
- Timely
- Complete
- Understood
- Trustworthy