Top 20 Data Quality Solutions for Data Science

Top 20 Data Quality Solutions for Data Science
Data Science & Business Analytics Meetup, Denver, CO
2015-01-21
Ken Farmer

DQ Problems for Data Science Loom Large & Frequently

Impacts include:
- Strikingly visible defects
- Bad decisions
- Loss of credibility
- Loss of revenue

Types of problems include:
- Requirement & design defects
- Misinterpretation errors
- Source data defects
- Process errors

[Chart: widget counts per month over 24 months]

Let's Talk About Solutions

But keep in mind... proportionality is important.

Solution #1: Quality Assurance (QA)

"If it's not tested it's broken" - Bruce Eckel

Tests of application and system behavior and of input assumptions, performed after application changes.

Programmer testing:
- Unit testing
- Functional testing

QA team testing:
- Black box testing
- Regression testing
- System and integration testing

Data scientist testing:
- All of the above, especially the programmer testing
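As a minimal sketch of the programmer-testing layer (the transform and test names are hypothetical, not from the talk), a parameterized unit test for a small cleansing function might look like:

    # test_transforms.py - hypothetical unit test for a small cleansing transform
    import pytest

    def normalize_os(value: str) -> str:
        """Map free-form OS strings to a canonical value (illustrative rule only)."""
        value = value.strip().lower()
        if "2000" in value or "2k" in value:
            return "win2000"
        return "unknown"

    @pytest.mark.parametrize("raw,expected", [
        ("Win2k", "win2000"),
        ("Windows 2000 SP4", "win2000"),
        ("  ms win 2000 server ed ", "win2000"),
        ("Solaris 10", "unknown"),
    ])
    def test_normalize_os(raw, expected):
        assert normalize_os(raw) == expected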

Solution #2: Quality Control (QC)

Because the environment & inputs are ever-changing.

- Track & analyze record counts: identify partial data sets, identify upstream changes
- Track & analyze rule counts: identify upstream changes
- Track & analyze replication & aggregation consistency: identify process failures

[Chart: record counts across 20 loads]
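A minimal sketch of a record-count check (the thresholds, window size, and in-memory storage are assumptions; in practice the counts would come from a QC table):

    # Flag loads whose record count deviates sharply from the trailing average.
    from statistics import mean, stdev

    def flag_suspect_loads(counts, window=10, z_threshold=3.0):
        """Return indexes of loads more than z_threshold sigma from the trailing window."""
        suspects = []
        for i in range(window, len(counts)):
            history = counts[i - window:i]
            mu, sigma = mean(history), stdev(history)
            if sigma and abs(counts[i] - mu) / sigma > z_threshold:
                suspects.append(i)
        return suspects

    daily_record_counts = [2_400_000, 2_450_000, 2_430_000, 2_460_000, 2_440_000,
                           2_455_000, 2_445_000, 2_470_000, 2_450_000, 2_465_000,
                           1_200_000]   # a partial data set slips in here
    print(flag_suspect_loads(daily_record_counts))   # -> [10]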

Solution #3: Data Profiling

Find out what data looks like BEFORE you model it.

Save an enormous amount of time:
- quickly get the type, size, cardinality, unknown values
- share profiling results with users

Example deliverables:
- What is the distribution of data for every field?
- How do partitions affect data distributions?
- What correlations exist between fields?

Example profile of one field:

    Name:             f250_499
    Field Number:     59
    Wrong Field Cnt:  0
    Type:             string
    Min / Max:        C / M
    Unique Values:    12
    Known Values:     11
    Case:             mixed
    Min/Max/Mean Len: 1 / 1 / 1.0
    Top Values:       E (10056 occurrences), F (1299), G (969), C (358), H (120),
                      I (89), J (36), M (18), K (12), e (11), plus one further
                      value with 4 occurrences
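A minimal profiling sketch using pandas (the file name and columns are assumptions; the talk does not name a specific tool):

    import pandas as pd

    df = pd.read_csv("events.csv", dtype=str)

    for col in df.columns:
        series = df[col].dropna()
        print(f"--- {col} ---")
        print(f"type guess:    {pd.api.types.infer_dtype(series)}")
        print(f"unique values: {series.nunique()}")
        print(f"min / max:     {series.min()} / {series.max()}")
        print(f"length range:  {series.str.len().min()} - {series.str.len().max()}")
        print("top values:")
        print(series.value_counts().head(10))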

Solution #4: Organization

So people don't use the wrong data, or the right data incorrectly.

Manage historical data:
- Migrate old data
- Curate ad hoc files
- Segregate, name, document, eliminate

Simplify data models:
- Schemas
- Rules
- Consistency in naming, types, defaults
- Simplicity in relationships and values

Solution #5: Process Auditing

Because with a lot of moving parts come a lot of failures.

Addresses:
- Slow processes
- Broken processes

Helps identify:
- Duplicate loads
- Missing files
- Pipeline status

Features:
- Tracks all processes: start, stop, times, return codes, batch_ids
- Alerting

Challenge with streaming:
- Helps to create an artificial batch concept from a timestamp within the data

Example products (see the sketch below):
- Graphite - focus: resources
- Nagios - focus: process failure
- Or what's built into every ETL tool (the inspecting metaphor, not the searching metaphor)
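A minimal process-audit sketch (the table layout, names, and SQLite storage are assumptions for illustration): record start/stop times, status, and batch_id for each pipeline step.

    import sqlite3, time, traceback

    conn = sqlite3.connect("audit.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS process_audit
                    (batch_id TEXT, step TEXT, started REAL, ended REAL, status TEXT)""")

    def audited(step_name, batch_id, func, *args, **kwargs):
        """Run func, recording its start/stop times and outcome in the audit table."""
        started = time.time()
        status = "unknown"
        try:
            result = func(*args, **kwargs)
            status = "ok"
            return result
        except Exception:
            status = "failed: " + traceback.format_exc(limit=1)
            raise
        finally:
            conn.execute("INSERT INTO process_audit VALUES (?, ?, ?, ?, ?)",
                         (batch_id, step_name, started, time.time(), status))
            conn.commit()

    # usage: audited("load_widgets", "2015-01-21", load_widgets_func)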

Solution #6: Full Automation

Because manual processes account for a majority of problems.

Addresses:
- Duplicate data
- Missing data

Top causes:
- manual process restarts
- manual data recoveries
- manual process overrides

Solution characteristics:
- Test catastrophic failures
- Automate failure recoveries
- Consider Netflix's Chaos Monkey

Solution #7: Changed Data Capture

Because identifying changes is harder than most people realize.

Typical alternatives:

Application timestamp:
- pro: may already be there
- con: reliability challenges

Triggers:
- pro: simple for downstream
- con: reliability challenges
- con: requires changes to source

File-image delta:
- pro: very accurate
- con: implementation effort

[Diagram: file-image delta flow - source and target extracts are sorted and deduped, then a file delta & transform step classifies each row as same, insert, delete, or change (new vs old values).]
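A minimal file-image-delta sketch (file names, the id key, and CSV layout are assumptions): compare the previous and current extracts and classify rows as inserts, deletes, or changes.

    import csv

    def load_keyed(path, key="id"):
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    old_rows = load_keyed("extract_old.csv")
    new_rows = load_keyed("extract_new.csv")

    inserts = [new_rows[k] for k in new_rows.keys() - old_rows.keys()]
    deletes = [old_rows[k] for k in old_rows.keys() - new_rows.keys()]
    changes = [(old_rows[k], new_rows[k])
               for k in old_rows.keys() & new_rows.keys()
               if old_rows[k] != new_rows[k]]

    print(f"{len(inserts)} inserts, {len(deletes)} deletes, {len(changes)} changes")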

Solution #8: Master Data Management

Because sharing reference data eliminates many issues.

Addresses:
- Data consistency between multiple systems

Features:
- Centralized storage of reference data
- Versioning of data
- Access via multiple protocols

Example reference data: industry codes, states, lat/long geocoding, GeoIP, products, customers, sales regions.

[Diagram: a central Master Data Management hub feeding reference data to Finance, Billing, CRM, SFA, and DW systems.]

Solution #9: Extrapolate for Missing Data

Because if done well it can simplify queries.

Features:
- Requires sufficient data to identify a pattern
- Identify generated data (see Quality Indicator & Dimension)
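A minimal sketch (the series and column names are hypothetical): fill gaps in a monthly series by interpolation and flag the generated rows so they can be identified later, as Solution #15 suggests.

    import pandas as pd
    import numpy as np

    monthly = pd.Series([1200, 1250, np.nan, np.nan, 1400, 1450],
                        index=pd.period_range("2014-07", periods=6, freq="M"))

    filled = monthly.interpolate()     # linear interpolation across the gap
    generated = monthly.isna()         # quality flag: True where the value was generated

    print(pd.DataFrame({"widgets": filled, "is_extrapolated": generated}))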

Solution #10: Crowd-sourced Data Cleansing

Because cleansing & enriching data can benefit from consensus.

Features:
- Collect consensus answers from workers
- Workers can be from an external market
- Workers can be from your team

Example products: CrowdFlower, Mechanical Turk

Simple data scenario:
- Correct obvious problems: spelling, grammar, missing descriptions
- Use public resources

Complex data scenario:
- Correct sophisticated problems: provide scores, provide descriptions, correct complex data
- Use internal & external resources
- Leverage crowdsourcing services for coordination

Solution #11: Keep Original Values

Because your transformations will fail & you will change your mind.

Options:

Keep an archive copy of the source data:
- Pro: can be very highly compressed
- Pro: can be kept off-server
- Con: cannot be easily queried

Or keep it alongside the transformed data:
- Pro: can be easily queried
- Con: may be used when it should not be
- Con: may double the volume of data to host

Example (original value kept next to the cleansed value):

    os_orig                  os
    Win2k                    win2000
    MS Win                   win2000
    Win 2000                 win2000
    Windoze 2k               win2000
    Windows 2000 SP4         win2000
    MS Win 2k SP2            win2000
    ms win 2000 server ed    win2000
    win2ksp3                 win2000
    Win 2k server sp9        win2000

Solution #12: Keep Usage Logs

Because knowing who got data can help you minimize impact.

Solution types:

Log application queries:
- Pro: database audit tools can do this
- Con: requires process audit logs to translate queries into data content

Store the data that was delivered:
- Pro: can precisely identify who got what, and when
- Con: requires development; only works with certain tools (ex: RESTful API)
- Con: doesn't show what wasn't delivered

Solution #13: Cleanup at Source

Because it's cheaper to clean at the source than downstream.

Always be prepared to:
- Clean & scrub data en route to the target database

But always try to:
- Give clean-up tasks to the source system

Solution #14: Static vs Dynamic, Strong vs Weak Typing & Structure

Because this is debated endlessly.

Static vs dynamic schemas:
- Dynamic examples: MongoDB, JSON in Postgres, etc.
- Dynamic schemas optimize for the writer at the cost of the reader

Static vs dynamic typing:
- Static typing provides fewer defects*
- But maybe not better data quality

Declarative constraints:
- Ex: primary key, foreign key, uniqueness, and check constraints
- Code: ALTER TABLE foo ADD CONSTRAINT ck1 CHECK (open_date <= close_date)
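A minimal sketch of the declarative-constraints point (SQLite is used purely for illustration; the slide's DDL targets a generic SQL database): the database itself rejects rows that violate the rule, instead of relying on downstream cleansing.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE foo (
                       id         INTEGER PRIMARY KEY,
                       open_date  TEXT NOT NULL,
                       close_date TEXT,
                       CONSTRAINT ck1 CHECK (open_date <= close_date))""")

    conn.execute("INSERT INTO foo VALUES (1, '2015-01-01', '2015-02-01')")      # ok

    try:
        conn.execute("INSERT INTO foo VALUES (2, '2015-03-01', '2015-02-01')")  # violates ck1
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)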

Solution #15: Data Quality Indicator & Dimension

Because it allows users to know what is damaged.

Example: a single id that represents the status of multiple fields.

Bitmap example:
- a 16-bit integer bitmap
- each bit represents a single field
- bit value of 0 == field ok, bit value of 1 == field damaged
- translate the integer to field status with a UDF or lookup table

    quality_id          col1  col2  col3  ... coln
    0000000000000000    ok    ok    ok
    0000000000000001    bad   ok    ok
    0000000000000010    ok    bad   ok
    0000000000000011    bad   bad   ok
    0000000000000100    ok    ok    bad
    0000000000000101    bad   ok    bad
    0000000000000111    bad   bad   bad
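A minimal decoding sketch (the field order and the "1 == damaged" convention are assumptions made for illustration):

    FIELDS = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"]

    def decode_quality(quality_id: int) -> dict:
        """Return {field: 'ok' | 'damaged'} for each field encoded in the bitmap."""
        return {field: ("damaged" if quality_id & (1 << bit) else "ok")
                for bit, field in enumerate(FIELDS)}

    print(decode_quality(0b0000000000000101))
    # {'col1': 'damaged', 'col2': 'ok', 'col3': 'damaged', 'col4': 'ok', ...}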

Solution #16: Generate Test Data

Because production data is of limited usefulness.

Various types:

Deterministic:
- Contains very consistent data
- Great for benchmarking
- Cheap to build

Realistic:
- Produced through a simulation
- Great for demos
- Great for sanity-checking analysis
- Hard to build

Extreme:
- Contains extreme values
- Great for bounds-checking
- Cheap to build
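A minimal sketch of generating deterministic and extreme test data (the schema and value ranges are assumptions, not from the talk):

    import csv, random

    random.seed(42)   # deterministic: the same data every run

    with open("test_widgets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["widget_id", "month", "quantity"])
        for widget_id in range(1, 101):
            for month in range(1, 13):
                writer.writerow([widget_id, month, random.randint(1_000, 5_000)])
        # extreme rows for bounds-checking
        writer.writerow([999999, 12, 0])
        writer.writerow([999999, 13, 2**31 - 1])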

Solution #17: Use the Data!

Data unused:
- Will decay over time
- Will lose pieces of data
- Will lose metadata

Data in a report:
- Looks great
- Obvious defects can be seen
- But it doesn't drive anything, so it doesn't get tested

Data driving a business process:
- Looks great
- Gets tested & corrected in order to run the business

Solution #18: Push Transformations Upstream

Because you don't want to become source-implementation experts.

Move transformations from data scientists to ETL developers:
- Eliminates inconsistencies between runs

Move transformations from ETL developers to the source system:
- Eliminates unnecessary source-system knowledge
- Decouples systems

[Diagram: scientists consuming from the DW/Hadoop layer, with transformations moving first from each scientist into the ETL layer, then from the ETL layer into the source system.]

Solution #19: Documentation (Metadata)

Field metadata:
- Name
- Description
- Type
- Length
- Unknown value
- Case
- Security

Extended metadata:
- Lineage
- Data profile
- Common values/codes: their meaning, their frequency
- Validation rules & results
- Transformation rules

Source code:

    if gender == 'm':
        return 'male'
    else:
        return 'female'

Report & tool documentation:
- Description
- Filtering rules
- Transformation rules

Solution #20: Data Defect Tracking

Because you won't remember why a set of data was bad in 6 months.

Like bug-tracking, but for sets of data:
- Will explain anomalies later
- Can be used for data annotation
- Is simple, it just needs to be used

Bonus Solution #21: Change the Culture (ha)

Because you need support for priorities, resources, and time.

The single most important thing to do (this is 90% of it):
- Establish a policy of transparency
- Share data with customers, stakeholders, owners, users

Everything else follows from transparency:
- Establish a policy of automation
- Establish a policy of measuring
- Plus everything we already covered

What doesn't work?
- Asking management to mandate quality

Resources & Thanks

- International Association for Information & Data Quality (IAIDQ): http://www.iqtrainwrecks.com/
- Improving Data Warehouse and Business Information Quality, Larry English
- The Data Warehousing Institute (TDWI)
- (*) A Large Scale Study of Programming Languages and Code Quality in Github: http://macbeth.cs.ucdavis.edu/lang_study.pdf

Solution List (effort annotations by stage, as they appear on the slide: Source, ETL, DEST/DW, Consume)

1. QA - LOW, MEDIUM
2. QC - MEDIUM, MEDIUM
3. Data Profiling - Source
4. Organize
5. Process Auditing - MEDIUM
6. Full Automation
7. Changed Data Capture
8. MDM
9. Extrapolate Missing Data
10. Crowdsource Cleansing - MEDIUM
11. Keep original values - MEDIUM
12. Keep usage logs
13. Cleanup at source
14. Static vs Dynamic - MEDIUM, LOW
15. DQ Dimension
16. Generate Test Data
17. Use the Data
18. Push Transforms Upstream
19. Documentation - MEDIUM, MEDIUM
20. Data Defect Tracking
21. Change the Culture