Big Data and Analytics at the IRS:




Big Data and Analytics at the IRS: Perspectives and Initiatives
Government Big Data Symposium, March 5-6, 2013
Jeff Butler, Director, Research Databases
IRS, Research, Analysis, and Statistics
jeff.butler@irs.gov

Background

The Internal Revenue Service (IRS) has a large service and enforcement footprint. The figures below are from FY 2011.

Areas covered: Tax Return Processing, Account Management, Customer Service, Enforcement

  234 million tax returns filed
  1.8 billion third-party information returns
  $2.4 trillion in gross receipts
  122 million refunds totaling $415 billion
  319 million visits to IRS website
  83 million toll-free telephone calls
  223 million letters or notices sent to taxpayers
  $116 billion in accounts receivable

Types of Research and Analysis

Taxpayer Behavior
  Failure to file or pay
  Abusive tax shelters
  Identity theft
  Return preparer compliance
  Misreporting income or deductions
  Refund fraud
  Off-shore transactions
  Financial crimes

Analytic Initiatives
  Identify patterns of filing and payment non-compliance
  Predict and prevent ID theft and refund fraud
  Estimate U.S. tax gap
  Measure taxpayer burden
  Optimize case inventories and treatment strategies
  Simulate effects of tax changes
  Analyze criminal networks

Analytic Data Environment in IRS

IRS enterprise IT manages hundreds of transactional systems and applications.
The research organization integrates legacy and third-party data into the Compliance Data Warehouse (CDW).

Compliance Data Warehouse (CDW): Selected Metrics

  Total data size                              ~ 1.3 PB
  Number of database tables                    ~ 3,100
  Number of unique columns                     ~ 52,500
  Number of searchable metadata attributes     > 1 million
  Number of users                              ~ 1,020
  Average daily queries                        ~ 6,500

IRS Analytic Data Environment: Compliance Data Warehouse (CDW)

[Diagram: layered view of the CDW]

Analytic Sandboxes (Examples)
  Case Optimization
  Predictive Modeling
  Text Analytics
  Simulation

Data Integration Layer
Enterprise Data Integration Layer

Core Analytic Database
  Data
  Statistical & Mathematical Analysis
  Storage Mgmt
  Security/Audit Monitoring
  Ad-Hoc Query and Reporting

Infrastructure and Services
  System Admin
  Software Config
  Accounts
  Metadata
  Data Profiling
  Data Extracts, Matching
  Web Services
  Training & Support

IRS Analytic Data Environment: Compliance Data Warehouse (CDW)

[Diagram: physical view of the CDW]

  Core Database Servers (Sybase IQ, Oracle, SQL Server)
  Shared Storage (>2PB) (DB, Backup, Staging, User)
  Application/Web Servers (SAS, R, Hyperion)
  IRS Network
  Users & Projects
  Systems & Applications
  Analytic Sandboxes
  Other Tools

Scale (Volume)

[Chart: Data Size (Terabytes), 2005-2013]
[Chart: Average Daily Queries, 2005-2012; series: Third-Party Tools, Web-Based]

Not all infrastructure/service costs are constant in scale.
Massively large environments can have asymmetric challenges.

Infrastructure and service areas shown:
  Systems & Storage Management
  ETL & Database Administration
  Metadata & Web Services
  Security Audit and Monitoring
  Tools, Training, & Support
  Analytic Sandboxes

Challenges with Scale

I/O bottlenecks when data are off-loaded for analytics
  Single biggest problem for users in massively large environments
  Strategy: Maximize in-database analytics where possible

Finding the optimal mix of ETL tools and techniques
  This is still where data warehousing costs are highest
  Strategy: Stay nimble and avoid one-size-fits-all solutions

Choosing the right database technology
  Is it performance or scale that's really needed?
  CDW is the largest database in the IRS and still uses a columnar DB
  Strategy: Maximize performance for users at the smallest O&M cost

Storage management
  A different approach is needed in a user-based analytic environment
  Strategy: Partition file systems based on user intensity
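The in-database strategy above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the analytic database; the `returns` table, its columns, and the sample rows are hypothetical and are not taken from the CDW. The off-load version pulls every row to the client before aggregating, while the in-database version lets only one summary row per year leave the database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of hypothetical return records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE returns (tax_year INTEGER, filing_status TEXT, refund REAL)"
    )
    rows = [
        (2010, "single", 1200.0),
        (2010, "joint", 2500.0),
        (2011, "single", 900.0),
        (2011, "single", 1100.0),
        (2011, "joint", 3000.0),
    ]
    conn.executemany("INSERT INTO returns VALUES (?, ?, ?)", rows)
    return conn


def mean_refund_offloaded(conn: sqlite3.Connection) -> dict:
    """Off-load pattern: transfer every row, then aggregate on the client."""
    totals = {}
    for year, refund in conn.execute("SELECT tax_year, refund FROM returns"):
        count, total = totals.get(year, (0, 0.0))
        totals[year] = (count + 1, total + refund)
    return {year: total / count for year, (count, total) in totals.items()}


def mean_refund_in_database(conn: sqlite3.Connection) -> dict:
    """In-database pattern: the engine aggregates; one row per year is returned."""
    query = "SELECT tax_year, AVG(refund) FROM returns GROUP BY tax_year"
    return dict(conn.execute(query).fetchall())


conn = build_demo_db()
offloaded = mean_refund_offloaded(conn)
in_db = mean_refund_in_database(conn)
print(in_db)
```

Both functions give the same answer. The difference is the volume of data crossing the I/O boundary, which grows with table size in the first case and with the number of groups in the second.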

Timeliness (Velocity)

[Chart: Data Arrival Rate (Annual, Quarterly, Monthly, Weekly, Daily), 2003-2013]
[Chart: Ingest-Release Latency, 2005-2012]

Data arrival rates are different from data delivery rates.
Minimizing this difference is inherently an ETL problem.

Data Extract/Feed -> Validation/Preprocessing -> Integration/Post-processing -> Analysis/Modeling -> Interpretation/Action
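As a rough illustration of the gap between arrival and delivery, ingest-release latency can be computed per data set from two timestamps. The feed names and dates below are hypothetical, and measuring latency in days is an assumption, since the chart's units are not stated.

```python
from datetime import date
from statistics import mean

# Hypothetical log: (data set name, date received, date released to users).
feeds = [
    ("feed_a", date(2012, 1, 3), date(2012, 1, 24)),
    ("feed_b", date(2012, 2, 1), date(2012, 2, 11)),
    ("feed_c", date(2012, 3, 15), date(2012, 3, 20)),
]


def latency_days(received: date, released: date) -> int:
    """Days between a data set arriving and being released for analysis."""
    if released < received:
        raise ValueError("release date precedes arrival date")
    return (released - received).days


latencies = {name: latency_days(received, released)
             for name, received, released in feeds}
average_latency = mean(latencies.values())
print(latencies, average_latency)
```

Tracking this number per feed over time shows whether ETL changes are actually narrowing the difference between arrival and delivery rates.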

Challenges with Velocity

The larger the data size, the longer the processing time
  Let $P_{ij}$ and $S_{ij}$ be the processing time and size of data set $i$ with frequency $j$, where $i, j = 1, 2, \ldots, n$
  The problem is $\arg\min_{\theta_{ij}} \; (P \mid S)_{ij} + \varepsilon_{ij}$
  Processing time varies with scale (and complexity)
  Disturbances $\varepsilon_{ij}$ are unavoidable (e.g., server maintenance)

Data may require validation, standardization, and cleaning
  No two data sets are the same

Structured vs. unstructured data
  What is the impact of frequent schema changes on data delivery times for structured data?
  Do skills exist for processing unstructured data at any speed?
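One way to make the relationship between processing time $P$ and size $S$ concrete is to fit a simple model to observed loads. The sketch below assumes a linear form, $P = a + bS + \varepsilon$, which the slide does not specify; the sizes and hours are invented for illustration. The residuals play the role of the disturbances $\varepsilon$.

```python
def fit_line(sizes, times):
    """Ordinary least squares fit of times = intercept + slope * sizes."""
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two paired observations")
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    sxx = sum((s - mean_s) ** 2 for s in sizes)
    if sxx == 0:
        raise ValueError("all sizes are identical; slope is undefined")
    sxy = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_s
    return intercept, slope


# Hypothetical observations: data set size (TB) and processing time (hours).
sizes_tb = [1.0, 2.0, 4.0, 8.0]
hours = [3.2, 4.9, 9.1, 16.8]

intercept, slope = fit_line(sizes_tb, hours)

# Residuals: the part of processing time not explained by size.
residuals = [t - (intercept + slope * s) for s, t in zip(sizes_tb, hours)]
print(f"hours = {intercept:.2f} + {slope:.2f} * TB")
```

A fit like this only describes how time scales with size. Large residuals on particular loads point to the other factors listed above, such as complexity, cleaning effort, or maintenance windows.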

Heterogeneity (Variety)

Sources of IRS Data
  Taxpayers, Employers, Preparers, Banks, Brokers, Non-Profits, Interagency, Fed/State, Treaty Partners, Intermediaries

Types of IRS Data
  Forms, Schedules, Worksheets, Attachments, Images, Correspondence, Transactions, Phone Calls, Notices, Transcripts

Source Systems and Data Formats
  Mainframe, Unix/Linux, Windows, Databases, VSAM, Flat Files, Applications
  DB tables, Fixed format, Hierarchical, Delimited, Packed decimal, XML, Plain text

The overwhelming majority of IRS data are still structured.
Most transaction systems are still file-based.
Challenge: skills needed to parse and analyze text
  Information extraction and entity resolution techniques (NLP)
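Fixed-format and packed-decimal data are the least familiar of the formats listed, so a small parsing sketch may help. The record layout (a 10-byte name, a 4-byte year, a 3-byte packed amount) is hypothetical and does not describe any IRS file. Text is assumed to be EBCDIC, decoded with Python's standard cp037 codec, and the amount is assumed to be IBM packed decimal (COMP-3) with two implied decimal places.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode an IBM packed-decimal field: two digits per byte, sign in the last nibble."""
    if not raw:
        raise ValueError("empty packed-decimal field")
    digits = []
    for byte in raw[:-1]:
        digits.extend((byte >> 4, byte & 0x0F))
    digits.append(raw[-1] >> 4)          # high nibble of the last byte is a digit
    sign = raw[-1] & 0x0F                # low nibble of the last byte is the sign
    if any(d > 9 for d in digits) or sign < 0x0A:
        raise ValueError("invalid packed-decimal data")
    value = int("".join(str(d) for d in digits))
    if sign in (0x0B, 0x0D):             # B and D mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)


def parse_record(record: bytes) -> dict:
    """Slice a 17-byte fixed-format record using the hypothetical layout."""
    if len(record) != 17:
        raise ValueError("unexpected record length")
    return {
        "name": record[0:10].decode("cp037").rstrip(),
        "tax_year": int(record[10:14].decode("cp037")),
        "amount": unpack_comp3(record[14:17], scale=2),
    }


# Build one sample record: EBCDIC text fields followed by packed 123.45.
sample = (
    "SMITH".ljust(10).encode("cp037")
    + "2011".encode("cp037")
    + bytes.fromhex("12345C")
)
parsed = parse_record(sample)
print(parsed)
```

Using `Decimal` rather than `float` keeps currency amounts exact, which matters when packed fields are later summed or compared.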

Metadata and Information Quality

Searchable Metadata: Framework and Strategy
  A simple reference model is used to guide consistency of searchable artifacts
  Combination of system, contextual, and application attributes
  Controlled vocabulary for key descriptive elements

[Chart: Columns and Columns with Metadata, 2005-2012]

Strategy favors basic discoverability rather than systematized collections.
Data for analytics must be searchable, understandable, and semantically consistent.
Metadata is the nucleus of any data quality strategy.
Trust and confidence in data should be invariant to scale.

Metadata and Information Quality: Stages of Metadata Collection

[Diagram: data supply chain with metadata captured at each stage]

Data flow: Source Systems (Database, Flat File, VSAM) -> Extract, Transform, Load, Validate -> Staging -> DW -> Roll-Ups -> Query, Analysis, Reporting

Metadata collected: Source Metadata, ETL/T Metadata, Data Model Metadata, Report Metadata -> Central Metadata Repository

Metadata are collected at each stage of the data supply chain.
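The idea of collecting metadata at each stage into one searchable repository can be sketched as follows. The class, stage names, artifact names, and attributes are all hypothetical and only illustrate the pattern; they do not describe the CDW's actual repository.

```python
from dataclasses import dataclass, field

# Stages mirror the four metadata types named above (labels are assumptions).
STAGES = ("source", "etl", "data_model", "report")


@dataclass
class MetadataRepository:
    """A toy central repository holding one record per collected artifact."""
    records: list = field(default_factory=list)

    def collect(self, stage: str, artifact: str, attributes: dict) -> None:
        """Store metadata for an artifact, tagged with the stage that produced it."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.records.append({**attributes, "stage": stage, "artifact": artifact})

    def search(self, term: str) -> list:
        """Case-insensitive substring search across all attribute values."""
        term = term.lower()
        return [r for r in self.records
                if any(term in str(value).lower() for value in r.values())]


repo = MetadataRepository()
repo.collect("source", "returns_extract.dat",
             {"format": "fixed", "description": "Annual extract of filed returns"})
repo.collect("etl", "load_returns",
             {"rule": "trim names; convert packed decimal refund amount"})
repo.collect("data_model", "returns.refund_amt",
             {"data_type": "DECIMAL(12,2)", "definition": "Refund amount issued"})
repo.collect("report", "refunds_by_year",
             {"definition": "Roll-up of refund totals by tax year"})

hits = repo.search("refund")
print([h["stage"] for h in hits])
```

Because every record carries its stage, a single search can trace a data element from the transformation rule through the warehouse column to the reports built on it.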

Metadata and Information Quality

System Metadata: physical properties, data movement, ETL/T, and workflow artifacts
  Source System Characteristics
    System properties
    File or table names
    Data element names and definitions
    Data types
    Transformation rules
    Cross-references
  Target System Properties
    Table names
    Column names
    Data types
    Indexes
    Partitions or table spaces

Contextual Metadata: attributes, references, and other searchable content
  Data Attributes
    Authoritative system
    Data element name and definition
    Availability
    Data type
    Join paths
    Legacy source reference
    User reviews
    Links to context-dependent data
  Publishing Standards
    Web-based
    Standard format
    Hierarchical and free-form search

Application Metadata: context-dependent logic, conditional rules, and dynamic processing
  Web-Based Logic
    Reports and roll-ups
    Lookup tables
    URLs and other links
    External communication
  Profiling
    Frequencies
    Statistical distributions
    Trend analysis
    Geographic maps
  Reviews
    User ID
    Table/column reference
    Feedback
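The profiling items listed above (frequencies and statistical distributions) can be sketched for a single column. This is a minimal illustration using only the Python standard library; the function name, the output keys, and the sample values are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def profile_column(values: list) -> dict:
    """Summarize one column: counts, missing values, frequencies, basic statistics."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "top": Counter(present).most_common(3),   # most frequent values
    }
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    # Add distribution statistics only when every present value is numeric.
    if numeric and len(numeric) == len(present):
        profile.update(
            min=min(numeric),
            max=max(numeric),
            mean=mean(numeric),
            median=median(numeric),
        )
    return profile


# Hypothetical columns with one missing value each.
status_profile = profile_column(["single", "joint", "single", None, "single", "joint"])
refund_profile = profile_column([100.0, 250.0, None, 250.0])
print(status_profile)
print(refund_profile)
```

On a large warehouse the same summaries would normally be computed inside the database and only the resulting profile stored as metadata, in line with the in-database strategy discussed earlier.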

Workforce Skills: Techniques Used by IRS Analysts

  Regression-based methods (GLM, logistic, quantile, non-linear, proportional hazards)
  Social network analysis, graph theory
  Machine learning (neural networks, SVMs, genetic algorithms)
  Multivariate statistical methods (discriminant analysis, clustering, density estimation, factor analysis)
  Simulation (Monte Carlo, MCMC, agent-based modeling)
  Decision trees (CART, CHAID, C5, hybrids)
  Bayes rules and other classifiers
  Variance estimation with complex samples

Workforce Skills

Analysts:
  Use of advanced SQL techniques to avoid off-loading data for analytics (in-database computing)
  Understanding and leveraging Open Source tools

IT Staff:
  Literacy in non-traditional computing architectures
  Support for Open Source tools and analytic databases
  Ability to quickly build and deploy analytic sandboxes

This is different from typical BI/report/dashboard environments
  Emphasis on algorithms, not just information distribution

Key is multi-disciplinary skills
  Nexus of statistics, computer science, economics, IT

Data Privacy and Security

IRS analytics are done behind the firewall, but data still moves
  Data off-loaded to laptops, servers, sandboxes
  External access (Treasury, Congress, universities)

Permissions management in shared disk environment
  Gets more complex with more users and data

Security trade-offs and challenges
  Impact of system- and application-level policy changes
  How much continuous monitoring and auditing?
  FISMA and the documentation dilemma
  Relationship between encryption and performance