Big Data and Analytics at the IRS: Perspectives and Initiatives
Government Big Data Symposium, March 5-6, 2013
Jeff Butler, Director, Research Databases
IRS, Research, Analysis, and Statistics
jeff.butler@irs.gov
Background
The Internal Revenue Service (IRS) has a large service and enforcement footprint. The table below is from FY 2011.

Tax Return Processing
  234 million tax returns filed
  1.8 billion third-party information returns
Account Management
  $2.4 trillion in gross receipts
  122 million refunds totaling $415 billion
Customer Service
  319 million visits to IRS website
  83 million toll-free telephone calls
Enforcement
  223 million letters or notices sent to taxpayers
  $116 billion in accounts receivable
Types of Research and Analysis

Taxpayer Behavior
  Failure to file or pay
  Abusive tax shelters
  Identity theft
  Return preparer compliance
  Misreporting income or deductions
  Refund fraud
  Off-shore transactions
  Financial crimes
Analytic Initiatives
  Identify patterns of filing and payment non-compliance
  Predict and prevent ID theft and refund fraud
  Estimate U.S. tax gap
  Measure taxpayer burden
  Optimize case inventories and treatment strategies
  Simulate effects of tax changes
  Analyze criminal networks
Analytic Data Environment in IRS
  IRS enterprise IT manages hundreds of transactional systems and applications
  Research organization integrates legacy and third-party data into the Compliance Data Warehouse (CDW)

Compliance Data Warehouse (CDW): Selected Metrics
  Total data size                              ~ 1.3 PB
  Number of database tables                    ~ 3,100
  Number of unique columns                     ~ 52,500
  Number of searchable metadata attributes     > 1 million
  Number of users                              ~ 1,020
  Average daily queries                        ~ 6,500
IRS Analytic Data Environment
[Diagram: Compliance Data Warehouse (CDW)]
  Analytic Sandboxes (Examples): Case Optimization, Predictive Modeling, Text Analytics, Simulation
  Data layers: Enterprise Data, Data Integration Layer, Core Analytic Database
  Analysis: Statistical & Mathematical Analysis, Ad-Hoc Query and Reporting
  Infrastructure and Services: System Admin, Storage Mgmt, Security/Audit Monitoring, Software Config, Accounts, Metadata, Data Profiling, Data Extracts, Matching, Web Services, Training & Support
IRS Analytic Data Environment
[Diagram: Compliance Data Warehouse (CDW)]
  Core Database Servers (Sybase IQ, Oracle, SQL Server)
  Shared Storage (>2 PB) (DB, Backup, Staging, User)
  Application/Web Servers (SAS, R, Hyperion)
  Connected over the IRS Network: Users & Projects, Systems & Applications, Analytic Sandboxes, Other Tools
Scale (Volume)
[Chart: Data Size (Terabytes), 2005-2013]
[Chart: Average Daily Queries, 2005-2012, split into Third-Party Tools and Web-Based]
  Not all infrastructure/service costs are constant in scale
  Massively large environments can have asymmetric challenges
    Systems & Storage Management
    ETL & Database Administration
    Metadata & Web Services
    Security Audit and Monitoring
    Tools, Training, & Support
    Analytic Sandboxes
Challenges with Scale
  I/O bottlenecks when data are off-loaded for analytics
    Single biggest problem for users in massively large environments
    Strategy: Maximize in-database analytics where possible
  Finding the optimal mix of ETL tools and techniques
    This is still where data warehousing costs are highest
    Strategy: Stay nimble and avoid one-size-fits-all solutions
  Choosing the right database technology
    Is it performance or scale that's really needed?
    CDW is the largest database in the IRS and still uses a columnar DB
    Strategy: Maximize performance for users at the smallest O&M cost
  Storage management
    A different approach is needed in a user-based analytic environment
    Strategy: Partition file systems based on user intensity
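The "maximize in-database analytics" strategy can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for an analytic database; the `returns` table, its columns, and the sample rows are hypothetical and are not taken from CDW. The first function pulls every row across the connection before aggregating, while the second lets the database engine aggregate and transfers only one row per group.

```python
import sqlite3

# In-memory SQLite stands in for the analytic database (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (filing_year INTEGER, refund_amount REAL)")  # hypothetical table
rows = [(2010, 1200.0), (2010, 800.0), (2011, 1500.0), (2011, 500.0), (2011, 1000.0)]
conn.executemany("INSERT INTO returns VALUES (?, ?)", rows)


def offloaded_mean_refund(conn):
    """Off-load pattern: transfer every row to the client, then aggregate there."""
    totals = {}
    for year, amount in conn.execute("SELECT filing_year, refund_amount FROM returns"):
        count, total = totals.get(year, (0, 0.0))
        totals[year] = (count + 1, total + amount)
    return {year: total / count for year, (count, total) in totals.items()}


def in_database_mean_refund(conn):
    """In-database pattern: the engine aggregates; only one row per year is transferred."""
    query = "SELECT filing_year, AVG(refund_amount) FROM returns GROUP BY filing_year"
    return dict(conn.execute(query))


off = offloaded_mean_refund(conn)
indb = in_database_mean_refund(conn)
```

Both functions return the same answer. The difference is the volume of data that leaves the database, which is where the I/O bottleneck arises at scale.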
Timeliness (Velocity)
[Chart: Data Arrival Rate (Annual, Quarterly, Monthly, Weekly, Daily), 2003-2013]
[Chart: Ingest-Release Latency, 2005-2012]
  Data arrival rates are different from data delivery rates
  Minimizing this difference is inherently an ETL problem
  Data Extract/Feed -> Validation/Integration/Preprocessing -> Post-processing -> Analysis/Modeling -> Interpretation/Action
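As a minimal sketch of how ingest-release latency might be measured, the example below takes latency to be the number of days between a feed's arrival and its release to users. The feed names and dates are hypothetical, and measuring in days is an assumption, since the unit on the slide's chart is not recoverable.

```python
from datetime import date
from statistics import mean

# Hypothetical arrival (ingest) and release dates for three data feeds.
feeds = [
    {"name": "feed_a", "arrived": date(2012, 1, 3), "released": date(2012, 1, 24)},
    {"name": "feed_b", "arrived": date(2012, 2, 1), "released": date(2012, 2, 15)},
    {"name": "feed_c", "arrived": date(2012, 3, 5), "released": date(2012, 3, 12)},
]


def latency_days(feed):
    """Days elapsed between a feed's arrival and its release to users."""
    return (feed["released"] - feed["arrived"]).days


latencies = {feed["name"]: latency_days(feed) for feed in feeds}
average_latency = mean(list(latencies.values()))
```

Tracking this value for each feed over time shows whether delivery is keeping pace with arrival.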
Challenges with Velocity
  The larger the data size, the longer the processing time
    Let $P_{ij}$ and $S_{ij}$ be the processing time and size of data set $i$ with frequency $j$, for $i, j = 1, 2, \ldots, n$
    The problem is $\arg\min_{\theta_{ij}} \, (P \mid S)_{ij} + \varepsilon_{ij}$
    Processing time varies with scale (and complexity)
    Disturbances $\varepsilon_{ij}$ are unavoidable (e.g., server maintenance)
  Data may require validation, standardization, and cleaning
    No two data sets are the same
  Structured vs. unstructured data
    What is the impact of frequent schema changes on data delivery times for structured data?
    Do skills exist for processing unstructured data at any speed?
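The slide does not give a functional form for how processing time depends on data size. As one possible illustration only, the sketch below assumes a linear relationship, P = a + b*S + e, and fits it by ordinary least squares; the residuals then play the role of the disturbances. The sizes and times are made-up values, and the linear form is an assumption rather than the model used at the IRS.

```python
def fit_line(sizes, times):
    """Ordinary least squares fit of times = intercept + slope * sizes."""
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_p = sum(times) / n
    sxx = sum((s - mean_s) ** 2 for s in sizes)
    sxy = sum((s - mean_s) * (p - mean_p) for s, p in zip(sizes, times))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_s
    return intercept, slope


# Hypothetical observations: data set size (TB) and processing time (hours).
sizes_tb = [1.0, 2.0, 4.0, 8.0]
hours = [3.1, 4.9, 9.2, 16.8]

intercept, slope = fit_line(sizes_tb, hours)

# Residuals: the part of processing time not explained by size (the disturbances).
residuals = [p - (intercept + slope * s) for s, p in zip(sizes_tb, hours)]

# Extrapolated processing time for a 16 TB data set under the assumed linear model.
predicted_16tb = intercept + slope * 16.0
```

Large residuals for a particular data set would point to factors other than size, such as complexity or a maintenance window.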
Heterogeneity (Variety)

Sources of IRS Data
  Taxpayers, Employers, Preparers, Banks, Brokers, Non-Profits, Interagency, Fed/State, Treaty Partners, Intermediaries
Types of IRS Data
  Forms, Schedules, Worksheets, Attachments, Images, Correspondence, Transactions, Phone Calls, Notices, Transcripts
Source Systems and Data Formats
  Systems: Mainframe, Unix/Linux, Windows, Databases, VSAM, Flat Files, Applications
  Formats: DB tables, Fixed format, Hierarchical, Delimited, Packed decimal, XML, Plain text

  Overwhelming majority of IRS data are still structured
  Most transaction systems are still file-based
  Challenge: skills needed to parse and analyze text
    Information extraction and entity resolution techniques (NLP)
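Two of the formats listed, fixed format and packed decimal, often appear together in file-based records. The sketch below decodes a packed-decimal field (two binary-coded decimal digits per byte, with the sign in the final half-byte) embedded in a fixed-width record. The record layout, field names, and the use of ASCII for the text field are hypothetical simplifications, not an actual IRS file layout.

```python
from decimal import Decimal


def unpack_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field: two BCD digits per byte, sign in the last nibble."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)      # high half-byte
        nibbles.append(byte & 0x0F)    # low half-byte
    sign_nibble = nibbles.pop()        # final nibble carries the sign
    if any(digit > 9 for digit in nibbles):
        raise ValueError("invalid digit nibble in packed-decimal field")
    value = int("".join(str(digit) for digit in nibbles) or "0")
    if sign_nibble in (0x0B, 0x0D):    # 0xD (and 0xB) mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)  # apply the implied decimal places


def parse_record(record: bytes) -> dict:
    """Parse a hypothetical 12-byte fixed-format record: 9-byte ID, 3-byte packed amount."""
    return {
        "account_id": record[0:9].decode("ascii"),
        "amount": unpack_packed_decimal(record[9:12], scale=2),
    }


# Example record: ID "000123456" followed by packed bytes 0x12 0x34 0x5C (+123.45).
example = parse_record(b"000123456" + bytes([0x12, 0x34, 0x5C]))
```

On a mainframe extract the text field would more likely be EBCDIC, in which case the `decode` call would need the matching code page.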
Metadata and Information Quality
[Chart: Searchable Metadata, 2005-2012: Columns vs. Columns with Metadata]
  Framework and Strategy
    Simple reference model is used to guide consistency of searchable artifacts
    Combination of system, contextual, and application attributes
    Controlled vocabulary for key descriptive elements
  Strategy favors basic discoverability rather than systematized collections
  Data for analytics must be searchable, understandable, and semantically consistent
  Metadata is the nucleus of any data quality strategy
  Trust and confidence in data should be invariant to scale
Metadata and Information Quality
Stages of Metadata Collection
[Diagram: data supply chain]
  Source Systems (Database, Flat File, VSAM) -> Extract, Transform, Load, Validate -> Staging -> DW -> Roll-Ups -> Query, Analysis, Reporting
  Source Metadata, ETL/T Metadata, Data Model Metadata, Report Metadata -> Central Metadata Repository
  Metadata are collected at each stage of the data supply chain
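The idea of collecting metadata at each stage of the data supply chain can be sketched as a small pipeline in which every stage writes a record to a shared repository. The class, function names, column names, and target table below are all hypothetical; this is a toy illustration of the pattern, not the CDW implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRepository:
    """Toy stand-in for a central metadata repository."""
    records: list = field(default_factory=list)

    def record(self, stage: str, **attributes) -> None:
        self.records.append({"stage": stage, **attributes})

    def search(self, term: str) -> list:
        """Free-form search: match the term against every stored attribute value."""
        term = term.lower()
        return [r for r in self.records if any(term in str(v).lower() for v in r.values())]


def extract(repo):
    rows = [{"TXPYR_ST": "va", "AMT": "10"}, {"TXPYR_ST": "md", "AMT": "x"}]
    repo.record("extract", source="flat file", row_count=len(rows), columns=sorted(rows[0]))
    return rows


def transform(rows, repo):
    out = [{"state": r["TXPYR_ST"].upper(), "amount": r["AMT"]} for r in rows]
    repo.record("transform", rule="TXPYR_ST -> state (upper-cased); AMT -> amount", row_count=len(out))
    return out


def load(rows, repo):
    repo.record("load", target_table="staging_amounts", row_count=len(rows))
    return rows


def validate(rows, repo):
    good = [r for r in rows if r["amount"].isdigit()]
    repo.record("validate", check="amount is numeric", passed=len(good), rejected=len(rows) - len(good))
    return good


repo = MetadataRepository()
validated = validate(load(transform(extract(repo), repo), repo), repo)
stages = [r["stage"] for r in repo.records]
lineage_hits = repo.search("TXPYR_ST")  # finds the extract and transform records
```

Because the transformation rule is stored alongside the source column name, a search on the legacy name leads an analyst to the renamed target column.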
Metadata and Information Quality

System Metadata: physical properties, data movement, ETL/T, and workflow artifacts
  Source System Characteristics
    System properties
    File or table names
    Data element names and definitions
    Data types
    Transformation rules
    Cross-references
  Target System Properties
    Table names
    Column names
    Data types
    Indexes
    Partitions or table spaces
Contextual Metadata: attributes, references, and other searchable content
  Data Attributes
    Authoritative system
    Data element name and definition
    Availability
    Data type
    Join paths
    Legacy source reference
    User reviews
    Links to context-dependent data
  Publishing Standards
    Web-based
    Standard format
    Hierarchical and free-form search
Application Metadata: context-dependent logic, conditional rules, and dynamic processing
  Web-Based Logic
    Reports and roll-ups
    Lookup tables
    URLs and other links
    External communication
  Profiling
    Frequencies
    Statistical distributions
    Trend analysis
    Geographic maps
  Reviews
    User ID
    Table/column reference
    Feedback
Workforce Skills
Techniques used by IRS analysts
  Regression-based methods (GLM, logistic, quantile, non-linear, proportional hazards)
  Social network analysis, graph theory
  Machine learning (neural networks, SVMs, genetic algorithms)
  Multivariate statistical methods (discriminant analysis, clustering, density estimation, factor analysis)
  Simulation (Monte Carlo, MCMC, agent-based modeling)
  Decision trees (CART, CHAID, C5, hybrids)
  Bayes rules and other classifiers
  Variance estimation with complex samples
Workforce Skills
  Analysts:
    Use of advanced SQL techniques to avoid off-loading data for analytics (in-database computing)
    Understanding and leveraging Open Source tools
  IT Staff:
    Literacy in non-traditional computing architectures
    Support for Open Source tools and analytic databases
    Ability to quickly build and deploy analytic sandboxes
  This is different from typical BI/report/dashboard environments
    Emphasis on algorithms, not just information distribution
  Key is multi-disciplinary skills
    Nexus of statistics, computer science, economics, IT
Data Privacy and Security
  IRS analytics are done behind the firewall, but data still moves
    Data off-loaded to laptops, servers, sandboxes
    External access (Treasury, Congress, universities)
  Permissions management in shared disk environment
    Gets more complex with more users and data
  Security trade-offs and challenges
    Impact of system- and application-level policy changes
    How much continuous monitoring and auditing?
    FISMA and the documentation dilemma
    Relationship between encryption and performance