ADVANCED ANALYTICS AND FRAUD DETECTION THE RIGHT TECHNOLOGY FOR NOW AND THE FUTURE
Big Data
Big Data: What tax agencies are or will be seeing!
- Large and increasing data volumes
- New and emerging data types/sources
- New multi-structured data types with unknown relationships that require processing of data regardless of size to discover insights. Examples: web logs, sensor networks, social networks, text.
- Increased reporting requirements such as Merchant cards (Form 1099-K) and Cost Basis Reporting on Securities Sales (Form 1099-B)
Key Points
- Analyze all the data, not just random samples
- The need for fast processing to detect and prevent fraud
- Single repository of the data
More's Law (as in more data): we are now looking at zettabytes (1 zettabyte = 1 trillion gigabytes).
Big Data Challenges Are More Than Data Size: The Four Axes of Big Data
- CIOs face significant challenges in addressing the issues surrounding big data.
- New technologies and applications are emerging and should be investigated to understand their potential value.
Source: CEO Advisory: Big Data Equals Big Opportunity, Gartner, 31 March 2011.
Data in a Tax Agency: Structured and Unstructured Data. Examples: Audit Leads; Nexus; Payments; Seller/Retailer Data; Big Box Retailers/Corporations; Social Media
Data in a Tax Agency: Structured and Unstructured Data. Examples: Call Center Data; Web Logs; Audit Leads; Nexus; Payments; Work Papers; Customs Data; Case Notes; Correspondence & Emails
Leveraging Data for Taxpayer Education, Compliance and Service Enhancement
- Humans are social by nature; social media is just an enabler.
- Untapped social network data is EVERYWHERE!
  - Existing consumer/taxpayer transaction data and interaction data
  - You are not constrained to Twitter and Facebook feeds to obtain taxpayer (TP) behavior and/or data
- What if... you could determine, by applying text analytics, that a taxpayer who claimed no income in 2011 bought three motorcycles in 2011?
- What if... you could be notified that a taxpayer claimed on a blog, on Facebook, etc. that he cheated your tax department?
Statistical Modeling
- The most powerful method is to use statistical models to assess fraud risk.
- To build a predictive model, you need to identify some historical known cases.
- Clustering can also be used to find cases with similar characteristics. This won't predict fraud, but can identify unusual groupings of cases.
- Various modeling options exist:
  - Cluster analysis can help find cases that have similar profiles
  - Decision trees can help identify drivers of fraud and high risk cases
  - Response modeling can provide rankings on overall fraud risk
[Figure: scatter plot of Transactions versus Login Time, both axes from -1.5 to 1.5, showing clusters C1, C2 and C3]
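The clustering idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each case has been reduced to two standardized features (login time and transaction count, the axes of the figure). The sample data, the `kmeans` function name, the fixed starting centers, and the choice of k-means itself are assumptions made for the example, not the presenter's method.

```python
import math


def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then
    move each center to the mean of the points assigned to it."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: recompute each center as the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Hypothetical standardized (login_time, transactions) pairs.
cases = [
    (-1.2, -1.0), (-1.0, -1.1), (-0.9, -1.3),  # group near the lower left
    (0.1, 0.0), (0.0, 0.2), (-0.1, 0.1),       # group near the origin
    (1.1, 1.2), (1.3, 1.0), (1.2, 1.4),        # group near the upper right
]
# Deterministic starting centers (one case per region), chosen for the example.
labels, centers = kmeans(cases, centers=[cases[0], cases[3], cases[6]])
```

As the slide notes, the resulting groups do not predict fraud by themselves; an analyst would still need to inspect small or unusual clusters.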
One Analytic Data Solution
- Strategic & Operational Intelligence: Ad Hoc/OLAP, Predictive Analytics, Spatial/Temporal, Active Execution
- Big Data Insight: Pattern Analysis, Path Analysis, Graph Analysis
- SQL Analytics on the Teradata Integrated Data Warehouse (structured data): CRM, SCM, ERP, Trans, 3rd Party
- SQL-MapReduce Analytics on the Aster Data Analytic Platform (multi-structured data): Web logs, Text, Social media, Machine data
In-Database Analytic Processing Enabling Better, Faster Insight Reporting and OLAP Advanced Analytics Advanced Visualization Text Analytics Parallel Performance
Who is Teradata?
- Global Leader in Enterprise Data Warehousing
- Headquartered in Ohio; 9,200+ associates
- Analytic Solutions and Consulting Services
- The leader in Gartner's Leaders Quadrant since 1999
- U.S. publicly-traded software company: S&P 500 Member, listed NYSE: TDC
- Founded in 1979, public launch in 2007
- Global presence and world-class customer list: more than 1,300 customers, more than 2,500 installations
- 28 Federal and State partners
- Teradata Tax Team: deep tax domain (Compliance, Customer service, Business Intelligence)
- Extended Appliance Family, launched 2008: Simple, Powerful, Affordable!
GARTNER MAGIC QUADRANT DATA WAREHOUSE DBMS, 2012: Teradata is THE Leader and has been since 1999!
Source: Magic Quadrant for Data Warehouse Database Management Systems, Mark Beyer, Donald Feinberg, Merv Adrian, Roxanne Edjlali, 2/6/12
Teradata Workload-Specific Platform Family
- 560, Data Mart Appliance. Scalability: up to 12TB. Workloads: Test/Development or Smaller Data Marts
- 1650, Extreme Data Appliance. Scalability: up to 186PB. Workloads: Analytical Archive, Deep Dive Analytic
- 2690, Data Warehouse Appliance. Scalability: up to 315TB. Workloads: Strategic Intelligence, Decision Support System, Fast Scan
- 4600, Extreme Performance Appliance. Scalability: up to 18TB. Workloads: Operational Intelligence, Lower Volume, High Performance
- 66XX, Active Enterprise Data Warehouse. Scalability: up to 92PB. Workloads: Strategic & Operational Intelligence, Real Time Update, Active workloads
- Aster MapReduce Appliance. Scalability: up to 5PB. Workloads: Discovery Platform for Big Data Analytics with embedded SQL-MapReduce for new data types & sources
The Teradata Difference: Scalability Across Multiple Dimensions
- Teradata can scale simultaneously across multiple dimensions: driven by business!
- Competition scales one dimension at the expense of others: limited by technology!
- Dimensions: Data Volume (Raw, User Data), Workload Management, Query Concurrency, Data Freshness, Query Complexity, Query Freedom, Schema Sophistication, Query Data Volume
8/14/2012 Teradata Confidential
Teradata Database: The Foundation
Automatic built-in functionality and easy Set & Go optimization options
- Fast Query Performance: "Parallel Everything" design and the smart Teradata optimizer enable fast query execution across platforms
- Quick Time to Value: simple setup steps with automatic, hands-off distribution of data, along with integrated load utilities, result in rapid installations
- Simple to Manage: DBAs never have to set parameters, manage table space, or reorganize data
- Responsive to Business Change: fully parallel MPP shared-nothing architecture scales linearly across data, users, and applications, providing consistent and predictable performance and growth
- Powerful, Embedded Analytics: in-database data mining, virtual OLAP/cubes, and pre-built and custom application objects (User Defined Functions) drive efficient and differentiated business insight
- Advanced Workload Management: workload management options by user, application, time of day and CPU exceptions
- Intelligent Scan Elimination: Set & Go options reduce full file scanning (Primary, Secondary, Multi-level Partitioned Primary, Aggregate Join Index, Sync Scan)
Analytical Ecosystem: The Ecosystem Is The Warehouse
[Diagram showing Teradata platforms 560, 1650, 2650 and 66XX together with Aster Data SQL-MapReduce]
Teradata Aster: Unified Big Data Architecture for the Enterprise
- Users: Engineers, Data Scientists, Quants, Business Analysts
- Tools: Java, C/C++, Pig, Python, R, SAS, SQL, Excel, BI, Visualization, etc.
- Platforms: Discovery Platform; Integrated Data Warehouse; Capture, Store, Refine
- Data sources: Audio/Video, Images, Text, Web & Social, Machine Logs, CRM, SCM, ERP
Aster SQL-MapReduce: What Is It and Why Is It Important to In-Database Analytics?
- Patented framework for advanced analytics that are hard to define in SQL
  - Couples SQL (relational) with MapReduce (SQL-MapReduce); it's invoked from SQL and automatically parallelized
  - Includes a library of pre-packaged Analytic Modules
- Architecture for diverse, embedded analytics processing
  - Supports custom analytics written in a variety of languages, e.g. Java
- Combines SQL & visual tools
  - Makes MapReduce accessible from SQL/SQL-based tools (standard BI tools)
[Diagram: applications (App) accessing Aster Data nCluster through SQL and SQL-MapReduce]
Ease of Development and Reuse. Analytic Foundation: 50+ out-of-the-box modules
Modules:
- Path Analysis: discover patterns in rows of sequential data
- Statistical Analysis: high-performance processing of common statistical calculations
- Relational Analysis: discover important relationships among data
Business-ready SQL-MapReduce Functions:
- nPath: complex sequential analysis for time series analysis and behavioral pattern analysis
- Sessionization: identifies sessions from time series data in a single pass over the data
- Attribution: operator to help ad networks and websites to distribute credit
- Histogram: function to provide capability of generating
- Decision Trees: native implementation of parallel random forests
- Approximate percentiles and distinct counts: calculate percentiles and counts within specific variance
- Correlation: calculation that characterizes the strength of the relation between different data fields
- Regression: performs linear or logistic regression between an output variable and a set of input variables
- Averages: calculate moving, weighted, exponential or volume-weighted averages over a window of data
- Graph analysis: finds shortest path from a distinct node to all other nodes in a graph
- Tokenization: splits strings into individual words to assist text processing
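To make the Sessionization entry concrete, here is a plain-Python sketch of the underlying idea, not the Aster function itself (which is invoked from SQL). It assumes events are `(user_id, timestamp_in_seconds)` pairs and that a gap of more than 30 minutes starts a new session; the `sessionize` name, the timeout, and the sample data are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, timeout_seconds=1800):
    """Assign a session number to each (user_id, timestamp) event.

    A new session starts when the gap to the same user's previous event
    exceeds timeout_seconds. Returns (user_id, timestamp, session_id) tuples.
    """
    result = []
    # Sort by user, then by time, so each user's events are contiguous and ordered.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        session_id = 0
        previous = None
        for _, ts in user_events:
            if previous is not None and ts - previous > timeout_seconds:
                session_id += 1  # gap too long: open a new session
            result.append((user, ts, session_id))
            previous = ts
    return result


# Hypothetical web-log events for two taxpayers.
clicks = [("tp1", 0), ("tp1", 600), ("tp1", 5000), ("tp2", 100), ("tp2", 200)]
sessions = sessionize(clicks)
```

After the sort, each user's events are processed once in time order, which mirrors the "single pass over the data" description on the slide.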
Ease of Development and Reuse. Analytic Foundation: 50+ out-of-the-box modules
Modules:
- Text Analysis: derive patterns in textual data
- Cluster Analysis: discover natural groupings of data points
- Data Transformation: transform data for more advanced analysis
SQL-MapReduce Analytic Functions:
- Text Processing: counts occurrences of words, identifies roots, and tracks relative positions of words and multi-word phrases
- Text Partition: analyzes text data over multiple rows
- Levenshtein Distance: computes the distance between two words
- k-means: clusters data into a specified number of groupings
- Canopy: partitions data into overlapping subsets within which k-means is performed
- Minhash: buckets high-dimensional items for cluster analysis
- Basket analysis: creates configurable groupings of related items from transaction records in a single pass
- Collaborative Filter: predicts the interests of a user by collecting interest information from many users
- Unpack: extracts nested data for further analysis
- Pack: compresses multi-column data into a single column
- Antiselect: returns all columns except for specified column
- Multicase: case statement that supports row match for multiple cases
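The Levenshtein Distance entry refers to the standard edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal Python sketch of that textbook algorithm follows; it illustrates the measure only and says nothing about how the Aster function is implemented. The name-matching example is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using the classic dynamic program,
    keeping only the previous row of the table in memory."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical use: spotting near-identical names across filings.
distance = levenshtein("Jon Smith", "John Smith")
```

A small distance between two taxpayer names or addresses can flag records that may refer to the same entity despite typing differences.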
Aster SQL-MapReduce and Hadoop MapReduce
Aster SQL-MapReduce:
- Customized MapReduce deployed via SQL-MR and BI and visualization tools
- Easy to manage database
- 50+ packaged SQL-MapReduce analytics
- SQL, the language of business
- Integrated Development Environment (IDE)
Hadoop MapReduce:
- Customized MapReduce deployed via application code and people
- File system
- Batch processing
- Requires lots of coding
Aster SQL-MapReduce and Hadoop: example nPath query (Aster SQL-MapReduce)
    SELECT *
    FROM npath (
      ON ( )
      PARTITION BY sba_id
      ORDER BY datestamp
      MODE (NONOVERLAPPING)
      PATTERN ('(OTHER_EVENT FEE_EVENT)+')
      SYMBOLS (
        event LIKE '%REVERSE FEE%' AS FEE_EVENT,
        event NOT LIKE '%REVERSE FEE%' AS OTHER_EVENT)
      RESULT ( )
    ) n;
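For readers unfamiliar with nPath, the following Python sketch emulates what the query on this slide appears to do. It rests on several assumptions: the pattern is read as one or more repetitions of an OTHER_EVENT immediately followed by a FEE_EVENT; the SQL `LIKE '%REVERSE FEE%'` test is approximated by a case-sensitive substring check; and the function name `find_fee_paths` and the sample rows are hypothetical. It is an illustration of the idea, not Aster's implementation.

```python
import re
from collections import defaultdict


def find_fee_paths(rows):
    """rows: iterable of (sba_id, datestamp, event) tuples.
    Returns {sba_id: [matched event sequences]} for accounts with a match."""
    # PARTITION BY sba_id
    partitions = defaultdict(list)
    for sba_id, datestamp, event in rows:
        partitions[sba_id].append((datestamp, event))

    matches = {}
    for sba_id, events in partitions.items():
        events.sort(key=lambda pair: pair[0])  # ORDER BY datestamp
        # SYMBOLS: encode each row as 'F' (FEE_EVENT) or 'O' (OTHER_EVENT).
        symbols = "".join("F" if "REVERSE FEE" in e else "O" for _, e in events)
        # PATTERN (OTHER_EVENT FEE_EVENT)+ ; re.finditer returns
        # non-overlapping matches, as in MODE (NONOVERLAPPING).
        found = [
            [e for _, e in events[m.start():m.end()]]
            for m in re.finditer(r"(?:OF)+", symbols)
        ]
        if found:
            matches[sba_id] = found
    return matches


# Hypothetical event log.
rows = [
    (1, "2012-01-01", "PAYMENT"),
    (1, "2012-01-02", "REVERSE FEE"),
    (1, "2012-01-03", "PAYMENT"),
    (1, "2012-01-04", "REVERSE FEE"),
    (1, "2012-01-05", "LOGIN"),
    (2, "2012-01-01", "LOGIN"),
    (2, "2012-01-02", "PAYMENT"),
]
paths = find_fee_paths(rows)
```

The point of the slide stands either way: in SQL-MapReduce the same partition, order, and pattern logic is stated declaratively in one query rather than coded by hand.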
Teradata Aster Solutions
- Teradata Aster Software Only. Purpose: Complex, High Speed Analytics For Emerging Big Data. Scalability: Flexible. Sub Segment: Massively parallel software solution with embedded SQL-MapReduce analytics for new data types and sources
- Teradata Aster Cloud Edition. Purpose: Teradata Aster nCluster for Amazon Web Services, AppNexus, Dell's Data Cloud and Terremark. Scalability: Elastic. Sub Segment: On-demand extreme scaling with no downtime, always-on data cloud availability for high performance next-generation analytics for big data
- Aster MapReduce Appliance. Purpose: Integrated Discovery Platform. Scalability: Up to 5PB. Sub Segment: Embedded SQL-MapReduce analytics on Teradata hardware
Value Proposition: Comparing the Aster Appliance vs. Aster Software-Only
- Appliance: customer wants a ready-to-run integrated solution with Teradata Server Management and Teradata support
- Software-Only: customer wants to use commodity hardware or wants to run in the cloud
Who supports what (Appliance / SW-Only):
- Hardware: Teradata / Customer
- Software: Teradata / Teradata
- OS: Teradata / Customer
- Network: Teradata / Customer
- Set up: Teradata / Customer
- Issues: Teradata / Customer
Thank You!! What will you do differently TOMORROW? Questions??