Risk Solutions. Industrial Big Data Analytics Lessons from the trenches Flavio Villanustre LexisNexis Risk Solutions

Similar documents
The Value of Advanced Data Integration in a Big Data Services Company. Presenter: Flavio Villanustre, VP Technology September 2014

90% of your Big Data problem isn t Big Data.

Reinventing Business Intelligence through Big Data

How To Handle Big Data With A Data Scientist

The big data revolution

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Presented by: Aaron Bossert, Cray Inc. Network Security Analytics, HPC Platforms, Hadoop, and Graphs Oh, My

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Chapter 7. Using Hadoop Cluster and MapReduce

Manifest for Big Data Pig, Hive & Jaql

Get More Scalability and Flexibility for Big Data

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Best Practices for Hadoop Data Analysis with Tableau

Neelesh Kamkolkar, Product Manager. A Guide to Scaling Tableau Server for Self-Service Analytics

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

Machine Data Analytics with Sumo Logic

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

WHITE PAPER SPLUNK SOFTWARE AS A SIEM

From Spark to Ignition:

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Understanding the Value of In-Memory in the IT Landscape

Big Data With Hadoop

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

Data Refinery with Big Data Aspects

The Power of Predictive Analytics

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

A Brief Introduction to Apache Tez

Gradient An EII Solution From Infosys

White Paper. HPCC Systems : Introduction to HPCC (High-Performance Computing Cluster)

Hadoop Beyond Hype: Complex Adaptive Systems Conference Nov 16, Viswa Sharma Solutions Architect Tata Consultancy Services

Big Data. Fast Forward. Putting data to productive use

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

Alfresco Enterprise on Azure: Reference Architecture. September 2014

Data Mining, Predictive Analytics with Microsoft Analysis Services and Excel PowerPivot

Make the Most of Big Data to Drive Innovation Through Reseach

In-Memory Analytics for Big Data

InfiniteGraph: The Distributed Graph Database

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

What's New in SAS Data Management

The Power of Analysis Framework

Big Data Processing with Google s MapReduce. Alexandru Costan

Are You Ready for Big Data?

Cray: Enabling Real-Time Discovery in Big Data

ANALYTICS BUILT FOR INTERNET OF THINGS

Why is Internal Audit so Hard?

Are You Ready for Big Data?

Life Insurance & Big Data Analytics: Enterprise Architecture

Understanding traffic flow

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data and Healthcare Payers WHITE PAPER

TORNADO Solution for Telecom Vertical

Integrating Big Data into the Computing Curricula

Complex, true real-time analytics on massive, changing datasets.

Streaming Big Data Performance Benchmark for Real-time Log Analytics in an Industry Environment

ElegantJ BI. White Paper. The Enterprise Option Reporting Tools vs. Business Intelligence

Real Time Big Data Processing

Big Data Analytics - Accelerated. stream-horizon.com

Streaming Big Data Performance Benchmark. for

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

The Sierra Clustered Database Engine, the technology at the heart of

Large-Scale Test Mining

Innovate and Grow: SAP and Teradata

Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO. Big Data Everywhere Conference, NYC November 2015

High Performance Computing and Big Data: The coming wave.

Concept and Project Objectives

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

Hadoop Ecosystem B Y R A H I M A.

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Using In-Memory Computing to Simplify Big Data Analytics

From GWS to MapReduce: Google s Cloud Technology in the Early Days

A New Era Of Analytic

ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013

Treasury Inspector General Tax Administration (TIGTA)

Massive Cloud Auditing using Data Mining on Hadoop

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

PARALLELS CLOUD STORAGE

Detect & Investigate Threats. OVERVIEW

Web Traffic Capture Butler Street, Suite 200 Pittsburgh, PA (412)

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS

Why Big Data in the Cloud?

Smarter Analytics. Barbara Cain. Driving Value from Big Data

BIG DATA-AS-A-SERVICE

Moving Large Data at a Blinding Speed for Critical Business Intelligence. A competitive advantage

Technical white paper. R you ready? Turning big data into big value with the HP Vertica Analytics Platform and R

Big Data Processing: Past, Present and Future

How To Scale Out Of A Nosql Database

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

VMware vcenter Log Insight User's Guide

Transcription:

Industrial Big Data Analytics Lessons from the trenches Flavio Villanustre LexisNexis

Big Data funnel 2

HPCC Systems Big Data platform Open source distributed data processing and storage architecture Splits problems into pieces to be worked in parallel by commodity servers Data-centric dataflow language (ECL) and higher level abstractions (KEL/SALT/DSP) Big Data language brings the computing to the data SprayResult := RECORD,MAXLENGTH(2048) STRING dfuwuid; STRING externalname; STRING internalname; INTEGER size; END; SprayResult dospray(diralias l) := TRANSFORM STRING fname := l.name[1..length(l.name)-4]; STRING dsname := '~THOR::Patents::' + fname + '::XML'; SELF.externalName := l.name; SELF.size := l.size; SELF.internalName := dsname; SELF.dfuWUID := SprayFunction(l.name, dsname); END; ds1 := dirlist; OUTPUT(COUNT(ds1), NAMED('Files_2_Spray')); ds3 := NOTHOR(PROJECT(ds1, dospray(left))); OUTPUT(ds3,,'~THOR::Patents::SPRAY_Log::' + WORKUNIT); + = Scalable data processing and analytics for subject matter experts A model providing hierarchical abstraction for efficiency and high performance http://hpccsystems.lexisnexis.com/ 3

LexisNexis Technology: Our HPCC Architecture HPCC Platform Data Refinery (Thor) Data Delivery Engine (Roxie) Big Data ESP Analytical Reporting Enterprise Control Language (ECL) 4

LexisNexis Technology: Our HPCC Architecture Thor Massively Parallel Extract Transform and Load (ETL) engine Designed for data parallel execution. Leverages inexpensive locally attached storage and moves compute functions to the data. Programmable using a dataflow model where nodes are activities and edges represent datasets flowing. Enables data integration on a scale not previously available: Current LexisNexis core data linking process generates hundreds of billions of intermediate results at peak Ideal for: Massive joins/merges Large-scale sorts & transformations Model training over very large high dimensional corpus 5

LexisNexis Technology: Our HPCC Architecture Roxie A massively parallel, high throughput, structured query response engine Low latency queries with high fault tolerance Leverages distributed data and keys for scalable storage and retrieval Automatically exposes queries through SOAP, RESTful, JSON and JDBC interfaces Ideal for Highly concurrent multi-user applications Full text ranked Boolean search On-the-fly regression models Multi-dimensional and hierarchical real-time grouping and filtering of large datasets 6

LexisNexis Technology: Our HPCC Architecture ECL An easy to use, data-centric programming language optimized for large-scale data management and query processing Provides a dataflow programming model: activities represented as nodes and datasets flow through the edges Highly efficient: automatically distributes workload across all nodes. Compiles to C++. Abstracts out the underlying details of parallelism, data distribution, redundancy, etc. Large library of efficient modules to handle common data manipulation tasks 7

Application of HPCC: Insurance Collusion (1 of 2) Scenario This view of carrier data shows seven known fraud claims and an additional linked claim. The Insurance company data only finds a connection between two of the seven claims, and only identified one other claim as being weakly connected. 8

Application of HPCC: Insurance Collusion (2 of 2) Task After adding the Link ID to the carrier Data, LexisNexis HPCC technology then added 2 additional degrees of relative Family 1 Result The results showed two family groups interconnected on all of these seven claims. Family 2 The links were much stronger than the carrier data previously supported. 9

Case Study: Network Traffic Analysis in Seconds Scenario Conventional network sensor and monitoring solutions are constrained by inability to quickly ingest massive data volumes for analysis - 15 minutes of network traffic can generate 4 Terabytes of data, which can take 6 hours to process - 90 days of network traffic can add up to 300+ Terabytes Task Drill into all the data to see if any US government systems have communicated with any suspect systems of foreign organizations in the last 6 months - In this scenario, we look specifically for traffic occurring at unusual hours of the day Result In seconds, HPCC Systems sorted through months of network traffic to identify patterns and suspicious behavior 10 10

Application of HPCC: Beacon Analysis HPCC powers sophisticated near real time analytics to address rapidly evolving threats. Reading the chart Each bubble represents a group of communications between 2 points that occurred more than once. Suspicious activity would be to the far right with low standard deviation, i.e., highly regular at long intervals to reduce likelihood of detection. As threats get more sophisticated, the standard deviation will likely increase and other attributes will change as well. Design and Benchmark Testing Chart key Horizontal axis: Vertical axis: Bubble size: time on a logarithmic scale standard deviation (in hundredths) number of observed transmissions Sort 90 Billion Records in ~2 hours on 400 node systems Supported 14 dimensional queries with ~2 second response Provided scalable design to accommodate 300TB (90 days) of netflow data w/70 billion new records per day 11

Application of HPCC: Network Detection Show me all the traffic over last 6 months leaving the network going to systems known to be part of BotNets... 12

Case Study: Fraud in Medicaid Scenario Proof of concept for Office of the Medicaid Inspector Generation (OMIG) of large Northeastern state. Social groups game the Medicaid system which results in fraud and improper payments. Task Given a large list of names and addresses, identify social clusters of Medicaid recipients living in expensive houses, driving expensive cars. Result Interesting recipients were identified using asset variables, revealing hundreds of high-end automobiles and properties. Leveraging the Public Data Social Graph, large social groups of interesting recipients were identified along with links to provider networks. The analysis identified key individuals not in the data supplied along with connections to suspicious volumes of property flipping potentially indicative of mortgage fraud and money laundering 13 13

Scalable Automated Linking Technology (SALT) 42 Lines of SALT A Highly Efficient Programming Process Domain Specific Language compiles to ECL Provides automated data profiling, parsing, cleansing, normalization and standardization Sophisticated specificity based linking and clustering 3,980 Lines of ECL Data Sources Profiling Parsing Cleansing Normalization Standardization Data Preparation Processes (ETL) Matching Weights and Threshold Computation Blocking/ Searching Weight Assignment and Record Comparison Record Match Decision Linked Data File 482,410 Lines of C++ Record Linkage Processes Additional Data Ingest Linking Iterations 14

Scalable Automated Linking Technology (SALT) Calculates record matching field weights based on term specificity and matching weights o What is the chance that two records for John Smith refer to the same person? How about Flavio Villanustre? It also takes into account transitive relationships o o o What if these two records for John Smith were already linked to Flavio Villanustre? How many John Smiths does Flavio Villanustre know? Are these two John Smith records referring to the same person now? 15

SALT: Business Linking 4 We ve patented a unique, multi-dimensional linking model for businesses 5 123 Main Street 456 Oak Street 789 Maple Street ABC ABC 2 Realty Bakery 2 1 ABC Bakery ABC Realty ABC Bakery 1 1 1 1 2 ABC Realty 1 3 ABC Delivery 1 ABC Delivery 1 3 3 ABC Delivery 1 1 A business entity at an address (e.g., the red dot at one building/address) 2 The legal entity (combination of all three red dots across all three addresses) 3 A collection of separate legal entities that appear to be related that are located at an address. 4 The combination of the related legal entities (all red, blue, and gray dots across all addresses) 5 The roll-up of combined org entities into a larger entity Business Extended Family 16

Large And Scale Do Aggregated You Know & Structured Group Behavior Analysis Measuring the context of a single data-point within the ever changing stream of economies, geographies, social group, crisis, time and more. Grouping Mechanisms Social, Business, Stereotypical, Time/Crisis Aggregated Group Behavior average age, gender, income within a social group # houses owned in a social group # customers within a social group # home owners in a neighbourhood in default # derogatory risk within a social group Structured Group Behavior # properties sold between people in the same social group # sold from a social group where the next owner resulted in 1 st payment default # prescriptions within a social group where prescriber and patient are in the same social group # people who have moved between the same neighborhoods within the same time frame # change over time of derogatory risk within a social group Graph Analysis Overview Address Risk Analysis Geo-Social Analysis 17 17

Graph And Analysis Do You Overview Know Social Graph Shared Historical Addresses 12 Billion + 4 Billion Relationships Shared Assets (Property, Vehicles, Watercraft, etc.) 700 million + Property Deeds Shared Business Ownership 1.5 Billion + 18 18

Graph And Analysis Do You Financial Know Services Customer Example #1 Background Financial Services company hypothesized that organized groups were targeting them and desired to tackle bust-out fraud at a social level. Although some of the risk was not individually large at the account level, the company wanted to ascertain where that risk was growing socially without them realizing that those accounts were connected to each other. Data and Analysis Provided LexisNexis Risk with a 5 million accounts flagged with a combination of active, known fraud, charge offs and preemptively closed tags. For all 5 million accounts, investigate every historical address, every asset and every business for the person attached to the account to find all the people they connect to; then for those people do the same to find the people that they connect to and count the number of those consumers that are also on the original customer list and are either active accounts or known fraud. LexisNexis Approach Address Standardization and LexID the input. Join the linked input to the LexisNexis Social Graph (5 Million input to 4 Billion relationships). Calculate grouped social aggregates (how many of each type of flagged account is within each account s network neighborhood). Score and rank order the result. 19 19

Graph And Analysis Do You Results Know Overview Known Fraud Charge Off Preemptive close # within 1 degree #accounts 0 4055099 1 39266 2 638 3 31 4 2 # within 2 degrees #accounts 0 4079930 1 14894 2 212 31 Accounts associated with 3 known fraud accounts (one of which could be the known fraud itself) 212 Accounts associated with 2 Charge off accounts 3 Accounts associated with 15 Preemptive Closes (one of which could be a preemptive close itself) 20 20 20

Graph And Analysis Do You Account Know Overview Aggregated Group Behavior variables for a single account Flatten the graph Calculate Aggregate Group Behavior measurements Drive predictive analytics at a granular level using graph variables Small Network Fairly tightly connected to each other Active account with 3 active accounts within the network 9 accounts within that network that are known fraud 1 account within the network that has been preemptively closed 21 21 21

Graph And Analysis Do You Account Know Network Visualization 22 22

Geo-Social And Do Analysis You Know Use Case #1 Background Consumer applies for store credit at a store 600 miles from their home. Consumer ships to an address not associated with their identity and it is a significant distance from their current address. Patient fills a prescription outside of their normal geographic location. Tax payer files a refund request from an address not associated with the identity. Geo-social footprint of a person visualized These events tend to generate high levels of false positives. Geo-Social Filtering Create a complex distance calculation from the event location to the location of the relatives and associates within 2 degrees of the consumer. Filter out any events that are within a threshold distance of the consumer or one of their relatives and associates. Powerful method of reducing false positives when calculating risk based on geographic distance of an event to the consumer. 23 23 23

Geo-Social And Do Analysis You Know Use Case #2 Background Organized network commits series of crimes across the country remaining largely untraceable to traditional detection approaches. Syndicates using family members to steal identity information across the United States. Tax Fraud syndicate ships fraudulent tax refunds to addresses near their residences Geo-Social Matching Create a complex distance calculation from the crime location to the location of every recent address for 270 million active identities. Calculate Aggregate Group Behavior variables to rank order which social groups are closely tied geo-socially to the crime locations. Powerful method of leveraging large data and graph analysis to tackle organized crime. 24 24 24

Resources Open Source HPCC Systems Platform: http://hpccsystems.com Free Online Training: http://learn.lexisnexis.com/hpcc SALT: http://hpccsystems.com/salt KEL: https://hpccsystems.com/download/free-modules/kel-lite Machine Learning portal: http://hpccsystems.com/ml The HPCC Systems blog: http://hpccsystems.com/blog Community Forums: http://hpccsystems.com/bb Source Code: https://github.com/hpcc-systems WIKI: https://wiki.hpccsystems.com 25