90% of your Big Data problem isn t Big Data.



Similar documents
Get More Scalability and Flexibility for Big Data

Reinventing Business Intelligence through Big Data

The Value of Advanced Data Integration in a Big Data Services Company. Presenter: Flavio Villanustre, VP Technology September 2014

Discovering Business Insights in Big Data Using SQL-MapReduce

Sample Report: LexisNexis RiskView Report

Risk Solutions. Industrial Big Data Analytics Lessons from the trenches Flavio Villanustre LexisNexis Risk Solutions

What's New in SAS Data Management

Analance Data Integration Technical Whitepaper

Big Data and Natural Language: Extracting Insight From Text

Pentaho High-Performance Big Data Reference Configurations using Cisco Unified Computing System

Analance Data Integration Technical Whitepaper

Deploying an Operational Data Store Designed for Big Data

Industry insight into FCRA-compliance and its benefits to healthcare organizations.

Offload Enterprise Data Warehouse (EDW) to Big Data Lake. Ample White Paper

Data makes all the difference.

Running HPCC in a Virtual Machine. Boca Raton Documentation Team

White Paper. High Value Data and Analytics: Building a Platform for Growth

Build a Streamlined Data Refinery. An enterprise solution for blended data that is governed, analytics-ready, and on-demand

Beyond the Data Lake

Luncheon Webinar Series May 13, 2013

Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO. Big Data Everywhere Conference, NYC November 2015

Solution White Paper Connect Hadoop to the Enterprise

Information Architecture

Ganzheitliches Datenmanagement

LexisNexis Insurance Solutions User Guide Interactive/Online Order Processing

The Principles of the Business Data Lake

Data Governance for Regulated Industries

HadoopTM Analytics DDN

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

SEIZE THE DATA SEIZE THE DATA. 2015

How To Use Hp Vertica Ondemand

Big Data at Cloud Scale

IBM System x reference architecture solutions for big data

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

LexisNexis Provider FAQs

IBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:

High-Performance Business Analytics: SAS and IBM Netezza Data Warehouse Appliances

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Powerful Duo: MapR Big Data Analytics with Cisco ACI Network Switches

Agile Business Intelligence Data Lake Architecture

Understanding Your Customer Journey by Extending Adobe Analytics with Big Data

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

BIG DATA & ANALYTICS. Transforming the business and driving revenue through big data and analytics

HP and Business Objects Transforming information into intelligence

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

CA Technologies optimizes business systems worldwide with enterprise data model

Big Data Analytics Nokia

How to Run a Successful Big Data POC in 6 Weeks

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

Stopping the Flow of Health Care Fraud with Technology, Data and Analytics

Solving the Problem of Data Silos: Process and Architecture

A Business Case for Fixing Provider Data Issues

Integrated archiving: streamlining compliance and discovery through content and business process management

Integrated Big Data: Hadoop + DBMS + Discovery for SAS High Performance Analytics

ORACLE FINANCIAL SERVICES ANALYTICAL APPLICATIONS INFRASTRUCTURE

BIG DATA: FIVE TACTICS TO MODERNIZE YOUR DATA WAREHOUSE

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Advanced In-Database Analytics

CONNECTING DATA WITH BUSINESS

Iris KYC. streamline your client on-boarding process

Semarchy Convergence for Data Integration The Data Integration Platform for Evolutionary MDM

An Enterprise Data Hub, the Next Gen Operational Data Store

White Paper. HPCC Systems : Introduction to HPCC (High-Performance Computing Cluster)

Informatica Data Quality Product Family

Parallel Data Warehouse

HPCC Data Handling. Boca Raton Documentation Team

Converged, Real-time Analytics Enabling Faster Decision Making and New Business Opportunities

SAP Predictive Analytics: An Overview and Roadmap. Charles Gadalla, SESSION CODE: 603

Harnessing the power of advanced analytics with IBM Netezza

Scalable Enterprise Data Integration Your business agility depends on how fast you can access your complex data

With DDN Big Data Storage

Digital Messaging Platform. Digital Messaging Platform. AgilityHarmony. Orchestrate more meaningful relationships between you and your customers

Managing the Product Value Chain for the Industrial Manufacturing Industry

BANKING ON CUSTOMER BEHAVIOR

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

SQL Server 2005 Features Comparison

SOLUTION WHITE PAPER. Remedyforce Powerful Platform

Extend your analytic capabilities with SAP Predictive Analysis

Provide access control with innovative solutions from IBM.

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

Integration Maturity Model Capability #5: Infrastructure and Operations

An Oracle White Paper January Access Certification: Addressing & Building on a Critical Security Control

In-Database Analytics

Transcription:

White Paper 90% of your Big Data problem isn t Big Data. It s the ability to handle Big Data for better insight. By Arjuna Chala Risk Solutions HPCC Systems

Introduction LexisNexis is a leader in providing essential information that helps customers across all industries and government verify identity, assess risk, and predict and detect fraud. To manage, sort, link, and analyze billions of records within seconds, LexisNexis designed a data-intensive supercomputer based on high performance cluster computing (HPCC) technology. The supercomputer, called HPCC Systems, has been proven for the past ten years with customers who need to process large volumes of data in critical 24/7 environments. Customers such as leading banks, insurance companies, utilities, law enforcement and federal government, leverage the HPCC Systems platform technology through various LexisNexis products and services. Originally conceived in response to the need of LexisNexis to manage its own big data challenges, the HPCC Systems platform helped LexisNexis Risk Solutions scale to a $1.4 billion information solutions division, and has evolved to become a mainstream big data analytics system leveraged across many industries. HPCC Systems Introduction The conceptual vision for this computing platform is depicted in the figure below. The HPCC Systems platform was built specifically to analyze large volumes of data in minutes rather than days or months to solve complex problems. The platform can deliver the data and reports to more people and provide fast reporting. Since only small development teams are needed this reduces investment in large teams and keeps processes agile. The HPCC Systems platform has three distinguishing factors that make it an effective choice for big data analytics and processing: The HPCC Systems Data Refinery engine (Thor) helps clean, link, transform and analyze Big Data. Thor supports ETL (Extraction, Transformation and Loading) functions like ingesting unstructured/structured data out, data profiling, data hygiene, and data linking out of the box. In addition, Thor supports flexible record oriented data structures. The HPCC Systems Data Delivery engine (Roxie) provides highly concurrent and low latency real time query capability. The Thor processed data can be accessed by large number of users concurrently in real time fashion using the Roxie. The Roxie queries are typically complex and could include embedded rules logic. The Data Scientist friendly programming language, called Enterprise Control Language (ECL), is used to program both the data processing jobs on Thor and the queries on Roxie. ECL is a declarative, implicitly parallel and data flow oriented programming language that abstracts complex data processing tasks by providing a simple programming interface. 2

Big Data Open Source HPCC Systems and ETL The Problem The typical data ingest process where data is gathered from multiple sources; loaded, cleaned, linked and analyzed is both complex and resources intensive. In the LexisNexis data factory, it has been observed that 60-80% of the effort in trying to analyze or mine data happens in the ETL process. This is where data is ingested from several thousand data sources daily and transformed into cleaned data that can then be used for data mining. Traditional methods of ETL using drag and drop user interfaces fail to address the needs around raw data that is constantly in flux. Manual intervention and remapping of processes would be needed to account for the changes in the data. 3

The Solution The HPCC Systems ETL Platform Multiple raw input files for a source Single Clean Base File THOR Data Sources Landing Zone SALT Data Profiling ETL Data Fabrication Automated ETL Data Profiling Data Hygiene Data Linking Monitoring Built-in Workflow ORBIT Monitoring and Workflow The HPCC Systems ETL platform is built to be flexible and scalable at the same time. The underlying ECL programming language and built in components provides the foundational tooling to build fully automated ETL jobs that can quickly ingest new data and adjust to changes in schema. The key components are described below: THOR Thor (the Data Refinery Cluster) is responsible for consuming vast amounts of data, transforming, linking and indexing that data. It functions as a distributed file system with parallel processing power spread across the nodes. A cluster can scale from a single node to thousands of nodes. ECL ECL (Enterprise Control Language) is the powerful declarative programming language that is ideally suited for the manipulation of Big Data. The declarative nature of the language makes it well suited for a Data Scientist. 4

SALT SALT is an acronym for Scalable Automated Linking Technology. It is a programming environment support tool which functions as an ECL code generator to automatically produce ECL code for a variety of data integration applications, addressing common data processing tasks: data ingest data profiling data hygiene record linking and clustering data source consistency monitoring file comparison to determine delta changes between versions of a data file data parsing and classification ORBIT The ORBIT component provides the monitoring, rules and workflow management. ORBIT monitors all the jobs and reports on job execution status (start, in process, success and failure). Built in alerting will help manage the jobs that fail and notify people to take action. A complete audit trail helps ETL developer quickly identify and fix issues. In addition, ORBIT ensures that all incoming data meets field level criteria s set using rules. 5

For more information: Call 877.316.9669 or visit http://hpccsystems.com About HPCC Systems HPCC Systems from LexisNexis Risk Solutions offers a proven, data-intensive supercomputing platform designed for the enterprise to process and deliver Big Data analytical problems. As an alternative to Hadoop and mainframes, HPCC Systems offers a consistent data-centric programming language, two processing platforms and a single architecture for efficient processing. Customers, such as financial institutions, insurance carriers, insurance companies, law enforcement agencies, federal government and other enterprise-class organizations leverage the HPCC Systems technology through LexisNexis products and services. For more information, visit http://hpccsystems.com. About LexisNexis Risk Solutions LexisNexis Risk Solutions (www.lexisnexis.com/risk/) is a leader in providing essential information that helps customers across all industries and government predict, assess and manage risk. Combining cutting-edge technology, unique data and advanced scoring analytics, we provide products and services that address evolving client needs in the risk sector while upholding the highest standards of security and privacy. LexisNexis Risk Solutions is part of Reed Elsevier, a leading publisher and information provider that serves customers in more than 100 countries with more than 30,000 employees worldwide. LexisNexis and the Knowledge Burst Logo are registered trademarks of Reed Elsevier Properties Inc., used under license. HPCC Systems is a registered trademark of LexisNexis Risk Data Management Inc. Copyright 2012 LexisNexis. All rights reserved. NXR0-0 0712