VelociData: Solving the Need for Speed in DataOps. Inside Analysis / Bloor Group Briefing, June 13, 2014


Transforming Speed and Economics of Data Operations to Achieve Time-Bound Service Levels, Gain a Wire-Speed Analytics Advantage, and Reduce the Cost and Complexity of BI/Analytics Architectures

VelociData is a purpose-built, data-operations micro-supercomputing appliance that is orders of magnitude faster, more scalable, and more cost-effective than conventional approaches to data transformation, data quality, data encryption, and data sorting.

VelociData Hyper-Acceleration Data Operations Hubs: Solving Data Integration Challenges for the Enterprise

vfusion = ETL/ELT function offload
- Addressing performance challenges
- Improving data quality on ingest
- Avoiding the cost of ETL and analytics platform build-outs

zfusion = vfusion + mainframe data and data operations offload
- Reducing mainframe MSU-related charges and deferring upgrades
- Improving data operations performance
- Migrating and converting mainframe data for new analytics architectures (e.g., Big Data)

VelociData Solution Palette

The slide's table groups solutions under the vfusion and zfusion suites and compares throughput; results are system dependent but intended to provide a magnitude comparison.

| VelociData Solution | Example | Conventional (records/s) | VelociData (records/s) |
| Lookup and Replace | Data enrichment by populating fields from a master file, dictionary translations, etc. (e.g., CP → Cardiopulmonologist) | 3,000-6,000 | 600,000 |
| Type Conversions | XML → Fixed; Binary → Char; date/time formats | 1,000-2,000 | 800,000 |
| Format Conversions | Rearrange, add, drop, merge, split, and resize fields to change layouts | 1,000-10,000 | 650,000 |
| Key Generation | Hash multiple field values into a unique key (e.g., SHA-2) | 3,000-20,000 | > 1,000,000 |
| Data Masking | Obfuscate data for non-production uses: persistent or dynamic; format preserving; AES-256 | 500-10,000 | > 1,000,000 |
| USPS Address Processing | Standardization, verification, and cleansing (CASS certification in process) | 600-2,000 | 400,000 |
| Domain Set Validation | Validate a value against a list of acceptable values (e.g., all product codes at a retailer; all countries in the world) | 1,000-3,000 | 750,000 |
| Field Content Validation | Validates based on patterns such as emails, dates, and phone numbers | 1,000-3,000 | > 1,000,000 |
| Field Content Validation | Data type validation and bounds checking | 3,000-6,000 | > 1,000,000 |
| Accelerated Data Sort | Sort data using complex sort keys from multiple fields within records | 7,000-20,000 | 1,000,000 |
| Mainframe Data Conversion | Copybook parsing and data layout discovery; EBCDIC, COMP, COMP-3 → ASCII, Integer, Float | 200-800 | > 200,000 |
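To make the Lookup and Replace row concrete, here is a minimal software sketch of dictionary-based field enrichment (e.g., expanding CP to Cardiopulmonologist). It is purely illustrative: VelociData performs this operation in hardware at wire rate, and the field names, file names, and dictionary contents below are hypothetical.

    # Illustrative lookup-and-replace enrichment; file and field names are hypothetical.
    import csv

    # Master-file dictionary: code -> expanded value (e.g., CP -> Cardiopulmonologist).
    with open("specialty_codes.csv", newline="") as f:
        lookup = {row["code"]: row["description"] for row in csv.DictReader(f)}

    with open("claims_in.csv", newline="") as src, \
         open("claims_out.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for record in reader:
            # Replace the code with its dictionary translation; keep unknown codes as-is.
            record["specialty"] = lookup.get(record["specialty"], record["specialty"])
            writer.writerow(record)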

Common ETL Bottlenecks

[Diagram: data sources (CSV, mainframe, XML, RDBMS, social media, sensors, Hadoop) feed ETL servers running Tasks #1 through #8, which load Hadoop, data warehouses, database appliances, BI tools, cloud targets, and staging databases. The transform tasks on the ETL servers are the candidates for acceleration.]

Example ETL Processes Offloaded to VelociData

[Diagram: the same extract-transform-load flow, with several of the numbered tasks offloaded from the ETL servers to VelociData.]
- Keep existing input interfaces
- Remove bottlenecks
- Reduce ETL server workload
- Faster total processing time

vfusion Data Operations Acceleration Hub: Wire-rate transformations enable fast data access between systems

- ETL server: preprocess data for fast movement into and out of data integration tools
- Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed, quality data for real-time BI efforts
- MPP platforms (e.g., Netezza): format and improve data for ready insertion into data analytics architectures; VelociData enables real-time data access by Netezza for real-time analytics

zfusion Data Operations Acceleration Hub: Wire-rate transformations enable fast data access between systems

- ETL server: preprocess data for fast movement into and out of data integration tools
- Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed, quality data for real-time BI efforts
- MPP platforms (e.g., Netezza): format and improve data for ready insertion into data analytics architectures; VelociData enables real-time data access by Netezza for real-time analytics
- Mainframe: conversion into and out of EBCDIC and packed decimal formats

Live Demonstration: And now for something completely different...

Examples of New World Data Challenges Being Solved

- A property & casualty insurer shortens by 5x a daily task of processing 540 million records, enabling more accurate real-time quoting
- A retailer now integrates full customer data from in-store, online, and mobile sources in real time, processing 40,000 records/s (up from 100/s)
- A pharmaceutical discovery query is reduced from 8 days to 20 minutes
- A health benefits provider shortens a data integration process from 16 hours to 45 seconds to enable better customer support
- A logistics firm standardizes USPS addresses at 10 billion per hour for data cleansing on ingest
- A manufacturer sorts billions of records with multi-field keys at nearly 1M records/s for analytics and data quality
- A credit card company reduces mainframe costs and improves analytics performance by integrating historical and fresh data into Hadoop at line rates
- A financial processing network masks 5 million fields/s of production data to sell opportunity information to retailers

For More Information, Please Visit: velocidata.com

Thank You!

Questions?

Additional Slides

Today's Update from VelociData
- Fast access to real-time data
- Addressing total-flow bottlenecks
- VelociData's solutions
- New capabilities
- Live demo
- Questions

Enabling Three Layers of Data Access: Wire-rate transformations and convergence of fresh and historical data

[Diagram: sources such as sensors, weblogs, transactions, mainframe, Hadoop, social media, and RDBMSs flow through VelociData into three access layers.]
- VelociData enables real-time data access for immediate analytics and visualization
- VelociData feeds databases and warehouses pre-analytic, aggregated data for operational analytics
- VelociData delivers Hadoop pre-processed, quality data to keep the lake clean

Accessing Real-time and Historical Data

[Chart: a progression of analytics capability toward business excellence.]
- Real-time analysis for competitive advantage: enabling the speed of business to match business opportunities
- Real-time operational analytics
- Integrating historical data for operational excellence
- Informing traditional BI with real-time inputs
- Conventional batch-oriented BI
- Iterative modeling

How We Achieve Orders-of-Magnitude Improvement in Cost-Performance

VelociData Data Operations Appliance
- Micro-supercomputing appliance in a 4U rack form factor
- Heterogeneous system architecture that includes FPGAs, GPUs, and CPUs, with internal parallelism that dramatically outperforms general-purpose computers
- Purpose-built solutions that combine software, firmware, and massively parallel hardware to achieve acceleration approaching wire rates

Pricing and terms
- Low-risk subscription terms without upfront license fees
- Subscription includes a high-availability production system, Q/A system, disaster recovery system, unlimited usage, maintenance, support, and updates
- Fixed fee for initial installation and services

Architecture Price/Performance/Functionality Criteria

Criteria
- Transformation complexity: format conversions, intense lookups, SID generation
- Latency SLAs: limited batch windows, low latency / real time
- Data volumes: large data sets (e.g., tens of millions of records), high transaction volumes (e.g., hundreds of thousands of txns/sec)

BI/Analytics architecture choices
- ILM, feature-rich integration, data quality, MDM, PIM, IR, governance

Outcome: use the right tool for the job.

Parallelism in IT Processing is Compelling: Amdahl's Law (a worked example follows below)

High-performance computing history
- Systems were expensive
- Unique tools and training required
- Scaling performance is often sub-linear
- Issues with timing and thread synchronization
- HPC has struggled for 40 years to deliver widespread accessibility, mostly due to cost and poor abstraction, development tools, and design environment

If we could just deliver accessibility at an affordable cost...
- Hardware is now becoming inexpensive
- Application development improvements still needed to enable productivity
- ✓ Abstract through implementation of streaming as the paradigm
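As context for the sub-linear scaling point above, Amdahl's Law bounds the speedup from parallelism by the fraction of work that stays serial. The statement and the illustrative numbers below are standard textbook values, not figures from the slide:

    S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad
    S(100)\Big|_{p = 0.95} = \frac{1}{0.05 + 0.0095} \approx 16.8, \qquad
    \lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 20

Here p is the parallelizable fraction of the work and n the number of identical processing elements: even with 95% of the work parallelized, 100 cores deliver under 17x, and no number of cores exceeds 20x. This is why simply adding identical cores scales sub-linearly and why the deck emphasizes streaming and purpose-built hardware instead.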

Complementary Approach: Heterogeneous System Architecture

- Leverage a variety of compute resources: not just parallel threads on identical resources
- Right resources at the right times: functional elements use appropriate processing components where needed
- Accommodate stream processing: source → processing → target; a streaming data model enables pipelining and data-flow acceleration (see the sketch below)
- Embrace fine-grained pipeline/functional parallelism, especially data/direct parallelism
- Separate latency and throughput
- Engineered system: manage thread, memory, and resource timing and contention
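A minimal sketch of the streaming/pipelining idea using plain Python generators. It only illustrates the data-flow model (each stage consumes records as they arrive instead of materializing a full batch); it is not how the appliance is implemented, and the stage logic, record layout, and file names are hypothetical.

    # Illustrative streaming pipeline: source -> processing stages -> target.
    # Stage logic, record layout, and file names are hypothetical.

    def read_source(path):
        with open(path) as f:
            for line in f:                        # records flow one at a time
                yield line.rstrip("\n").split("|")

    def validate(records):
        for rec in records:
            if len(rec) == 3 and rec[0]:          # drop malformed records in flight
                yield rec

    def transform(records):
        for rec in records:
            yield [rec[0].strip().upper(), rec[1], rec[2]]

    def load(records, path):
        with open(path, "w") as out:
            for rec in records:
                out.write("|".join(rec) + "\n")

    # Stages are composed into one pipeline; nothing is buffered beyond a record,
    # so downstream work overlaps with upstream I/O (pipeline parallelism).
    load(transform(validate(read_source("input.psv"))), "output.psv")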

Offloading ELT Work from DW/Analytics Systems

- Offload from Teradata, Netezza, SAP, and other distributed platforms or appliances
- ETL tasks often migrate into these systems because of available capacity and performance problems (as opposed to true infrastructure design); sometimes as much as 80% of the target is used to perform the T (a quick illustration follows below)
- Performing cleansing, transformation, and sort in the load step can offload a tremendous amount of push-down ETL work: land clean, properly formatted data in the initial load
- Misplaced workload can be right-placed to a purpose-built system
  - Improves overall workflow performance
  - Future-proofs the architecture for increasing data volumes and variety
  - Recovers target resources, freeing them up to do what they were built to do
  - Often lowers total costs
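A back-of-the-envelope reading of the 80% figure above (an illustration of the argument, not a benchmark): if 80% of a warehouse appliance's capacity is consumed by transformation work, only 20% remains for the queries it was bought to serve, so moving the T off the box frees roughly

    \frac{1}{1 - 0.8} = 5\times

as much capacity for analytics, before counting any speedup of the transformations themselves.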

zfusion: Mainframe Function and Data Offload

Use cases
- BI applications such as operational analytics and dashboards
- Reduce mainframe MSU/MIPS costs associated with data processing
- Joins / aggregations / deduplication outside of the DB
- Data quality / filtering / masking
- Data movement enabling access to mainframe data (especially to Big Data)
- Basel II, SOX, and other regulatory reporting
- Mobile applications

Unique capabilities
- Automatic COBOL copybook processing and layout generation
- Data conversion between mainframe and x86 formats at mainframe speed (see the sketch below)
- Mainframe processing offload (transform, sort, mask, data quality, etc.) at line rate

Some customer characteristics
- Volatile/valuable mainframe data usage that creates unpredictable demands (e.g., financial services)
- Competitive advantage based on time to insight (e.g., retail, CPG)
- Highly mainframe-TCO-conscious, with misplaced data integration workloads (e.g., insurance)
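To show what mainframe-to-x86 data conversion involves, here is a minimal software sketch that decodes one EBCDIC text field and one COMP-3 (packed decimal) field from a fixed-layout record. The record layout and values are hypothetical, and the appliance does this from the COBOL copybook in hardware rather than with hand-written code like this.

    # Illustrative decode of one fixed-layout mainframe record (layout is hypothetical):
    #   bytes 0-9   customer name, EBCDIC text
    #   bytes 10-14 account balance, COMP-3 packed decimal, two implied decimal places

    def unpack_comp3(raw: bytes, scale: int = 2) -> float:
        """Decode IBM packed decimal: two BCD digits per byte, sign in the last nibble."""
        nibbles = []
        for b in raw:
            nibbles.append((b >> 4) & 0x0F)
            nibbles.append(b & 0x0F)
        sign = nibbles.pop()                      # 0xD means negative; 0xC/0xF positive
        value = int("".join(str(n) for n in nibbles))
        return (-value if sign == 0x0D else value) / (10 ** scale)

    # One 15-byte sample record: "JOHN SMITH" in EBCDIC, then 12345.67 as COMP-3.
    record = bytes.fromhex("D1D6C8D540E2D4C9E3C8") + bytes.fromhex("001234567C")

    name = record[0:10].decode("cp037").strip()   # EBCDIC -> Unicode/ASCII text
    balance = unpack_comp3(record[10:15])
    print(name, balance)                          # JOHN SMITH 12345.67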

Example Mainframe-to-Hadoop Workflow

Mainframe Input → Validation → Key Generation → Formatter → Lookup → Address Standardization → CSV Out

- Simple, configuration-driven workflow
- The sample shows Mainframe → HDFS
- Along the way, data are validated, cleansed, reformatted, enriched, and so on
- Land analytics-ready data as fast as it can move across the wire
- The workflow can also work in reverse to return processed data to the mainframe

VelociData: Continuous Innovation

1Q14
- Accelerated sorting operation
- FTP-driven workflows
- Data routing

2Q14
- Custom mainframe type support
- Aggregation / dedupe
- Transformation enhancements

2H14
- Pre-analytics statistics calculations
- Expression transformation
- Data JOIN
- Enhanced user interface
- Platform update

Customer quote: "What is unique about VelociData is you can prove the story quickly."

Heterogeneous System Architecture

- Standard CPUs: general purpose, not bad at anything; good branch prediction and fast access to large memory
- Graphics boards (GPUs): thousands of cores performing very specific tasks; excellent at matrix and floating-point work
- FPGA coprocessors: fully customizable, with extreme opportunities for parallelism; excel at bit manipulation for regex, cryptography, searching, etc.

Stream Processing AND Hadoop: Leveraging stream processing with batch-oriented Hadoop

- Access to more data for analytics
- Process data on ingest (also land raw data if desired): transformation, cleansing, security
- Never read a COBOL copybook again
- Stream sort for integrating data, aggregation, and dedupe (see the sketch below)
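As a small illustration of the stream sort / dedupe point, the sketch below merges several already-sorted input streams and drops duplicate keys on ingest without holding the full data set in memory. It only illustrates the streaming pattern; the keys, file names, and formats are hypothetical, and the appliance's own implementation is not shown.

    # Illustrative streaming merge + dedupe over pre-sorted inputs (hypothetical files).
    import heapq

    def keyed_records(path):
        """Yield (key, line) pairs from a file already sorted by its first field."""
        with open(path) as f:
            for line in f:
                yield line.split("|", 1)[0], line

    streams = [keyed_records(p) for p in ("pos_sorted.psv", "web_sorted.psv", "mobile_sorted.psv")]

    with open("customers_merged.psv", "w") as out:
        last_key = None
        # heapq.merge keeps the combined stream globally sorted by key.
        for key, line in heapq.merge(*streams, key=lambda kv: kv[0]):
            if key != last_key:                   # dedupe: keep the first record per key
                out.write(line)
                last_key = key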

Integration with ETL Vendors

Scripting in tasks
- The simplest and quickest integration: call the VelociData command line from scheduling or ETL tools (a hypothetical sketch follows below)
- Often used when offloading entire stages of ETL processing

Custom connector
- Utilize a custom VelociData-built interface into ETL components for tighter mid-job integration
- Examples include a Buildop in DataStage and a Custom Transform in Informatica

GUI-level integration
- Tighter integration allowing GUI developers to directly configure and call into VelociData
- For DataStage this is a Custom Stage; VelociData is working closely with IBM on its development

Metadata incorporation
- Communicates with the existing metadata environment for robust compliance
- Cooperates with existing data lineage and data governance tools
- Developing an independent metadata strategy
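A minimal sketch of the scripting-in-tasks pattern: an ETL job or scheduler shells out to the appliance's command line and propagates the exit status. The command name, flags, and paths below are hypothetical placeholders, not VelociData's documented CLI; only the calling pattern is the point.

    # Illustrative only: 'velocidata-cli' and its flags are hypothetical placeholders.
    import subprocess
    import sys

    cmd = [
        "velocidata-cli",                 # hypothetical command name
        "--job", "mainframe_to_hdfs",     # hypothetical job identifier
        "--input", "/staging/claims.ebc",
        "--output", "/staging/claims.csv",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure so the scheduler or ETL tool marks the stage as failed.
        print(result.stderr, file=sys.stderr)
        sys.exit(result.returncode)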

Source / Target Connectors

Cloudera Impala, DB2, Greenplum, HDFS, Hive, Informix, Microsoft SQL Server / Azure 2012, MySQL 5.x, Oracle 11g, PostgreSQL, Salesforce.com, Sqoop, Sybase, Sybase IQ, Teradata, text files, XML

VelociData vs. Database Appliances

VelociData Appliance
- A single 4U system
- Hardware-accelerated data transformation
- Highly optimized for data flow
- Custom accelerated operators for ETL tasks
- Tools and syntax designed for transformations and filtering
- Real-time conversion of data between disparate systems

Database Appliances
- Typically entire cabinets of equipment
- Hardware-accelerated SQL
- Highly optimized for storing and retrieving tables
- Accelerated general SQL tasks
- Set-oriented syntax for database operations
- Fast processing of tables that are already resident on the appliance

Use the right tool for the job.

Assisting Workflows by Offloading ELT

Data transformation and formatting
- Transform/convert non-database types (like XML and mainframe formats) into formatted rows and columns
- Sort and aggregate data on ingest
- Compute surrogate keys (a sketch follows below)
- Join reference data through lookup operations
- Pivot or de-pivot to shape data for warehousing

Data quality and cleansing
- Filter and select data
- Filter out empty values, bad values, or improperly formatted elements
- Standardize and regularize data

Security
- Mask data to remove PII/PCI data on the way to the warehouse
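As a closing illustration of two of the operations above, this sketch computes a surrogate key by hashing multiple field values with SHA-2 (as in the Key Generation row of the solution palette) and masks a PII field before it reaches the warehouse. It is a simplified software illustration under assumed field names, not the appliance's implementation; real masking would use a managed secret rather than the hard-coded salt shown here.

    # Illustrative surrogate-key generation (SHA-2) and PII masking; field names are hypothetical.
    import hashlib

    SALT = b"example-salt-not-for-production"    # assumption: real deployments manage secrets properly

    def surrogate_key(*fields: str) -> str:
        """Hash multiple field values into one deterministic key (SHA-256)."""
        joined = "\x1f".join(fields)              # unit separator avoids accidental collisions
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def mask_field(value: str) -> str:
        """One-way mask for a PII field: the same input always maps to the same token."""
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

    record = {"first": "Pat", "last": "Jones", "dob": "1975-02-14", "ssn": "123-45-6789"}
    record["customer_key"] = surrogate_key(record["first"], record["last"], record["dob"])
    record["ssn"] = mask_field(record["ssn"])     # PII never lands in the warehouse in the clear
    print(record)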