VelociData: Solving the Need for Speed in DataOps. Inside Analysis / Bloor Group Briefing, June 13, 2014


Transforming Speed and Economics of Data Operations to Achieve Time-Bound Service Levels, Gain a Wire-Speed Analytics Advantage, and Reduce the Cost and Complexity of BI/Analytics Architectures

VelociData is a purpose-built, data-operations micro-supercomputing appliance that is orders of magnitude faster, more scalable, and more cost-effective than conventional approaches to data transformation, data quality, data encryption, and data sorting.

VelociData Hyper-Acceleration Data Operations Hubs: Solving Data Integration Challenges for the Enterprise

vfusion = ETL/ELT function offload
- Addressing performance challenges
- Improving data quality on ingest
- Avoiding the cost of ETL and analytics platform build-outs

zfusion = vfusion + mainframe data and data operations offload
- Reducing mainframe MSU-related charges and deferring upgrades
- Improving data operations performance
- Migrating and converting mainframe data for new analytics architectures (e.g., Big Data)

VelociData Solution Palette

The slide's table groups solutions under the vfusion and zfusion suites and compares throughput; results are system dependent but intended to provide a magnitude comparison.

| VelociData Solution | Example | Conventional (records/s) | VelociData (records/s) |
| Lookup and Replace | Data enrichment by populating fields from a master file, dictionary translations, etc. (e.g., CP → Cardiopulmonologist) | 3,000-6,000 | 600,000 |
| Type Conversions | XML → Fixed; Binary → Char; date/time formats | 1,000-2,000 | 800,000 |
| Format Conversions | Rearrange, add, drop, merge, split, and resize fields to change layouts | 1,000-10,000 | 650,000 |
| Key Generation | Hash multiple field values into a unique key (e.g., SHA-2) | 3,000-20,000 | > 1,000,000 |
| Data Masking | Obfuscate data for non-production uses: persistent or dynamic; format preserving; AES-256 | 500-10,000 | > 1,000,000 |
| USPS Address Processing | Standardization, verification, and cleansing (CASS certification in process) | 600-2,000 | 400,000 |
| Domain Set Validation | Validate a value against a list of acceptable values (e.g., all product codes at a retailer; all countries in the world) | 1,000-3,000 | 750,000 |
| Field Content Validation | Validates based on patterns such as emails, dates, and phone numbers | 1,000-3,000 | > 1,000,000 |
| Field Content Validation | Data type validation and bounds checking | 3,000-6,000 | > 1,000,000 |
| Accelerated Data Sort | Sort data using complex sort keys from multiple fields within records | 7,000-20,000 | 1,000,000 |
| Mainframe Data Conversion | Copybook parsing and data layout discovery; EBCDIC, COMP, COMP-3 → ASCII, Integer, Float | 200-800 | > 200,000 |
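To make the Lookup and Replace row concrete, here is a minimal software sketch of dictionary-based field enrichment (e.g., expanding CP to Cardiopulmonologist). It is purely illustrative: VelociData performs this operation in hardware at wire rate, and the field names, file names, and dictionary contents below are hypothetical.

    # Illustrative lookup-and-replace enrichment; file and field names are hypothetical.
    import csv

    # Master-file dictionary: code -> expanded value (e.g., CP -> Cardiopulmonologist).
    with open("specialty_codes.csv", newline="") as f:
        lookup = {row["code"]: row["description"] for row in csv.DictReader(f)}

    with open("claims_in.csv", newline="") as src, \
         open("claims_out.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for record in reader:
            # Replace the code with its dictionary translation; keep unknown codes as-is.
            record["specialty"] = lookup.get(record["specialty"], record["specialty"])
            writer.writerow(record)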

Common ETL Bottlenecks

[Diagram: data sources (CSV, mainframe, XML, RDBMS, social media, sensors, Hadoop) feed ETL servers running Tasks #1 through #8, which load Hadoop, data warehouses, database appliances, BI tools, cloud targets, and staging databases. The transform tasks on the ETL servers are the candidates for acceleration.]

Example ETL Processes Offloaded to VelociData

[Diagram: the same extract-transform-load flow, with several of the numbered tasks offloaded from the ETL servers to VelociData.]
- Keep existing input interfaces
- Remove bottlenecks
- Reduce ETL server workload
- Faster total processing time

vfusion Data Operations Acceleration Hub: Wire-rate transformations enable fast data access between systems

- ETL server: preprocess data for fast movement into and out of data integration tools
- Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed, quality data for real-time BI efforts
- MPP platforms (e.g., Netezza): format and improve data for ready insertion into data analytics architectures; VelociData enables real-time data access by Netezza for real-time analytics

zfusion Data Operations Acceleration Hub: Wire-rate transformations enable fast data access between systems

- ETL server: preprocess data for fast movement into and out of data integration tools
- Hadoop: convert data to ASCII and improve quality in flight; VelociData feeds Hadoop pre-processed, quality data for real-time BI efforts
- MPP platforms (e.g., Netezza): format and improve data for ready insertion into data analytics architectures; VelociData enables real-time data access by Netezza for real-time analytics
- Mainframe: conversion into and out of EBCDIC and packed decimal formats

Live Demonstration: And now for something completely different...

Examples of New World Data Challenges Being Solved

- A property & casualty insurer shortens by 5x a daily task of processing 540 million records, enabling more accurate real-time quoting
- A retailer now integrates full customer data from in-store, online, and mobile sources in real time, processing 40,000 records/s (up from 100/s)
- A pharmaceutical discovery query is reduced from 8 days to 20 minutes
- A health benefits provider shortens a data integration process from 16 hours to 45 seconds to enable better customer support
- A logistics firm standardizes USPS addresses at 10 billion per hour for data cleansing on ingest
- A manufacturer sorts billions of records with multi-field keys at nearly 1M records/s for analytics and data quality
- A credit card company reduces mainframe costs and improves analytics performance by integrating historical and fresh data into Hadoop at line rates
- A financial processing network masks 5 million fields/s of production data to sell opportunity information to retailers

For More Information, Please Visit: velocidata.com

Thank You!

Questions?

Additional Slides

Today's Update from VelociData
- Fast access to real-time data
- Addressing total-flow bottlenecks
- VelociData's solutions
- New capabilities
- Live demo
- Questions

Enabling Three Layers of Data Access: Wire-rate transformations and convergence of fresh and historical data

[Diagram: sources such as sensors, weblogs, transactions, mainframe, Hadoop, social media, and RDBMSs flow through VelociData into three access layers.]
- VelociData enables real-time data access for immediate analytics and visualization
- VelociData feeds databases and warehouses pre-analytic, aggregated data for operational analytics
- VelociData delivers Hadoop pre-processed, quality data to keep the lake clean

Accessing Real-time and Historical Data

[Chart: a progression of analytics capability toward business excellence.]
- Real-time analysis for competitive advantage: enabling the speed of business to match business opportunities
- Real-time operational analytics
- Integrating historical data for operational excellence
- Informing traditional BI with real-time inputs
- Conventional batch-oriented BI
- Iterative modeling

How We Achieve Orders-of-Magnitude Improvement in Cost-Performance

VelociData Data Operations Appliance
- Micro-supercomputing appliance in a 4U rack form factor
- Heterogeneous system architecture that includes FPGAs, GPUs, and CPUs, with internal parallelism that dramatically outperforms general-purpose computers
- Purpose-built solutions that combine software, firmware, and massively parallel hardware to achieve acceleration approaching wire rates

Pricing and terms
- Low-risk subscription terms without upfront license fees
- Subscription includes a high-availability production system, Q/A system, disaster recovery system, unlimited usage, maintenance, support, and updates
- Fixed fee for initial installation and services

Architecture Price/Performance/Functionality Criteria

Criteria
- Transformation complexity: format conversions, intense lookups, SID generation
- Latency SLAs: limited batch windows, low latency / real time
- Data volumes: large data sets (e.g., tens of millions of records), high transaction volumes (e.g., hundreds of thousands of txns/sec)

BI/Analytics architecture choices
- ILM, feature-rich integration, data quality, MDM, PIM, IR, governance

Outcome: use the right tool for the job.

Parallelism in IT Processing is Compelling: Amdahl's Law (a worked example follows below)

High-performance computing history
- Systems were expensive
- Unique tools and training required
- Scaling performance is often sub-linear
- Issues with timing and thread synchronization
- HPC has struggled for 40 years to deliver widespread accessibility, mostly due to cost and poor abstraction, development tools, and design environment

If we could just deliver accessibility at an affordable cost...
- Hardware is now becoming inexpensive
- Application development improvements still needed to enable productivity
- ✓ Abstract through implementation of streaming as the paradigm
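As context for the sub-linear scaling point above, Amdahl's Law bounds the speedup from parallelism by the fraction of work that stays serial. The statement and the illustrative numbers below are standard textbook values, not figures from the slide:

    S(n) = \frac{1}{(1 - p) + \frac{p}{n}}, \qquad
    S(100)\Big|_{p = 0.95} = \frac{1}{0.05 + 0.0095} \approx 16.8, \qquad
    \lim_{n \to \infty} S(n) = \frac{1}{1 - p} = 20

Here p is the parallelizable fraction of the work and n the number of identical processing elements: even with 95% of the work parallelized, 100 cores deliver under 17x, and no number of cores exceeds 20x. This is why simply adding identical cores scales sub-linearly and why the deck emphasizes streaming and purpose-built hardware instead.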

Complementary Approach: Heterogeneous System Architecture

- Leverage a variety of compute resources: not just parallel threads on identical resources
- Right resources at the right times: functional elements use appropriate processing components where needed
- Accommodate stream processing: source → processing → target; a streaming data model enables pipelining and data-flow acceleration (see the sketch below)
- Embrace fine-grained pipeline/functional parallelism, especially data/direct parallelism
- Separate latency and throughput
- Engineered system: manage thread, memory, and resource timing and contention
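A minimal sketch of the streaming/pipelining idea using plain Python generators. It only illustrates the data-flow model (each stage consumes records as they arrive instead of materializing a full batch); it is not how the appliance is implemented, and the stage logic, record layout, and file names are hypothetical.

    # Illustrative streaming pipeline: source -> processing stages -> target.
    # Stage logic, record layout, and file names are hypothetical.

    def read_source(path):
        with open(path) as f:
            for line in f:                        # records flow one at a time
                yield line.rstrip("\n").split("|")

    def validate(records):
        for rec in records:
            if len(rec) == 3 and rec[0]:          # drop malformed records in flight
                yield rec

    def transform(records):
        for rec in records:
            yield [rec[0].strip().upper(), rec[1], rec[2]]

    def load(records, path):
        with open(path, "w") as out:
            for rec in records:
                out.write("|".join(rec) + "\n")

    # Stages are composed into one pipeline; nothing is buffered beyond a record,
    # so downstream work overlaps with upstream I/O (pipeline parallelism).
    load(transform(validate(read_source("input.psv"))), "output.psv")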

Offloading ELT Work from DW/Analytics Systems

- Offload from Teradata, Netezza, SAP, and other distributed platforms or appliances
- ETL tasks often migrate into these systems because of available capacity and performance problems (as opposed to true infrastructure design); sometimes as much as 80% of the target is used to perform the T (a quick illustration follows below)
- Performing cleansing, transformation, and sort in the load step can offload a tremendous amount of push-down ETL work: land clean, properly formatted data in the initial load
- Misplaced workload can be right-placed to a purpose-built system
  - Improves overall workflow performance
  - Future-proofs the architecture for increasing data volumes and variety
  - Recovers target resources, freeing them up to do what they were built to do
  - Often lowers total costs
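A back-of-the-envelope reading of the 80% figure above (an illustration of the argument, not a benchmark): if 80% of a warehouse appliance's capacity is consumed by transformation work, only 20% remains for the queries it was bought to serve, so moving the T off the box frees roughly

    \frac{1}{1 - 0.8} = 5\times

as much capacity for analytics, before counting any speedup of the transformations themselves.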

zfusion: Mainframe Function and Data Offload

Use cases
- BI applications such as operational analytics and dashboards
- Reduce mainframe MSU/MIPS costs associated with data processing
- Joins / aggregations / deduplication outside of the DB
- Data quality / filtering / masking
- Data movement enabling access to mainframe data (especially to Big Data)
- Basel II, SOX, and other regulatory reporting
- Mobile applications

Unique capabilities
- Automatic COBOL copybook processing and layout generation
- Data conversion between mainframe and x86 formats at mainframe speed (see the sketch below)
- Mainframe processing offload (transform, sort, mask, data quality, etc.) at line rate

Some customer characteristics
- Volatile/valuable mainframe data usage that creates unpredictable demands (e.g., financial services)
- Competitive advantage based on time to insight (e.g., retail, CPG)
- Highly mainframe-TCO-conscious, with misplaced data integration workloads (e.g., insurance)
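To show what mainframe-to-x86 data conversion involves, here is a minimal software sketch that decodes one EBCDIC text field and one COMP-3 (packed decimal) field from a fixed-layout record. The record layout and values are hypothetical, and the appliance does this from the COBOL copybook in hardware rather than with hand-written code like this.

    # Illustrative decode of one fixed-layout mainframe record (layout is hypothetical):
    #   bytes 0-9   customer name, EBCDIC text
    #   bytes 10-14 account balance, COMP-3 packed decimal, two implied decimal places

    def unpack_comp3(raw: bytes, scale: int = 2) -> float:
        """Decode IBM packed decimal: two BCD digits per byte, sign in the last nibble."""
        nibbles = []
        for b in raw:
            nibbles.append((b >> 4) & 0x0F)
            nibbles.append(b & 0x0F)
        sign = nibbles.pop()                      # 0xD means negative; 0xC/0xF positive
        value = int("".join(str(n) for n in nibbles))
        return (-value if sign == 0x0D else value) / (10 ** scale)

    # One 15-byte sample record: "JOHN SMITH" in EBCDIC, then 12345.67 as COMP-3.
    record = bytes.fromhex("D1D6C8D540E2D4C9E3C8") + bytes.fromhex("001234567C")

    name = record[0:10].decode("cp037").strip()   # EBCDIC -> Unicode/ASCII text
    balance = unpack_comp3(record[10:15])
    print(name, balance)                          # JOHN SMITH 12345.67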

Example Mainframe-to-Hadoop Workflow

Mainframe Input → Validation → Key Generation → Formatter → Lookup → Address Standardization → CSV Out

- Simple, configuration-driven workflow
- The sample shows Mainframe → HDFS
- Along the way, data are validated, cleansed, reformatted, enriched, and so on
- Land analytics-ready data as fast as it can move across the wire
- The workflow can also work in reverse to return processed data to the mainframe

VelociData: Continuous Innovation

1Q14
- Accelerated sorting operation
- FTP-driven workflows
- Data routing

2Q14
- Custom mainframe type support
- Aggregation / dedupe
- Transformation enhancements

2H14
- Pre-analytics statistics calculations
- Expression transformation
- Data JOIN
- Enhanced user interface
- Platform update

Customer quote: "What is unique about VelociData is you can prove the story quickly."

Heterogeneous System Architecture

- Standard CPUs: general purpose, not bad at anything; good branch prediction and fast access to large memory
- Graphics boards (GPUs): thousands of cores performing very specific tasks; excellent at matrix and floating-point work
- FPGA coprocessors: fully customizable, with extreme opportunities for parallelism; excel at bit manipulation for regex, cryptography, searching, etc.

Stream Processing AND Hadoop: Leveraging stream processing with batch-oriented Hadoop

- Access to more data for analytics
- Process data on ingest (also land raw data if desired): transformation, cleansing, security
- Never read a COBOL copybook again
- Stream sort for integrating data, aggregation, and dedupe (see the sketch below)
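As a small illustration of the stream sort / dedupe point, the sketch below merges several already-sorted input streams and drops duplicate keys on ingest without holding the full data set in memory. It only illustrates the streaming pattern; the keys, file names, and formats are hypothetical, and the appliance's own implementation is not shown.

    # Illustrative streaming merge + dedupe over pre-sorted inputs (hypothetical files).
    import heapq

    def keyed_records(path):
        """Yield (key, line) pairs from a file already sorted by its first field."""
        with open(path) as f:
            for line in f:
                yield line.split("|", 1)[0], line

    streams = [keyed_records(p) for p in ("pos_sorted.psv", "web_sorted.psv", "mobile_sorted.psv")]

    with open("customers_merged.psv", "w") as out:
        last_key = None
        # heapq.merge keeps the combined stream globally sorted by key.
        for key, line in heapq.merge(*streams, key=lambda kv: kv[0]):
            if key != last_key:                   # dedupe: keep the first record per key
                out.write(line)
                last_key = key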

Integration with ETL Vendors

Scripting in tasks
- The simplest and quickest integration: call the VelociData command line from scheduling or ETL tools (a hypothetical sketch follows below)
- Often used when offloading entire stages of ETL processing

Custom connector
- Utilize a custom VelociData-built interface into ETL components for tighter mid-job integration
- Examples include a Buildop in DataStage and a Custom Transform in Informatica

GUI-level integration
- Tighter integration allowing GUI developers to directly configure and call into VelociData
- For DataStage this is a Custom Stage; VelociData is working closely with IBM on its development

Metadata incorporation
- Communicates with the existing metadata environment for robust compliance
- Cooperates with existing data lineage and data governance tools
- Developing an independent metadata strategy
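A minimal sketch of the scripting-in-tasks pattern: an ETL job or scheduler shells out to the appliance's command line and propagates the exit status. The command name, flags, and paths below are hypothetical placeholders, not VelociData's documented CLI; only the calling pattern is the point.

    # Illustrative only: 'velocidata-cli' and its flags are hypothetical placeholders.
    import subprocess
    import sys

    cmd = [
        "velocidata-cli",                 # hypothetical command name
        "--job", "mainframe_to_hdfs",     # hypothetical job identifier
        "--input", "/staging/claims.ebc",
        "--output", "/staging/claims.csv",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure so the scheduler or ETL tool marks the stage as failed.
        print(result.stderr, file=sys.stderr)
        sys.exit(result.returncode)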

Source / Target Connectors

Cloudera Impala, DB2, Greenplum, HDFS, Hive, Informix, Microsoft SQL Server / Azure 2012, MySQL 5.x, Oracle 11g, PostgreSQL, Salesforce.com, Sqoop, Sybase, Sybase IQ, Teradata, text files, XML

VelociData vs. Database Appliances

VelociData Appliance
- A single 4U system
- Hardware-accelerated data transformation
- Highly optimized for data flow
- Custom accelerated operators for ETL tasks
- Tools and syntax designed for transformations and filtering
- Real-time conversion of data between disparate systems

Database Appliances
- Typically entire cabinets of equipment
- Hardware-accelerated SQL
- Highly optimized for storing and retrieving tables
- Accelerated general SQL tasks
- Set-oriented syntax for database operations
- Fast processing of tables that are already resident on the appliance

Use the right tool for the job.

Assisting Workflows by Offloading ELT

Data transformation and formatting
- Transform/convert non-database types (like XML and mainframe formats) into formatted rows and columns
- Sort and aggregate data on ingest
- Compute surrogate keys (a sketch follows below)
- Join reference data through lookup operations
- Pivot or de-pivot to shape data for warehousing

Data quality and cleansing
- Filter and select data
- Filter out empty values, bad values, or improperly formatted elements
- Standardize and regularize data

Security
- Mask data to remove PII/PCI data on the way to the warehouse
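As a closing illustration of two of the operations above, this sketch computes a surrogate key by hashing multiple field values with SHA-2 (as in the Key Generation row of the solution palette) and masks a PII field before it reaches the warehouse. It is a simplified software illustration under assumed field names, not the appliance's implementation; real masking would use a managed secret rather than the hard-coded salt shown here.

    # Illustrative surrogate-key generation (SHA-2) and PII masking; field names are hypothetical.
    import hashlib

    SALT = b"example-salt-not-for-production"    # assumption: real deployments manage secrets properly

    def surrogate_key(*fields: str) -> str:
        """Hash multiple field values into one deterministic key (SHA-256)."""
        joined = "\x1f".join(fields)              # unit separator avoids accidental collisions
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def mask_field(value: str) -> str:
        """One-way mask for a PII field: the same input always maps to the same token."""
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

    record = {"first": "Pat", "last": "Jones", "dob": "1975-02-14", "ssn": "123-45-6789"}
    record["customer_key"] = surrogate_key(record["first"], record["last"], record["dob"])
    record["ssn"] = mask_field(record["ssn"])     # PII never lands in the warehouse in the clear
    print(record)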