Big Data Analytics - Accelerated. stream-horizon.com




Legacy ETL platforms & the conventional Data Integration approach
- Unable to meet the latency & data throughput demands of Big Data integration challenges
- Based on batch (rather than streaming) design paradigms
- Insufficient parallelism of data processing
- Require specific ETL platform skills
- High TCO
- Inefficient software delivery lifecycle
- Require coding of ETL logic & staff with niche skills
- High-risk & unnecessarily complex release management
- Vendor lock-in
- Often require exotic hardware appliances
- Complex environment management
- Repository ownership (maintenance & database licences)

StreamHorizon's adaptiveETL platform - Performance
- Next-generation ETL & Big Data processing platform for delivering performance-critical Big Data projects
- Massively parallel data streaming engine backed by an In-Memory Data Grid (Coherence, Infinispan, Hazelcast or any other)
- ETL processes run in memory & interact with the cache (In-Memory Data Grid)
- Unnecessary, I/O-expensive staging ETL steps are eliminated
- Quick Time to Market (measured in days)
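The in-memory, staging-free processing model described above can be sketched generically. This is not StreamHorizon's API; the record format and transform are hypothetical, chosen only to show records flowing through parallel transformation threads without an intermediate staging write:

```java
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch of streaming-style, massively parallel ETL: every record is
// transformed in memory, with no staging table or intermediate file between steps.
public class StreamingEtlSketch {

    // Hypothetical transform: parse a "key,value" record and emit a derived fact row.
    static String transform(String record) {
        String[] fields = record.split(",");
        return fields[0] + "=" + Integer.parseInt(fields[1]) * 2;
    }

    // Process all records in parallel, entirely in memory (no staging I/O).
    // collect() preserves encounter order even for a parallel stream.
    public static List<String> run(List<String> records) {
        return records.parallelStream()
                      .map(StreamingEtlSketch::transform)
                      .collect(Collectors.toList());
    }
}
```

A real deployment would read from a feed and write to a database or grid; the point of the sketch is only the shape of the pipeline, not the connectors.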

StreamHorizon's adaptiveETL platform - Effectiveness
- Low TCO; fully configurable via XML; requires no ETL-platform-specific knowledge
- The shift from coding to XML configuration reduces the IT skills required to deliver, manage, run & outsource projects
- Eliminates 90% of manual coding
- Flexible & customizable: override any default behavior of the StreamHorizon platform with a custom Java, OS script or SQL implementation
- No vendor lock-in: all custom-written code runs outside the StreamHorizon platform, so there is no need to re-develop code when migrating your solution
- No ETL-tool-specific language
- Out-of-the-box features such as Type 0, 1, 2 and custom dimensions, plus dynamic in-memory cache formation transparent to the developer
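To make the "XML configuration instead of coding" claim concrete, a pipeline definition in this style might look as follows. Every element and attribute name here is invented for illustration; this is not StreamHorizon's actual configuration schema:

```xml
<!-- Illustrative only: element and attribute names are hypothetical -->
<etlPipeline name="dailySales">
  <source type="file" directory="/data/feeds" pattern="sales_*.csv"/>
  <dimension name="product" scdType="2"/>  <!-- slowly changing dimension, Type 2 -->
  <target type="jdbc" table="FACT_SALES" loadMode="bulk"/>
  <threads etl="12" db="16"/>
</etlPipeline>
```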

AdaptiveETL enables increased data velocity & reduced cost of delivery:
- Quick Time to Market: deploy StreamHorizon & deliver a fully functional pilot of your data integration project in a single week
- Total Project Cost / Data Throughput ratio of 0.2 (20% of the budget required in comparison with Market Leaders)
- 1-hour Proof of Concept: download and test-run StreamHorizon's demo Data Warehousing project
Big Data architectures supported:
- Hadoop ecosystem: fully integrated with the Apache Hadoop ecosystem (Spark, Storm, Pig, Hive, HDFS etc.)
- Conventional ETL deployments: data processing throughput of 1 million records per second (single commodity server, single database table)
- Big Data architectures with an In-Memory Data Grid acting as a Data Store (Coherence, Infinispan, Hazelcast etc.)

AdaptiveETL achievable data processing throughput:
- Conventional (RDBMS) ETL deployments: 1 million records per second (single database table, single commodity server, HDD storage)
- Hadoop ecosystem (StreamHorizon & Spark): over 1 million records per second for every node of your cluster
- File system (conventional & Hadoop HDFS): 1 million records per second per server (or cluster node) using HDD storage
Supported architectural paradigms:
- Lambda Architecture: Hadoop real-time & batch-oriented data streaming/processing architecture
- Data streaming & micro-batch architecture
- Massively parallel conventional ETL architecture
- Batch-oriented conventional ETL architecture

AdaptiveETL scalability & compatibility:
- Virtualizable & clusterable
- Horizontally & vertically scalable
- Highly Available (HA)
- Runs on Linux, Solaris, Windows & compute clouds (EC2 & others)
Deployment architectures:
- Runs on Big Data clusters: Hadoop, HDFS, Kafka, Spark, Storm, Hive, Impala and more
- Runs as a StreamHorizon Data Processing Cluster (ETL grid)
- Runs on a Compute Grid (alongside grid libraries such as a Quant Library or any other)

Targeted Program & Project profiles
Greenfield project candidates:
- Data Warehousing
- Data Integration
- Business Intelligence & Management Information
- OLTP systems (Online Transactional Processing)
Brownfield project candidates - quickly enable existing Data Warehouses & Databases which:
- Struggle to keep up with loading large volumes of data
- Don't satisfy SLAs from a query-latency perspective
Latency profile:
- Real Time (low latency: 0.666 microseconds per record on average)
- Batch oriented
Delivery profile:
- Working prototypes in under 5 days
- Quick Time to Market projects
- Compliance & legislation-critical deliveries
Skill profile:
- Low-to-medium-skilled IT workforce
- Offshore-based deliveries
Data volume profile:
- VLDB (Very Large Databases)
- Small-to-medium databases

Industries (not limited to):
- Finance: Market Risk, Credit Risk, Foreign Exchange, Tick Data, Operations
- Telecom: processing PM and FM data in real time (both radio and core)
- Insurance: policy pricing, claim profiling & analysis
- Health: activity tracking, care cost analysis, treatment profiling
- ISP: user activity analysis & profiling, log data mining, behavioral analysis

Architecture

High-Level Architecture
Data Sources (Databases, Hadoop, Direct Streaming, Web Services, Files & Directories, Middleware Messaging) feed the StreamHorizon Data Processing Platform, which delivers to the same range of Data Targets.
StreamHorizon Data Processing Platform components:
- In-Memory Data Grid
- Data Processing Engine
- Custom Plugins Engine
- Data Delivery Engine
- Processing Coordinator
Data processing throughput of 1+ million records per second.

Benchmarks

Benchmarks

Benchmark 1
- Reading method: buffered file read (JVM heap)
- Writing method: off
- Database load type: off
- Throughput: 2.9 million records/sec
- Deployment: Local Cluster (single server), 3 instances x 12 ETL threads per instance
- Comment: a theoretical scenario (data is generally always persisted) showing StreamHorizon's data processing ability without external bottlenecks; eliminates the bottlenecks introduced by the DB (persistence switched off), writing bulk files (switched off) and reading feed files (switched off)

File to File
- Reading method: file read (SAN)
- Writing method: write bulk file, or any other file (feed)
- Database load type: off
- Throughput: 1.86 million records/sec
- Deployment: Local Cluster (single server), 3 instances x 12 ETL threads per instance
- Comment: throughput for feed processing with a StreamHorizon Local Cluster running on a single physical server; shows the ability to create an output feed (bulk file, or any other file/feed) by applying ETL logic to an input file (feed)

BULK Load
- Reading method: file read (SAN)
- Writing method: write bulk file (fact table format)
- Database load type: bulk load
- Throughput: 1.9 million records/sec
- Deployment: Single Instance (single server), 1 instance x 50 DB threads
- Comment: throughput for DB bulk load (Oracle External Tables) with a single StreamHorizon instance running on a single physical server

JDBC Load
- Reading method: file read (SAN)
- Writing method: off
- Database load type: JDBC load
- Throughput: 1.1 million records/sec
- Deployment: Local Cluster (single server), 3 instances x 16 ETL threads
- Comment: throughput for JDBC load (Oracle) with a StreamHorizon Local Cluster running on a single physical server

Cost-Effectiveness

Functional & Hardware Profile - StreamHorizon vs Market Leaders
StreamHorizon offers, at no extra cost: horizontal & vertical scalability; clusterability, recoverability & High Availability; runs on commodity hardware*; runs on Linux, Solaris, Windows and compute clouds (EC2); can run on personal hardware (laptops & workstations)*
- Exotic database cluster licences: not required by StreamHorizon; often required by Market Leaders
- Specialized Data Warehouse appliances (exotic hardware): not required by StreamHorizon; often required by Market Leaders
* Reflects the efficiency of the StreamHorizon platform in hardware resource consumption compared with Market Leaders

Cost-Effectiveness Analysis - I (target throughput of 1 million records per second)
- Hardware cost: StreamHorizon 1 unit; Market Leaders 3-20 units
- Time to Market (installation, setup & full production-ready deployment)*: StreamHorizon 4 days; Market Leaders 40+ days
- Throughput (records per hour)**: StreamHorizon 3.69 billion (single engine); Market Leaders 1.83 billion (3 engines)
- Throughput (records per second)**: StreamHorizon 1 million (single engine); Market Leaders 510K (3 engines)
- Requires human resources with specialist ETL vendor skills: StreamHorizon no; Market Leaders yes
- Setup and administration based solely on intuitive XML configuration: StreamHorizon yes; Market Leaders no
- FTE headcount required to support the solution: StreamHorizon 0.25; Market Leaders 2
* Assuming the file data feeds to be processed by StreamHorizon (a project dependency) are available at the time of installation
** Please refer to the end of this document for a detailed description of the hardware environment
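The per-second and per-hour figures above are consistent with simple unit conversion (my own arithmetic, not vendor data): 1 million records/sec is 3.6 billion records/hour, close to the 3.69 billion quoted, and 510K records/sec is about 1.84 billion/hour, matching the 1.83 billion figure. The 0.666 microseconds-per-record latency quoted earlier likewise implies roughly 1.5 million records/sec:

```java
// Unit conversions relating the throughput and latency figures in this deck.
public class ThroughputMath {

    // records/sec -> records/hour
    public static double recordsPerHour(double recordsPerSecond) {
        return recordsPerSecond * 3600.0;
    }

    // average latency in microseconds per record -> records/sec
    public static double recordsPerSecondFromLatencyMicros(double microsPerRecord) {
        return 1_000_000.0 / microsPerRecord;
    }
}
```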

Cost-Effectiveness Analysis - II
[Two charts, recoverable only in outline: (1) hardware cost ($) against throughput (250K to 1m records per second) for Specialized Data Warehousing Appliances, Market Leading Vendors and StreamHorizon; (2) cost of ownership in man-days per month (effort & delivery) for Market Leading Vendors versus StreamHorizon.]

Cost-Effectiveness Analysis - III
[Two charts, recoverable only in outline: development effort in man-days for Market Leading Vendors versus StreamHorizon, and delivery Time to Market per project, from project inception to project delivery.]
Overall cost-effectiveness ratio comparison based on:
- Total costs (licence & hardware)
- Speed of delivery
- Solution complexity
- Maintainability

Use Case Study

Use Case Study - delivering a Market Risk system for a Tier 1 Bank
- Increased system throughput by a factor of 10
- Reduced the code base from 10,000+ lines of code to 420 lines
- Outsourcing model is now a realistic target (the delivered solution has almost no code base and is fully configurable via XML)
- Workforce savings: reduced number of FTE & part-time staff engaged on the project (due to simplification)
- Hardware savings: $200K of recycled servers (4 in total, no longer required)
- Software licence savings: $400K of recycled software licences (no longer required due to stack simplification)
- Net negative project cost (due to the savings achieved in recycled hardware and software licences)
- BAU / RTB budget reduced by 70% due to reduced implementation complexity & IT stack simplification

Use Case Study - continued
- A single server acts as both the StreamHorizon (ETL) server and the database server
- A single vanilla database instance delivers better query performance than the previously used OLAP cube (MOLAP mode of operation)
- By eliminating the OLAP engine from the software stack:
  - User query latency was reduced
  - ETL load latency was reduced by a factor of 10+
  - The number of concurrent users that can be supported increased
- The Tier 1 Bank was able to run the complete Risk batch data processing on a single desktop (without breaking SLAs)

Q&A stream-horizon.com

StreamHorizon Connectivity Map
- Relational databases: Oracle, MSSQL, DB2, Sybase, Sybase IQ, Teradata, MySQL, H2, HSQL, PostgreSQL, Derby, Informix, any other JDBC-compliant database
- Non-relational data targets: HDFS, MongoDB, Cassandra, HBase, Elasticsearch, TokuMX, Apache CouchDB, Cloudata, Oracle NoSQL Database, any other via plugins
- In-Memory Data Grids: Coherence, Infinispan, Hazelcast, any other via plugins
- Hadoop ecosystem: Spark, Storm, Kafka, TCP streams (Netty), Hive, Impala, Pig, any other via plugins
- Messaging: JMS (HornetQ, ActiveMQ), AMQP, Kafka, any other via plugins
- Non-Hadoop file systems: any file system excluding mainframe
Acting as:
- Hadoop-to-Hadoop data streaming platform
- Conventional ETL data processing platform
- Hadoop-to-non-Hadoop ETL bridge
- Non-Hadoop-to-Hadoop ETL bridge
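The recurring "any other via plugins" entries suggest targets are pluggable behind a common interface. A hypothetical shape for such a plugin (this is not StreamHorizon's actual SPI; the names are invented for illustration) could be:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin shape: one small writer interface, with an implementation
// per data target (JDBC, HDFS, Kafka, a grid, ...).
public class TargetPluginSketch {

    public interface TargetWriter {
        void write(String record);
    }

    // In-memory stand-in for a real target, useful for testing a pipeline
    // without any external system.
    public static class InMemoryWriter implements TargetWriter {
        public final List<String> rows = new ArrayList<>();

        @Override
        public void write(String record) {
            rows.add(record);
        }
    }
}
```

A design like this is what makes "no vendor lock-in" plausible: the pipeline depends only on the interface, and swapping targets means swapping one implementation.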