SNAPLOGIC TECHNOLOGY BRIEF

SNAPLOGIC BIG DATA INTEGRATION PROCESSING PLATFORMS

2 W Fifth Avenue, Fourth Floor, San Mateo, CA 94402 | telephone: 888.494.1570 | www.snaplogic.com

Big Data Integration Processing Platforms

One of our goals at SnapLogic is to match data flow execution requirements with an appropriate execution platform. Different data platforms have different benefits. This paper reviews the nature of SnapLogic data flow pipelines and how to choose an appropriate data platform. In addition to categorizing pipelines, it also explains our currently supported execution targets and our planned support for Apache Spark.

First, some preliminaries. All data processed by SnapLogic pipelines is handled natively in an internal JSON format; we call this document-oriented processing. Even flat, record-oriented data is converted into JSON for internal processing, which lets us handle both flat and hierarchical data seamlessly.

Pipelines are constructed from Snaps. Each Snap encapsulates specific application or technology functionality, and Snaps are connected together to carry out a data flow process. Pipelines are built with our visual designer. Some Snaps provide connectivity, such as connecting to databases or cloud applications. Some Snaps transform data, such as filtering out documents or adding, removing, and modifying fields. We also have Snaps that perform more complex operations such as sort, join, and aggregate.

Given this setup, we can categorize pipelines into two types: streaming and accumulating. In a streaming pipeline, documents flow independently; the processing of one document does not depend on any other document as it moves through the pipeline. Streaming pipelines have low memory requirements because each document can exit the pipeline as soon as it reaches the last Snap. In contrast, an accumulating pipeline must collect all documents from the input source before any result documents can be emitted. Pipelines containing sort, join, or aggregate are accumulating pipelines, and in some cases a pipeline can be partially accumulating. Accumulating pipelines can have high memory requirements, depending on the number of documents coming in from the input source.
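The streaming-versus-accumulating distinction can be sketched with plain Python generators. This is an illustrative model only (the Snap names and document shapes here are not SnapLogic APIs): a streaming Snap yields each document as it arrives, while an accumulating Snap must drain its input before emitting anything.

```python
def filter_snap(docs, predicate):
    """Streaming: each document flows through independently,
    so memory use stays constant regardless of input size."""
    for doc in docs:
        if predicate(doc):
            yield doc

def sort_snap(docs, key):
    """Accumulating: every input document must be collected
    before the first output document can be emitted."""
    return iter(sorted(docs, key=key))

# A small document source in the internal JSON-like form.
source = ({"user": u, "amount": a}
          for u, a in [("ann", 30), ("bob", 10), ("cat", 20)])

# Snaps compose into a pipeline: filter (streaming) then sort (accumulating).
pipeline = sort_snap(filter_snap(source, lambda d: d["amount"] > 15),
                     key=lambda d: d["amount"])
print(list(pipeline))
# [{'user': 'cat', 'amount': 20}, {'user': 'ann', 'amount': 30}]
```

Because `sort_snap` calls `sorted`, the whole (filtered) input is held in memory at once, which is exactly why accumulating pipelines are the memory-sensitive case when choosing an execution platform.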

Now let's turn to execution platforms. SnapLogic has an internal data processing platform called a Snaplex. Think of a Snaplex as a collection of processing nodes or containers that can execute SnapLogic pipelines. We have a few flavors of Snaplex: a Cloudplex is a Snaplex that we host in the cloud, and it can autoscale as pipeline load increases; a Groundplex is a fixed set of nodes installed on-premises or in a customer VPC. With a Groundplex, customers can do all of their data processing behind their firewall so that data never leaves their infrastructure.

We are also expanding our support for external data platforms. We recently released our Hadooplex technology, which allows SnapLogic customers to use Hadoop as an execution target for SnapLogic pipelines. A Hadooplex leverages YARN to schedule Snaplex containers on Hadoop nodes in order to execute pipelines; in this way, we can autoscale inside a Hadoop cluster. We also recently introduced SnapReduce 2.0, which enables a Hadooplex to translate SnapLogic pipelines into MapReduce jobs. A user builds a designated SnapReduce pipeline and specifies HDFS files as input and output. These pipelines are compiled to MapReduce jobs so they can execute on very large data sets that live in HDFS. (Check out the demonstration in our recent cloud and big data analytics webinar.)

Finally, as we announced as part of Cloudera's real-time streaming announcement, we have begun work on support for Spark as a target big data platform. A Sparkplex will be able to use SnapLogic's extensive connectivity to bring data into and out of Spark RDDs (Resilient Distributed Datasets). In addition, similar to SnapReduce, we will allow users to compile SnapLogic pipelines into Spark code so the pipelines can run as Spark jobs. We will support both streaming and batch Spark jobs. By including Spark in our data platform support, we will give our customers a comprehensive set of options for pipeline execution.
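To give a feel for the kind of translation SnapReduce performs, here is a hedged sketch of how an aggregating pipeline step maps onto the two phases of a MapReduce job. The function names and document fields are illustrative, not SnapLogic's actual compiler output; on a real cluster the shuffle between the phases distributes each key's values to a single reducer.

```python
from collections import defaultdict

def map_phase(docs):
    # Map: emit one (key, value) pair per input document.
    for doc in docs:
        yield doc["region"], doc["amount"]

def reduce_phase(pairs):
    # Reduce: group values by key, then aggregate each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

docs = [{"region": "west", "amount": 5},
        {"region": "east", "amount": 7},
        {"region": "west", "amount": 3}]
print(reduce_phase(map_phase(docs)))   # {'west': 8, 'east': 7}
```

Because each phase only needs its own slice of the data, an aggregate expressed this way can run over HDFS-resident data sets far larger than any single node's memory.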
Choosing the right big data platform depends on many factors: data size, latency requirements, connectivity, and pipeline type (streaming versus accumulating). Here are some guidelines for choosing a particular big data integration platform:

Cloudplex
- Cloud-to-cloud data flow
- Accumulating pipelines in which the accumulated data can fit into node memory

Groundplex
- Accumulating pipelines in which the accumulated data can fit into node memory

Hadooplex
- Accumulating pipelines can operate on arbitrary data sizes via MapReduce

Sparkplex
- Allows for Spark connectivity to all SnapLogic accounts
- Accumulating pipelines can operate on data sizes that can fit in Spark cluster memory
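The reason a Hadooplex can run accumulating pipelines over arbitrary data sizes is that MapReduce sorts out-of-core: data is sorted in memory-sized chunks that are spilled to disk and then merged. The sketch below illustrates that technique (an external merge sort) in miniature; the tiny chunk size and the JSON-lines spill format are simplifications for illustration, not SnapLogic internals.

```python
import heapq
import json
import os
import tempfile

def external_sort(docs, key, chunk_size=2):
    """Sort an arbitrarily large stream of documents with bounded
    memory: sort fixed-size chunks, spill each to disk as a sorted
    run, then k-way merge the runs."""
    run_paths, chunk = [], []

    def spill():
        if not chunk:
            return
        with tempfile.NamedTemporaryFile("w", delete=False,
                                         suffix=".run") as f:
            for doc in sorted(chunk, key=key):
                f.write(json.dumps(doc) + "\n")
            run_paths.append(f.name)
        chunk.clear()

    for doc in docs:
        chunk.append(doc)
        if len(chunk) >= chunk_size:
            spill()
    spill()  # flush the final partial chunk

    files = [open(p) for p in run_paths]
    try:
        # Merge keeps only one document per run in memory at a time.
        runs = [(json.loads(line) for line in f) for f in files]
        yield from heapq.merge(*runs, key=key)
    finally:
        for f, path in zip(files, run_paths):
            f.close()
            os.remove(path)

docs = [{"id": i} for i in [5, 1, 4, 2, 3]]
print([d["id"] for d in external_sort(docs, key=lambda d: d["id"])])
# [1, 2, 3, 4, 5]
```

With `chunk_size=2`, five documents produce three sorted runs on disk, and `heapq.merge` streams the globally sorted result while holding only one document per run in memory.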

Note that recent work in the Spark community has increased support for out-of-core computations, such as sorting. This means that accumulating pipelines currently suitable only for MapReduce execution may become supported in Spark as its out-of-core support becomes more general. The Hadooplex and Sparkplex also add reliable execution benefits, so that long-running pipelines are guaranteed to complete.

At SnapLogic, our goal is to allow customers to create and execute arbitrary data flow pipelines on the most appropriate data platform. In addition, we provide a simple and consistent graphical UI for developing pipelines, which can then execute on any supported platform. This platform-agnostic approach decouples the data processing specification from data processing execution. As your data volume increases or your latency requirements change, the same pipeline can execute on larger data and at a faster rate just by changing the target data platform. Ultimately, SnapLogic allows you to adapt to your data requirements and doesn't lock you into a specific big data platform.

About SnapLogic

SnapLogic connects enterprise data and applications in the cloud and on-premises for faster decision-making and improved business agility. With the SnapLogic Elastic Integration Platform, organizations can more quickly and affordably accelerate the cloudification of enterprise IT with fast, multi-point, and modern connectivity of big data, applications, and things. Funded by leading venture investors, including Andreessen Horowitz and Ignition Partners, and co-founded by Gaurav Dhillon, co-founder and former CEO of Informatica, SnapLogic is used by prominent companies in the Global 2000. For more information about SnapLogic, visit www.snaplogic.com.