Big Data Architecture

Similar documents
Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared

WELCOME. Where and When should I use the Oracle Service Bus (OSB) Guido Schmutz. UKOUG Conference

Unified Batch & Stream Processing Platform

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

How To Create A Data Visualization With Apache Spark And Zeppelin

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Big Data and Fast Data combined is it possible?

Architectures for massive data management

Architecting for the Internet of Things & Big Data

Real Time Data Processing using Spark Streaming

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

The Internet of Things and Big Data: Intro

Hadoop Ecosystem B Y R A H I M A.

Integrating a Big Data Platform into Government:

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

Big Data Analytics Hadoop and Spark

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

#TalendSandbox for Big Data

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Dell In-Memory Appliance for Cloudera Enterprise

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

From Spark to Ignition:

Next-Gen Big Data Analytics using the Spark stack

April 2016 JPoint Moscow, Russia. How to Apply Big Data Analytics and Machine Learning to Real Time Processing. Kai Wähner.

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Ali Ghodsi Head of PM and Engineering Databricks

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Hadoop & Spark Using Amazon EMR

How Companies are! Using Spark

BIG DATA ANALYTICS For REAL TIME SYSTEM

Reference Architecture, Requirements, Gaps, Roles

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

The Future of Data Management

Unified Big Data Processing with Apache Spark. Matei

Executive Summary... 2 Introduction Defining Big Data The Importance of Big Data... 4 Building a Big Data Platform...

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data and Industrial Internet

Big Data and Analytics in Government

WHITE PAPER. Building Big Data Analytical Applications at Scale Using Existing ETL Skillsets INTELLIGENT BUSINESS STRATEGIES

Real-time Big Data Analytics with Storm

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

IOT & Big Data: The Future Information Processing Architecture

Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Where is... How do I get to...

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Information Builders Mission & Value Proposition

SOUG-SIG Data Replication With Oracle GoldenGate Looking Behind The Scenes Robert Bialek Principal Consultant Partner

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Oracle Service Bus vs. Oracle Enterprise Service Bus vs. BPEL wann soll welche Komponente eingesetzt werden?

Unified Big Data Analytics Pipeline. 连 城

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Big Data Analysis: Apache Storm Perspective

An Oracle White Paper June Oracle: Big Data for the Enterprise

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Talend Big Data. Delivering instant value from all your data. Talend

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Are You Big Data Ready?

NoSQL for SQL Professionals William McKnight

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Real Time Big Data Processing

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Data Services Advisory

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Introduction to Big Data Training

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Large scale processing using Hadoop. Ján Vaňo

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Big Data. Marriage of RDBMS-DWH and Hadoop & Co. Author: Jan Ott Trivadis AG Trivadis. Big Data - Marriage of RDBMS-DWH and Hadoop & Co.

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

TRAINING PROGRAM ON BIGDATA/HADOOP

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Ganzheitliches Datenmanagement

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Moving From Hadoop to Spark

The Future of Data Management with Hadoop and the Enterprise Data Hub

Evolution to Revolution: Big Data 2.0

How To Use Big Data For Telco (For A Telco)

HDP Hadoop From concept to deployment.

Understanding traffic flow

The Lab and The Factory

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Big Data Analytics Nokia

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

Big Data and Hadoop with Components like Flume, Pig, Hive and Jaql

Big Data. Lyle Ungar, University of Pennsylvania

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

An Oracle White Paper October Oracle: Big Data for the Enterprise

Big Data Research in the AMPLab: BDAS and Beyond

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

SAS BIG DATA SOLUTIONS ON AWS SAS FORUM ESPAÑA, OCTOBER 16 TH, 2014 IAN MEYERS SOLUTIONS ARCHITECT / AMAZON WEB SERVICES

Dominik Wagenknecht Accenture

Transcription:

Big Architecture Guido Schmutz BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Guido Schmutz Working for Trivadis for more than 18 years Oracle ACE Director for Fusion Middleware and SOA Co-Author of different books Consultant, Trainer Software Architect for Java, Oracle, SOA and Big / Fast Member of Trivadis Architecture Board Technology Manager @ Trivadis More than 25 years of software development experience Contact: guido.schmutz@trivadis.com Blog: http://guidoschmutz.wordpress.com Twitter: gschmutz

Agenda 1. Introduction 2. Traditional Architecture for Big 3. Streaming Analytics Architecture for Fast 4. Lambda/Kappa/Unifed Architecture for Big 5. Summary

Introduction

Big is still work in progress Choosing the right architecture is key for any (big data) project Big is still quite a young field and therefore there are no standard architectures available which have been used for years In the past few years, a few architectures have evolved and have been discussed online Know the use cases before choosing your architecture To have one/a few reference architectures can help in choosing the right components

Hadoop Ecosystem many choices. Unstructured Sources Analytics SQL on Hadoop Management / Monitoring Workflow/Job Structured Sources Core Storage Serialization Security

Important Properties to choose a Big Architecture Latency Keep raw and un-interpreted data forever? Volume, Velocity, Variety, Veracity Ad-Hoc Query Capabilities needed? Robustness & Fault Tolerance Scalability

From Volume and Variety to Velocity Big has evolved Past Big = Volume & Variety Present Big = Volume & Variety & Velocity and the Hadoop Ecosystem as well. Past Batch Processing Time to insight of Hours Present Batch & Stream Processing Time to insight in Seconds Adapted from Cloudera Blog article

Traditional Architecture for Big

Traditional Architecture for Big Sources Collection (Analytical) Processing Result Storage Access Logfiles ERP RDBMS Social Sensor Stage Channel Raw (Reservoir) Batch compute Computed Information Query Engine Reports Service Analytic Machine Alerting Mobile = in Motion = at Rest

Use Case 1) Click Stream analysis: 360 degree view of customer Sources Collection (Analytical) Processing Access Reports Logfiles Channel Raw (Reservoir) Batch compute Computed Information Query Engine Analytic

Use Case 2) Ingest Relational into Hadoop and make it accessible Sources Collection (Analytical) Processing Access Reports RDBMS Raw (Reservoir) Batch compute Computed Information Query Engine Service Analytic Alerting

Use Case 2a) Ingest Relational into Hadoop and make it accessible Sources Collection (Analytical) Processing Access Reports RDBMS (CDC) Channel Raw (Reservoir) Batch compute Computed Information Query Engine Service Analytic Alerting

Hadoop Ecosystem Technology Mapping Sources Collection (Analytical) Processing Result Storage Access Logfiles ERP RDBMS Social Sensor Staging Channel Raw (Reservoir) Batch compute Computed Information Query Engine Reports Service Analytic Machine Alerting Mobile = in Motion = at Rest

Apache Spark the new kid on the block Apache Spark is a fast and general engine for large-scale data processing The hot trend in Big! Originally developed 2009 in UC Berkley s AMPLab Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk One of the largest OSS communities in big data with over 200 contributors in 50+ organizations Open Sourced in 2010 since 2014 part of Apache Software foundation Supported by many vendors

Motivation Why Apache Spark? Apache Hadoop MapReduce: Sharing on Disk HDFS read HDFS write HDFS read HDFS write map reduce... Input Output Apache Spark: Speed up processing by using Memory instead of Disks Input op1 op2... Output

Apache Spark Ecosystem Libraries Spark SQL (Batch Processing) Blink DB (Approximate Querying) Spark Streaming (Real-Time) MLlib, Spark R (Machine Learning) GraphX (Graph Processing) Core Runtime Spark Core API and Execution Model Cluster Resource Managers / Stores Spark Standalone MESOS YARN HDFS Elastic Search NoSQL S3

Use Case 3) Predictive Maintenance through Machine Learning on collected data Sources Collection (Analytical) Processing Access File Staging Reports DB Machine Channel Raw (Reservoir) Batch compute Computed Information Query Engine Service Analytic Alerting

Spark Ecosystem Technology Mapping Sources Collection (Analytical) Processing Access Logfiles ERP RDBMS Social Sensor Stage Channel Raw (Reservoir) Batch compute Computed Information Query Engine Reports Service Analytic Machine Alerting Mobile = in Motion = at Rest

Traditional Architecture for Big Batch Processing Not for low latency use cases Spark can speed up, but if positioned as alternative to Hadoop Map/ Reduce, it s still Batch Processing Spark Ecosystems offers a lot of additional advanced analytic capabilities (machine learning, graph processing, )

Streaming Analytics Architecture for Big

Streaming Analytics Architecture for Big aka. (Complex) Event Processing) Sources ERP RDBMS Collection (Analytical) Real-Time Processing Info Delivery Access Reports Logfiles Social Sensor Machine Channel Stream/Event Processing Batch compute Messaging Service Analytic Alerting Mobile = in Motion = at Rest

Use Case 4) Alerting in Internet of Things (IoT) Sources Collection (Analytical) Real-Time Processing Info Delivery Access Sensor Machine Channel Stream/Event Processing Batch compute Analytic Messaging Alerting

Use Case 5) Real-Time Analytics on Sensor Events Sources Collection (Analytical) Real-Time Processing Info Delivery Access Sensor Machine Channel Stream/Event Processing Batch compute Analytic Messaging

Unified Log (Event) Processing Stream processing allows for computing feeds off of other feeds Meter Readings Collector Raw Meter Readings Derived feeds are no different than original feeds they are computed off Customer Enrich / Transform Meter with Customer Aggregate by Minute Meter by Minute Persist Raw Meter Readings Single deployment of Unified Log but logically different feeds Aggregate by Minute Persist Meter by Customer by Minute Meter by Minute

Streaming Analytics Technology Mapping Sources Collection (Analytical) Real-Time Processing Info Delivery Access ERP RDBMS Reports Logfiles Social Sensor Machine Channel Stream/Event Processing Batch compute Messaging Service Analytic Alerting Mobile = in Motion = at Rest

Streaming Analytics Architecture for Big The solution for low latency use cases Process each event separately => low latency Process events in micro-batches => increases latency but offers better reliability Previously known as Complex Event Processing Keep the data moving / in Motion instead of at Rest => raw events are (often) not stored

Lambda Architecture for Big

Lambda Architecture for Big Sources Collection (Analytical) Batch Processing Batch Result Store Access Logfiles ERP RDBMS Social Sensor Channel Raw (Reservoir) Batch compute Computed Information (Analytical) Real-Time Batch Processing compute Query Engine Real-Time Reports Service Analytic Machine Mobile Stream/Event Processing Messaging Alerting = in Motion = at Rest

Use Case 6) Social Media and Social Network Analysis Sources Collection (Analytical) Batch Processing Batch Result Store Access Social Channel Raw (Reservoir) Batch compute Computed Information (Analytical) Real-Time Batch Processing compute Query Engine Real-Time Reports Service Analytic Stream/Event Processing Messaging Alerting = in Motion = at Rest

Lambda Architecture for Big Combines (Big) at Rest with (Fast) in Motion Closes the gap from high-latency batch processing Keeps the raw information forever Makes it possible to rerun analytics operations on whole data set if necessary => because the old run had an error or => because we have found a better algorithm we want to apply Have to implement functionality twice Once for batch Once for real-time streaming

Kappa Architecture for Big

Kappa Architecture for Big Sources Collection Raw Reservoir Access ERP RDBMS Raw (Reservoir) Reports Logfiles Service Social Sensor Messaging (Analytical) Real-Time Batch Processing compute Analytic Machine Mobile Stream/Event Processing Messaging Alerting = in Motion = at Rest

Kappa Architecture for Big The solution for low latency use cases Process each event separately => low latency Process events in micro-batches => increases latency but offers better reliability Previously known as Complex Event Processing Keep the data moving / in Motion instead of at Rest

Unified Architecture for Big

Unified Architecture for Big Sources Collection (Analytical) Batch Processing (Calculate Models of incoming data) Batch Result Store Access Logfiles ERP RDBMS Social Sensor Machine Mobile Channel Raw (Reservoir) Batch compute (Analytical) Real-Time Batch Processing compute Prediction Models Stream/Event Processing Computed Information Query Engine Real-Time Messaging Reports Service Analytic Alerting = in Motion = at Rest

Use Case 7) Fraud Detection Sources Collection (Analytical) Batch Processing (Calculate Models of incoming data) Batch Result Store Access Logfiles ERP RDBMS Sensor Machine Mobile Channel Raw (Reservoir) Batch compute (Analytical) Real-Time Batch Processing compute Prediction Models Stream/Event Processing Computed Information Query Engine Real-Time Messaging Reports Service Analytic Alerting

Summary

Summary Know your use cases and then choose your architecture and the relevant components/ products/frameworks You don t have to use all the components of the Hadoop Ecosystem to be successful Big is still quite a young field and therefore there are no standard architectures available which have been used for years Lambda, Kappa Architecture are best practices architectures which you have to adapt to your environment

Name Referent Titel Referent Tel. +00 00 000 00 00 vorname.name@trivadis.com