Creating Big Data Applications with Spring XD




Creating Big Data Applications with Spring XD
Thomas Darimont, @thomasdarimont

THE FASTEST PATH TO NEW BUSINESS VALUE

Journey
- Introduction
- Concepts
- Applications
- Outlook

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

Introduction

Spring XD - Overview
- Platform for Big Data Applications
- Ingestion, Processing, Movement, Analytics
- Stream and Batch Processing
- Scalable Distributed Runtime
- Support for Deep Analytics
- Proven Spring Technologies

Spring XD - Why yet another Big Data Platform?
- Alternative to Frameworks like Flume, Oozie, Sqoop, Storm
- Just one Platform instead of many
- Common things easy, complex things possible
- Complementary to many technologies
  - Big SQL / MPP Databases - Impala, HAWQ
  - Stream Processing - Apache Spark
  - NoSQL DataStores - Cassandra, MongoDB

Spring XD (eXtreme Data) - one-stop shop for developing and deploying Big Data Apps

Spring XD - 10,000 Foot View
[Diagram. Labels: Shell (>_), REST, Spring XD Runtime, Streams, Taps, Jobs, ingest, workflow, export, Compute, RDBMS, Redis, NoSQL, HDFS, Predictive Modelling (R, SAS).]

Spring XD - Easy to Setup and Run
Example: store incoming HTTP data in HDFS

Spring XD - Easy to Setup and Run
1. Install via package manager / unzip
2. Start
   $ xd-singlenode
   $ xd-shell
3. Define
   xd:> stream create ingest --definition "http | hdfs"
4. Run
   xd:> stream deploy ingest
Yes, writing HTTP data to HDFS can be that simple!

Core Concepts

Spring XD - Core Concepts
- Runtime
- Modules
- Streams
- Taps
- Analytics
- Jobs
- Extensibility
- Deployment

Spring XD - Runtime
- Hosts Stream Processing & Batch Workflows & Analytics
- Manages Component Distribution
- Communication via MessageBus
- Additional Services
  - Configuration / Cluster State: ZooKeeper
  - Analytics: Redis, In-Memory
  - Message Bus: Redis, RabbitMQ, Kafka, Local

Spring XD - Instance Types
- XD-Admin
  - Assigns Modules to Containers
  - Manages Cluster
  - Failover & HA
- XD-Container
  - Loads / Executes Modules
  - Connects to Data Bus
  - Standalone, YARN, Cloud Foundry
[Diagram. Labels: XD UI, XD Shell, XD Admin (one marked Leader), XD Container (several modules each), Batch Job State DB, Analytics Repository, ZK, Kafka/RabbitMQ/Redis.]

Spring XD - Runtime Modes
- single-node / standalone (Development): XD Admin, XD Container with Modules, ZK, DB and MB in one JVM
- multi-node / distributed (Production): XD Admin, ZK, DB, MB and each XD Container with its Modules in separate JVMs
(ZK = ZooKeeper, DB = database, MB = message bus)

Spring XD - Distributed Runtime
[Diagram. XDA = XD Admin, XDC = XD Container. Labels: XDA deploys via ZooKeeper; XDCs host modules (time, log) and bind to the Message Bus.]

Modules

Modules
- Unit of execution
- Source, Sink, Processor, Jobs
- Defined in XML or JVM Language
- Spring config file with Spring Bean Definitions
- Can have Parameters
- 50+ already included in XD
- Define new Modules via Composition

Modules - Overview
- Source (#20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, JMS, RabbitMQ, Time, MQTT, Mongo, Kafka, JDBC, Gemfire (CQ, Source), Twitter (Search, Stream), Stdout Capture
- Processor (#13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON
- Sink (#20): Log, File, JDBC, TCP, MQTT, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1, Gauge + 1

Streams

Streams
- Programming model for real-time processing
- How data is collected, processed, and stored or forwarded
- DSL analogous to Unix Pipes and Filters
  - Source | Processor (0..*) | Sink
- Data is pumped through MessageBus
- Spring Integration Components
[Diagram. Stream: Source -> Processor -> Sink, connected via the Message Bus.]
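The pipes-and-filters idea can be sketched in plain Python. This is only an illustration of the programming model, not Spring XD code: the functions source, processor, sink and run_stream are hypothetical names, and in-process generators stand in for the message bus.

```python
from typing import Callable, Iterable, Iterator, List


def source(payloads: Iterable[str]) -> Iterator[str]:
    """Stand-in for a source module (e.g. http): emits payloads one by one."""
    for payload in payloads:
        yield payload


def processor(fn: Callable[[str], str], upstream: Iterator[str]) -> Iterator[str]:
    """Stand-in for a processor module: applies fn to every payload."""
    for payload in upstream:
        yield fn(payload)


def sink(upstream: Iterator[str]) -> List[str]:
    """Stand-in for a sink module (e.g. log): collects what arrives."""
    return list(upstream)


def run_stream(payloads: Iterable[str],
               processors: List[Callable[[str], str]]) -> List[str]:
    """Chain source | processor (0..*) | sink, like a Unix pipeline."""
    stage = source(payloads)
    for fn in processors:
        stage = processor(fn, stage)
    return sink(stage)
```

Calling run_stream(["hello"], [str.upper]) mirrors a stream with one transform stage, and an empty processor list corresponds to a direct source-to-sink stream.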

Streams - Example
Transform payload incoming from HTTP to uppercase and send to log:
  stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Here http is the Source, transform the Processor, --expression its Option, and log the Sink.

Taps

Taps
- Special type of Stream
- Consume data along the processing pipeline
- Original stream stays unaffected
- Collect metrics and perform analytics
[Diagram. Stream: Source -> Processor -> Sink over the Message Bus; Tap: Processor -> Sink, fed from the stream.]

Taps - Example
First create the stream:
  stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Then create the tap onto the transform stage, add a prefix and send to log:
  stream create test1tap --definition "tap:stream:test1.transform > transform --expression='tapped: '+payload | log" --deploy
Here tap:stream:test1.transform is the Tap Source and > is the Redirection.
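A tap can be pictured as a listener attached to the output of one stage. The sketch below is an assumption-laden illustration in Python, not how Spring XD implements taps: TappableStream, tap and send are hypothetical names, and listeners are called in-process rather than over a message bus.

```python
from typing import Callable, Dict, List, Tuple


class TappableStream:
    """A chain of named stages; taps observe a stage's output without altering it."""

    def __init__(self, stages: List[Tuple[str, Callable[[str], str]]]) -> None:
        self.stages = stages
        self.taps: Dict[str, List[Callable[[str], None]]] = {}

    def tap(self, stage_name: str, listener: Callable[[str], None]) -> None:
        """Register a listener on the output of the named stage."""
        self.taps.setdefault(stage_name, []).append(listener)

    def send(self, payload: str) -> str:
        """Push one payload through all stages, notifying taps along the way."""
        for name, fn in self.stages:
            payload = fn(payload)
            for listener in self.taps.get(name, []):
                listener(payload)  # the tap sees a copy; the main flow continues
        return payload


main_log: List[str] = []
tap_log: List[str] = []

stream = TappableStream([
    ("transform", str.upper),
    ("log", lambda p: main_log.append(p) or p),
])
stream.tap("transform", lambda p: tap_log.append("tapped: " + p))
stream.send("hello")
```

After the call, main_log holds the uppercased payload and tap_log holds the prefixed copy, so the original stream's result is the same with or without the tap.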

Analytics

Analytics
- Counters
  - Simple Counter - how many tweets?
  - Field Value Counter - how many for tag=#java?
  - Aggregate Counter - how many tweets for #java per time interval?
- Gauges
  - Gauge - what was the last seen value?
  - Rich Gauge - what was the last seen value/avg/min/max?
- Backed by Redis, In-Memory via Spring Data Repositories
- Accessible via XD-Shell and REST API on XD-Admin
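To make the difference between these metric types concrete, here is a minimal in-memory sketch in Python. It is not the Spring XD implementation: RichGauge and count_tweets are hypothetical names, and the tweet dictionaries with "tag" and "timestamp" keys (timestamp as integer seconds) are an assumed input shape.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class RichGauge:
    """Tracks last seen value plus min, max and running average."""
    last: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    total: float = 0.0
    count: int = 0

    def update(self, value: float) -> None:
        self.last = value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        self.total += value
        self.count += 1

    @property
    def average(self) -> Optional[float]:
        return self.total / self.count if self.count else None


def count_tweets(tweets: Iterable[Dict], bucket_seconds: int = 60
                 ) -> Tuple[int, Counter, Counter]:
    """Return (simple counter, field value counter, aggregate counter)."""
    simple = 0                 # how many tweets?
    by_tag: Counter = Counter()        # how many per tag value?
    per_interval: Counter = Counter()  # how many per tag per time bucket?
    for tweet in tweets:
        simple += 1
        by_tag[tweet["tag"]] += 1
        bucket = tweet["timestamp"] // bucket_seconds * bucket_seconds
        per_interval[(tweet["tag"], bucket)] += 1
    return simple, by_tag, per_interval
```

A plain gauge corresponds to keeping only the last field of RichGauge.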

Advanced Analytics
- Processor Modules
  - Python: numpy, pandas, scikit-learn, NLTK, SimpleCV
  - Shell: R-Project Rscript, OpenCV
  - Java / Groovy
- PMML Processor Module
  - Predictive Model Markup Language
  - Description of Parameterised Data Mining Models
  - Allows operationalising Predictive Models
  - Real-time evaluation and scoring

Jobs

Jobs
- Programming Model for Batch Processing
- Create, Schedule, Execute and Monitor
- Spring Batch and Spring Hadoop Components
- Jobs (#5): CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB

Jobs - Example
Create job from existing job definition:
  job create --name "helloworld-job" --definition "helloworld" --deploy
Run job once:
  job launch --name "helloworld-job"
Run job periodically:
  stream create --name "hw-cron" --definition "trigger --cron='0/5 * * * * *' > queue:job:helloworld-job" --deploy

Management

Spring XD - Shell
- CLI based on Spring Shell
- Manages Streams, Jobs, Analytics and Deployment
- Completion / Assist
- Many built-in Commands: try help
- Started via xd-shell

Spring XD - Admin UI
- Management Interface accessible from XD-Admin Node
- XD-ADMIN:9393/admin-ui

Spring XD - REST
- Interface accessible from XD-Admin Node
- Used by XD-Shell and Admin-UI
- http://xd-admin:9393

Extensibility

Extensibility
- Custom Modules
  - Source, Sink, Processor, Job
  - Spring Integration, Spring Batch
  - E.g. to wrap a Java Library
  - Upload new modules via XD-Shell / REST
- Register custom Spring Expression Language Aliases
  - from java.lang.Double.parseDouble(payload.sensorvalue) to #parseDouble(payload.sensorvalue)
- Scripts
  - Collection of XD commands
  - Automation

Deployment

Deployment
- deploy or --deploy
  - stream deploy firststream
  - stream create secondstream --deploy
- Deployment Manifest
  - Customize via --properties Parameter
  - Control # of Module Instances
  - Define Target Server or Group
  - Direct Binding
  - Stream Data Partitioning

Deployment Manifest - Module Count
Stream: http | worker | hdfs
  stream deploy --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3
[Diagram. 2 http, 4 worker and 3 hdfs module instances.]

Deployment Manifest - Module Placement
Stream: http | worker | hdfs
  stream deploy --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3 module.http.criteria=group.contains('WEB')
Container started with:
  xd/bin/xd-container --groups="web"
[Diagram. The 2 http instances are placed in the WEB group; 4 worker and 3 hdfs instances.]

Deployment Manifest - Data Partitioning
Stream: http | worker | hdfs
  stream deploy --properties module.worker.count=4, module.http.producer.partitionKeyExpression=payload.customerid
  partition := hash(payload.customerid) % worker.count
[Diagram. http instances in the WEB group route each message to one of the worker partitions 0-3; 3 hdfs instances.]
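The partition formula can be tried out with a few lines of Python. This is a sketch of the arithmetic only: the slides do not say which hash function Spring XD applies, so CRC32 is used here as an assumed stand-in, and partition_for is a hypothetical helper name.

```python
import zlib


def partition_for(customer_id: str, worker_count: int) -> int:
    """partition := hash(customer_id) % worker_count, with CRC32 as the hash.

    CRC32 is an assumption for illustration; it is stable across runs,
    unlike Python's built-in hash() for strings.
    """
    if worker_count <= 0:
        raise ValueError("worker_count must be positive")
    return zlib.crc32(customer_id.encode("utf-8")) % worker_count
```

With worker_count=4 every customer id maps to a partition in 0..3, and the same id always lands on the same worker instance, which is what makes per-customer state on a worker possible.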

Applications

Spring XD - Measuring Live Usage for a Major Sports League
- Measuring live video usage through mobile applications

Spring XD - IoT Connected Car
- Journey and Range Prediction

Spring XD - Smartgrid
- ACM Distributed Event Based Systems 2014
- Scalable, Real-Time Analytics, High Volume Sensor Data
- Short-Term Load Forecasting in a Power Grid
- Sensor Data from Smart Plugs
- Stream Components
  - Sensor Data Ingestion
  - Data Aggregation
  - Load Prediction
- Demo Analytics via REST

What's next?

Roadmap - 1.2 and beyond
- Custom Modules in HDFS
- More OOTB Modules
- Web based Editor for Streams & Jobs
- Apache Ambari Support
- Security Enhancements
- Spring XD on Pivotal Cloud Foundry
- GA Release Planned for May 2015

Learn more
- Project: http://projects.spring.io/spring-xd
- GitHub: https://github.com/spring-projects/spring-xd
- Wiki: http://docs.spring.io/spring-xd/docs/current/reference/html/
- Samples: https://github.com/spring-projects/spring-xd-samples
- Modules: https://github.com/spring-projects/spring-xd-modules
- JIRA: https://jira.spring.io/browse/xd
- Stackoverflow: http://stackoverflow.com/questions/tagged/spring-xd

Spring XD - Takeaway
- Increased Productivity through out-of-the-box components
- Unified runtime for both Real-time and Batch use cases
- Scalable, Distributed and Fault Tolerant Runtime
- Closed Loop Analytics through online (stream) and offline (batch) data
- Data Ingestion, Processing, Movement, Analytics
- Swiss-army knife of data movement and data pipelines
- Repeatable turnkey solution for next generation data-centric use cases

Learn More. Stay Connected.
- Twitter: twitter.com/springcentral
- YouTube: spring.io/video
- LinkedIn: spring.io/linkedin
- Google Plus: spring.io/gplus

Backup Slides

Lambda Architecture


Lambda Architecture - Spring XD
[Diagram. Labels: Ingest, Data Lake, Batch Layer, Batch Processing, Workflow Orchestration, Batch Views, Predictive Analytics, Speed Layer, Stream Processing, Real-time Views, Serving Layer, Export, Analytics. Technologies shown: Spring XD, Gemfire, HAWQ, Spring Boot.]

Predictive Models
- Model
  - Parameterised Algorithm
- Model Building
  - Derive a parameterised algorithm from the data
  - Slow process
  - Usually large data volume -> done offline as a batch process
- Model Scoring
  - Use the model to predict new information
  - Fast process
  - Can be done as part of stream processing
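The split between model building and model scoring can be illustrated with a deliberately small example. The sketch below assumes a one-variable linear model fitted by ordinary least squares; build_model and score are hypothetical names, and nothing here is taken from Spring XD or PMML.

```python
from typing import Iterable, Iterator, List, Tuple

Model = Tuple[float, float]  # (slope, intercept): the "parameterised algorithm"


def build_model(history: List[Tuple[float, float]]) -> Model:
    """Model building: fit y = slope * x + intercept over the whole data set.

    This needs all historical (x, y) pairs, which is why it is typically
    run offline as a batch job.
    """
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in history)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in history)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def score(model: Model, xs: Iterable[float]) -> Iterator[float]:
    """Model scoring: one cheap evaluation per incoming value, stream-friendly."""
    slope, intercept = model
    for x in xs:
        yield slope * x + intercept
```

Building touches the entire history once, while scoring is a constant-time computation per message, so only the fitted parameters have to be shipped to the streaming side.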

PMML
- Predictive Model Markup Language
- Open Standard maintained by the Data Mining Group (DMG)
- XML based DSL for predictive models
- Can be interpreted
- 15 Model Types (Naive Bayes, General Regression, Neural Networks, etc.)
- First Version (1999), Current Version 4.2.1
- Lingua Franca for Predictive Models
- Bridge the Gap between Data Scientists and Engineers

Anatomy of a PMML Model
- Predictive Model
  - Algorithm description(s)
  - Parameterisation: trained model
- Pre Processing
- Post Processing
  - Transform model output
  - Thresholds / Business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.

Predictive Analytics with Spring XD
- XD Module analytic-pmml
- Introduced in Spring XD 1.0.0 M6 (April 2014)
- Real-time evaluation and scoring
- Based on JPMML-Evaluator
- Wide range of Model types
- spring-xd-modules/analytics-ml-pmml on GitHub