Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared


1 Apache Storm vs. Spark Streaming: Two Stream Processing Platforms compared
DBTA Workshop on Stream Processing, Berne, 3rd December 2014
Guido Schmutz
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA

2 Guido Schmutz
- Working for Trivadis for more than 18 years
- Oracle ACE Director for Fusion Middleware and SOA
- Co-author of different books
- Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
- Member of the Trivadis Architecture Board
- Technology Manager, Trivadis
- More than 5 years of software development experience
- Twitter: gschmutz

3 Our company
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on … and … technologies in Switzerland, Germany and Austria.
We offer our services in the following strategic business fields:
- OPERATION: Trivadis Services takes over the interacting operation of your IT systems.

4 Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures

5 What is Stream Processing?
- Infrastructure for continuous data processing
- The computational model can be as general as MapReduce, but with the ability to produce low-latency results
- Data collected continuously is naturally processed continuously
- Also known as Event Processing / Complex Event Processing (CEP)
Slide source: Einheitlicher Umgang mit Ereignisströmen (Unified handling of event streams) - Unified Log Architecture

6 Why Stream Processing?
Response latency: RPC = synchronous; Stream Processing = milliseconds to minutes; beyond that = later, possibly much later.

7 How to design a Stream Processing System?
[Diagram: three designs for turning an event stream into results: (1) a single combined collecting/processing step, (2) separate collecting and processing steps, (3) collecting and processing decoupled by a persistent queue]

8 How to scale a Stream Processing System?
[Diagram: the event stream is collected by threads 1..n, written to a persistent queue, and processed by threads 1..n that emit results]

9 How to scale a Stream Processing System?
[Diagram: several collecting processes, each running multiple threads, write to persistent queues 1..n; several processing processes read from the queues and emit results]

10 How to scale a Stream Processing System?
[Diagram: collecting processes write events to queues Q1..Qn; each queue feeds a chain of processing steps A and B, each running as its own thread or process]

11 How to make a (stateful) Stream Processing System reliable?
- Faults and stragglers are inevitable in large clusters running big data applications
- Streaming applications must recover from them quickly
[Diagram: two copies of the pipeline, event stream -> collecting processes -> queue -> processing steps A and B]

12 How to make a (stateful) Stream Processing System reliable?
Solution 1: using an active/passive system (hot replication)
- Both systems process the full load
- In case of a failure, automatically switch and use the passive system
- Stragglers slow down both the active and the passive system
[Diagram: an active and a passive copy of the pipeline, each holding its own state. Legend: State = state in-memory and/or on-disk]

13 How to make a (stateful) Stream Processing System reliable?
Solution 2: Upstream backup
- Nodes buffer sent messages and replay them to a new node in case of failure
- Stragglers are treated as failures
[Diagram: one pipeline in which upstream nodes keep a replay buffer and processing nodes keep state. Legend: buffer = buffer for replay, in-memory and/or on-disk; State = state in-memory and/or on-disk]
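The upstream-backup idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the class names `UpstreamNode` and `CountingNode`, the sequence-number scheme and the `run_demo` helper are hypothetical and not taken from the talk or from any framework.

```python
from collections import deque


class UpstreamNode:
    """Hypothetical sender that buffers every message until it is acknowledged."""

    def __init__(self):
        self.buffer = deque()  # (sequence number, message) pairs awaiting an ack
        self.next_seq = 0

    def send(self, message, downstream):
        seq = self.next_seq
        self.next_seq += 1
        self.buffer.append((seq, message))
        downstream.receive(seq, message)

    def ack(self, seq):
        # Everything up to and including seq is safely processed downstream.
        while self.buffer and self.buffer[0][0] <= seq:
            self.buffer.popleft()

    def replay(self, downstream):
        # Re-send all unacknowledged messages to a replacement node.
        for seq, message in list(self.buffer):
            downstream.receive(seq, message)


class CountingNode:
    """Hypothetical stateful node: keeps a running total in memory."""

    def __init__(self):
        self.total = 0
        self.last_seq = -1

    def receive(self, seq, message):
        if seq <= self.last_seq:  # already applied, ignore the duplicate
            return
        self.total += message
        self.last_seq = seq


def run_demo():
    upstream = UpstreamNode()
    node = CountingNode()
    for value in [1, 2, 3]:
        upstream.send(value, node)
    # Assume the node fails before acknowledging anything: its state is lost.
    replacement = CountingNode()
    upstream.replay(replacement)
    return replacement.total
```

The replacement node rebuilds its state purely from the replayed buffer, which is why the buffer may only be trimmed once the downstream state is known to be safe.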

14 Processing Models
Batch Processing
- Familiar concept of processing data en masse
- Generally incurs high latency
(Event-) Stream Processing
- A one-at-a-time processing model
- A datum is processed as it arrives
- Sub-second latency
- Difficult to process state data efficiently
Micro-Batching
- A special case of batch processing with very small (tiny) batch sizes
- A nice mix between batching and streaming
- Comes at the cost of latency
- Gives stateful computation, making windowing an easy task
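To make the difference between one-at-a-time processing and micro-batching concrete, here is a minimal sketch in plain Python. The function names and the millisecond timestamps are assumptions chosen for illustration, not part of Storm or Spark.

```python
from collections import defaultdict


def process_one_at_a_time(events):
    """Event-stream model: a result is produced for every arriving event."""
    count = 0
    results = []
    for _timestamp_ms, _value in events:
        count += 1
        results.append(count)  # running count after each event
    return results


def micro_batches(events, batch_interval_ms):
    """Micro-batch model: group events into consecutive fixed-length intervals."""
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        batches[timestamp_ms // batch_interval_ms].append(value)
    return [batches[index] for index in sorted(batches)]


# Hypothetical events as (timestamp in ms, payload) pairs.
EVENTS = [(100, "a"), (400, "b"), (600, "c"), (1200, "d")]
```

With a 500 ms interval, `micro_batches(EVENTS, 500)` groups the first two events together, so a result for "a" is only available once its batch closes. That delay is the latency cost named on the slide.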

15 Message Delivery Semantics
At most once [0,1]
- Messages may be lost
- Messages are never redelivered
At least once [1..n]
- Messages will never be lost
- Messages may be redelivered (might be OK if the consumer can handle it)
Exactly once [1]
- Messages are never lost
- Messages are never redelivered
- Perfect message delivery
- Incurs higher latency for transactional semantics
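The remark that redelivery "might be OK if the consumer can handle it" can be illustrated with an idempotent consumer. The sketch below simulates at-least-once delivery in plain Python; the message ids, the `deliver_at_least_once` generator and the `IdempotentCounter` class are hypothetical names used only for this example.

```python
def deliver_at_least_once(messages, redeliver_ids):
    """Yield every (msg_id, payload); ids in redeliver_ids arrive twice."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in redeliver_ids:
            # Simulated retry, e.g. after a lost acknowledgement.
            yield msg_id, payload


class IdempotentCounter:
    """Consumer that remembers processed ids so duplicates have no effect."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return False  # duplicate, skipped
        self.seen.add(msg_id)
        self.total += payload
        return True


MESSAGES = [(1, 10), (2, 20), (3, 30)]


def naive_total(redeliver_ids):
    # A consumer without deduplication counts redelivered messages twice.
    return sum(p for _id, p in deliver_at_least_once(MESSAGES, redeliver_ids))


def idempotent_total(redeliver_ids):
    counter = IdempotentCounter()
    for msg_id, payload in deliver_at_least_once(MESSAGES, redeliver_ids):
        counter.handle(msg_id, payload)
    return counter.total
```

Deduplicating on the consumer side gives the effect of exactly-once processing on top of at-least-once delivery, at the price of remembering which ids were already handled.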

16 Requirements dictate the choice
- Latency: is performance of the streaming application paramount?
- Development cost: is it desired to have similar code bases for batch and stream processing (=> lambda architecture)?
- Message delivery guarantees: is there high importance on processing every single record, or is some normal amount of data loss acceptable?
- Process fault tolerance: is high availability of primary concern?

17 Agenda: 1. Introduction, 2. Apache Storm, 3. Apache Spark (Streaming), 4. Unified Log, 5. Stream Processing Architectures

18 Apache Storm
- A platform for doing analysis on streams of data as they come in, so you can react to data as it happens
- A highly distributed real-time computation system
- Provides general primitives to do real-time computation
- Simplifies working with queues and workers
- Scalable and fault-tolerant
- Complementary to Hadoop
- Written in Clojure, supports Java and Clojure
- Originated at BackType, which was acquired by Twitter in 2011
- Open sourced in late 2011
- Part of the Apache Incubator since September 2013

19 Apache Storm Core concepts
Tuple
- Core data structure in Storm
- Immutable set of key/value pairs
- You can think of Storm tuples as events
- Values must be serializable
Stream
- Key abstraction of Storm
- An unbounded sequence of tuples that can be processed in parallel by Storm
- Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID
- Each stream also has an associated schema of the tuples that will flow through it

20 Apache Storm Core concepts
Topology
- Wires data and functions via a DAG (directed acyclic graph)
- Executes on many machines, similar to an MR job in Hadoop
Spout
- Source of data streams (tuples)
- Can be run in reliable and unreliable mode
Bolt
- Consumes 1+ streams and potentially produces new streams
- Complex operations often require multiple steps and thus multiple bolts
- Calculate, filter, aggregate, join, talk to a database
[Diagram: two spouts act as the sources of streams A and B; one bolt subscribes to A and emits C, one bolt subscribes to A and emits D, one bolt subscribes to A & B and emits nothing, one bolt subscribes to C & D and emits nothing]

21 Storm: How does it work?
Example tweet: "NFL: Peyton Manning and Denver's elite offense fall flat in #Superbowl XLVIII ow.ly/tdqzn #seahawks #broncos #Superbowl"
[Diagram: Twitter Spout (filter #Superbowl) -> two Split Sentence bolts -> two Word Count bolts; the words Peyton, Superbowl, Superbowl, NFL, Manning, ... flow to the counters]
Slide source: CAS Big Data - FH Bern, Stream- and Event-Processing, Event Streams - Apache Storm

22 Storm: How does it work?
[Diagram, continued: each Word Count bolt increments its counters as words arrive: INCR Peyton (Peyton = 1), INCR Superbowl (Superbowl = 1), INCR Superbowl, INCR NFL (NFL = 1), INCR Manning (Manning = 1)]

23 Storm: How does it work?
[Diagram, continued: the Word Count bolts forward their counts to a Report bolt, which shows Peyton = 1, Superbowl = 2, NFL = 1, Manning = 1]
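The word-count flow on these slides can be mimicked with chained Python generators. This is a single-process simulation under assumptions, not Storm code: the function names are hypothetical stand-ins for the spout and the Split Sentence, Word Count and Report steps.

```python
from collections import Counter


def sentence_spout(sentences):
    """Stand-in for a spout: emits one sentence tuple at a time."""
    for sentence in sentences:
        yield sentence


def split_sentence_bolt(stream):
    """Emits one word tuple for every word of each incoming sentence."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def word_count_bolt(stream):
    """Keeps a running count per word and emits (word, count) on every update."""
    counts = Counter()
    for word in stream:
        counts[word] += 1  # corresponds to the INCR step on the slide
        yield word, counts[word]


def run_topology(sentences):
    """Report step: remembers the latest count seen for every word."""
    report = {}
    pipeline = word_count_bolt(split_sentence_bolt(sentence_spout(sentences)))
    for word, count in pipeline:
        report[word] = count
    return report
```

In a real topology each step would run as several parallel tasks on different machines; the generator chain only shows the order in which tuples flow.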

24 Storm - Topology
[Diagram: Twitter Spout -(shuffle)-> Split Sentence bolts -(fields)-> Word Count bolts -(global)-> Report bolt]
Each spout or bolt runs N instances in parallel. Stream groupings:
- Shuffle grouping: random grouping
- Fields grouping: grouped by value, such that equal values end up in the same task
- All grouping: replicates to all tasks
- Global grouping: makes all tuples go to one task
- None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
- Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple
- Local or Shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
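How a grouping picks the target task can be sketched as plain functions. This is a simplified model under assumptions: the CRC32-modulo rule below is a stand-in and is not the hash Storm itself uses, and all function names are hypothetical.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng):
    """Random task: spreads tuples evenly regardless of their content."""
    return rng.randrange(num_tasks)


def fields_grouping(field_value, num_tasks):
    """Equal field values always map to the same task (assumed hash: CRC32)."""
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


def all_grouping(num_tasks):
    """Every task receives a copy of the tuple."""
    return list(range(num_tasks))


def global_grouping():
    """All tuples go to one single task (here: task 0)."""
    return 0


def route_words(words, num_tasks):
    """Show which task would count each word under fields grouping."""
    return {word: fields_grouping(word, num_tasks) for word in words}
```

Fields grouping is what makes the word count correct: every occurrence of "Superbowl" reaches the same counter task, so no two tasks hold partial counts for the same word.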

25 Storm - Creating Topology

26 Using a NoSQL database for storing results (keeping state with counter type columns)
[Diagram: Twitter stream -> Twitter Spout (filter #Superbowl) -> Hashtag Splitter bolts -> Hashtag Counter bolts; the hashtags superbowl, superbowl, seahawks, broncos of the example tweet become INCR operations on counter columns in the database]

27 Storm Trident
- High-level abstraction on top of Storm
- Simplifies building topologies
- Core data model is the stream
  - Processed as a series of batches (micro-batches)
  - Stream is partitioned among the nodes in the cluster
- 5 kinds of operations in Trident
  - Operations that apply locally to each partition and cause no network transfer
  - Repartitioning operations that don't change the contents
  - Aggregation operations that do network transfer
  - Operations on grouped streams
  - Merges and joins

28 Storm Trident - Creating Topology
[Diagram: Twitter stream -> Twitter Spout -(tweet)-> Hashtag Splitter -(hashtag, local)-> Hashtag Normalizer -(hashtag, groupBy)-> Persistent Aggregate]

29 Trident Concepts - Function
- Takes in a set of input fields and emits zero or more tuples as output
- Fields of the output tuple are appended to the original input tuple in the stream
- If a function emits no tuples, the original input tuple is filtered out
- Otherwise the input tuple is duplicated for each output tuple
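The function semantics described above (append output fields, filter on zero emits, duplicate on several emits) can be modelled with dictionaries as tuples. This is a plain-Python sketch, not the Trident API; `apply_function` and `split_hashtags` are hypothetical names.

```python
def apply_function(tuples, input_fields, function, output_fields):
    """Apply `function` to the selected input fields of every tuple.

    Each emitted value list is appended to a copy of the input tuple, so a
    tuple with no emits disappears and a tuple with k emits appears k times.
    """
    result = []
    for tup in tuples:
        args = [tup[field] for field in input_fields]
        for emitted in function(*args):
            out = dict(tup)  # keep the original input fields
            out.update(zip(output_fields, emitted))
            result.append(out)
    return result


def split_hashtags(text):
    """Emit one single-field tuple per hashtag found in the text."""
    return [(word.lstrip("#").lower(),) for word in text.split()
            if word.startswith("#")]


TWEETS = [
    {"tweet": "go #Seahawks #broncos"},
    {"tweet": "no tags here"},
]
```

Calling `apply_function(TWEETS, ["tweet"], split_hashtags, ["hashtag"])` yields two tuples for the first tweet and none for the second, which mirrors the duplicate and filter rules on the slide.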

30 Storm Core vs. Storm Trident
                    | Core Storm                             | Storm Trident
Community           | > 100 contributors                     | > 100 contributors
Adoption            | ***                                    | *
Language Options    | Java, Clojure, Scala, Python, Ruby, ...| Java, Clojure, Scala
Processing Models   | Event-Streaming                        | Micro-Batching
DSL                 | No                                     | Yes
Stateful Ops        | No                                     | Yes
Distributed RPC     | Yes                                    | Yes
Delivery Guarantees | At most once / At least once           | Exactly once
Latency             | sub-second                             | seconds
Platform            | Storm Cluster, YARN                    | Storm Cluster, YARN

31 Agenda: 1. Introduction, 2. Apache Storm, 3. Apache Spark (Streaming), 4. Unified Log, 5. Stream Processing Architectures

32 Apache Spark
- Apache Spark is a fast and general engine for large-scale data processing
- The hot trend in Big Data!
- Based on the 2007 Microsoft Dryad paper
- Written in Scala, supports Java, Python, SQL and R
- Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
- Runs everywhere: on Hadoop, Mesos, standalone or in the cloud
- One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
- Originally developed in 2009 in UC Berkeley's AMPLab
- Open sourced in 2010, part of the Apache Software Foundation since 2014

33 Apache Spark
Spark Core
- General execution engine for the Spark platform
- In-memory computing capabilities deliver speed
- General execution model supports a wide variety of use cases (DAG-based)
- Ease of development: native APIs in Java, Scala and Python
Spark Streaming
- Runs a streaming computation as a series of very small, deterministic batch jobs
- Batch size as low as ½ sec, latency of about 1 sec
- Exactly-once semantics
- Potential for combining batch and streaming processing in the same system
- Started in 2012, first alpha release in 2013

34 Apache Spark - Generality
- Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
- Core Runtime: Spark Core API and Execution Model
- Cluster Resource Managers: Spark Standalone, Mesos, YARN
- Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
Adapted from C. Fregly

35 Apache Spark Core concepts
Resilient Distributed Dataset (RDD)
- Core Spark abstraction
- Collections of objects (partitions) spread across the cluster
- Partitions can be stored in-memory or on-disk (local)
- Enables parallel processing on data sets
- Built through parallel transformations
- Immutable, recomputable, fault tolerant
- Contains the transformation history ("lineage") for the whole data set
Operations
- Stateless transformations (map, filter, groupBy)
- Actions (count, collect, save)

36 RDD Lineage Example
[Diagram: transformations are lazy, an action executes them.
Input 1: HDFS file -> SparkContext.hadoopFile() -> HadoopRDD -> filter() -> FilteredRDD -> map() -> MappedRDD
Input 2: HDFS file -> SparkContext.hadoopFile() -> HadoopRDD -> map() -> MappedRDD
Both inputs -> join() -> ShuffledRDD -> saveAsHadoopFile() (action) -> HDFS file output]
Adapted from Chris Fregly
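The combination of lazy transformations, a recorded lineage and actions that trigger the work can be imitated in a few lines. The `LazyDataset` class below is a hypothetical toy model in plain Python, not the Spark API, and it ignores partitioning and distribution entirely.

```python
class LazyDataset:
    """Toy model of an immutable dataset that records its lineage."""

    def __init__(self, source, lineage=()):
        self._source = source            # zero-argument callable yielding records
        self.lineage = tuple(lineage)    # recorded (operation name, function) pairs

    def map(self, fn):
        # Transformation: returns a new dataset, nothing is computed yet.
        return LazyDataset(self._source, self.lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self.lineage + (("filter", predicate),))

    def _compute(self):
        # Replay the whole lineage from the source. The same replay could
        # rebuild lost data after a failure.
        records = list(self._source())
        for name, fn in self.lineage:
            if name == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records

    def collect(self):
        # Action: triggers the computation.
        return self._compute()

    def count(self):
        return len(self._compute())
```

Because every dataset is immutable and knows how it was derived, nothing has to be replicated for fault tolerance in this model: a lost result is simply recomputed from its lineage.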

37 RDD Execution Example
[Diagram: FileRDDs with several partitions each are combined through groupByKey() and join() into ShuffledRDDs, which repartition the data; filter() and map() turn a FileRDD into a MappedRDD partition by partition without a shuffle]

38 Apache Spark Streaming Core concepts
Discretized Stream (DStream)
- Core Spark Streaming abstraction
- Micro-batches of RDDs
- Operations similar to RDD
Input DStreams
- Represent the stream of raw data received from streaming sources
- Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP socket, Akka actors, etc.
- Custom sources can be easily written for custom data sources
Operations
- Same as Spark Core
- Additional stateful transformations (window, reduceByWindow)

39 Discretized Stream (DStream)
[Diagram: along an increasing time axis (time 1, time 2, time 3, ..., time n) the input stream of messages is cut into batches. Each batch of the DStream holds message 1 .. message n; map() produces a MappedDStream whose batches hold f(message 1) .. f(message n). Actions such as saveAsHadoopFiles() trigger one Spark job per batch, each producing result 1 .. result n]
Adapted from Chris Fregly
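A DStream can be pictured as a list of batches, with per-batch operations and windowed operations layered on top. The sketch below is plain Python under assumptions: `map_batches` and `reduce_by_window` are hypothetical helpers, and window length and slide interval are counted in batches rather than seconds.

```python
import functools


def map_batches(batches, fn):
    """Per-batch transformation: apply fn to every message of every batch."""
    return [[fn(message) for message in batch] for batch in batches]


def reduce_by_window(batches, reduce_fn, window_length, slide_interval):
    """Reduce all messages of the last `window_length` batches,
    re-evaluated every `slide_interval` batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        messages = [m for batch in window for m in batch]
        if messages:  # skip windows that contain no data
            results.append(functools.reduce(reduce_fn, messages))
    return results


# Four hypothetical micro-batches of numeric messages.
BATCHES = [[1, 2], [3], [4, 5], [6]]
```

Because each batch is an ordinary immutable collection, a windowed result is just a computation over a few neighbouring batches, which is why windowing is straightforward in the micro-batch model.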

40 Spark Streaming Example

41 Storm Core vs. Storm Trident vs. Spark Streaming
                    | Core Storm                              | Storm Trident        | Spark Streaming
Community           | > 100 contributors                      | > 100 contributors   | > 80 contributors
Adoption            | ***                                     | *                    | *
Language Options    | Java, Clojure, Scala, Python, Ruby, ... | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing Models   | Event-Streaming                         | Micro-Batching       | Micro-Batching, Batch (Spark Core)
DSL                 | No                                      | Yes                  | Yes
Stateful Ops        | No                                      | Yes                  | Yes
Distributed RPC     | Yes                                     | Yes                  | No
Delivery Guarantees | At most once / At least once            | Exactly once         | Exactly once
Latency             | sub-second                              | seconds              | seconds
Platform            | Storm Cluster, YARN                     | Storm Cluster, YARN  | YARN, Mesos, Standalone, DataStax EE

42 Agenda: 1. Introduction, 2. Apache Storm, 3. Apache Spark (Streaming), 4. Unified Log, 5. Stream Processing Architectures

43 Unified Log
That's what most people think about logs:
[0/Jul/01:13::6-0800] "GET /wp-admin/images/date-button.gif HTTP/1.1"
[0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver= HTTP/1.1"
[0/Jul/01:13::6-0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver= HTTP/1.1"
[0/Jul/01:13::8-0800] "POST /wp-admin/admin-ajax.php HTTP/1.1"
[0/Jul/01:13:: ] "POST /wp-admin/post.php HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1"
[0/Jul/01:13:: ] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver= HTTP/1.1"
[0/Jul/01:13:: ] "POST /wp-admin/admin-ajax.php HTTP/1.1"
But this is what we mean here by log:
- A structured log (records are numbered beginning with 0, based on the order they are written)
- Also known as commit log or journal
[Diagram: a sequence of numbered records, from the 1st record on the left to the position where the next record is written on the right]

44 Central Unified Log for (real-time) subscription
Take all the organization's data and put it into a central log for subscription.
Properties of the Unified Log:
- Unified: enterprise-wide, single deployment
- Append-only: events are appended, no update in place => immutable
- Ordered: each event has an offset, which is unique within a shard
- Fast: should be able to handle thousands of messages / sec
- Distributed: lives on a cluster of machines
[Diagram: a collector writes to the log; consumer system A reads at position time = 6, consumer system B reads at position time = 10]
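The append-only, ordered log with independent readers can be modelled with a list and one position per consumer. This is a single-machine sketch in plain Python; the `UnifiedLog` and `Consumer` classes are hypothetical and leave out sharding and distribution.

```python
class UnifiedLog:
    """Append-only sequence of events; the list index serves as the offset."""

    def __init__(self):
        self._records = []

    def append(self, event):
        self._records.append(event)      # no update in place, only appends
        return len(self._records) - 1    # offset of the new record

    def read(self, offset, max_records=None):
        end = len(self._records) if max_records is None else offset + max_records
        return self._records[offset:end]


class Consumer:
    """Each consumer system tracks its own read position in the shared log."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_records=None):
        records = self.log.read(self.position, max_records)
        self.position += len(records)
        return records


def demo():
    log = UnifiedLog()
    for i in range(10):
        log.append("event-%d" % i)
    system_a, system_b = Consumer(log), Consumer(log)
    system_a.poll(max_records=6)  # the slower system has read up to offset 6
    system_b.poll()               # the faster system has read all 10 records
    return system_a.position, system_b.position
```

Since reading never changes the log, a slow consumer does not hold back a fast one, and a new consumer can start from offset 0 at any time.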

45 Apache Kafka - Overview
- A distributed publish-subscribe messaging system
- Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, ...)
- Initially developed at LinkedIn, now part of Apache
- Does not follow JMS standards and does not use the JMS API
- Kafka maintains feeds of messages in topics
[Diagram: producers write to the Kafka cluster, consumers read from it. Anatomy of a topic: partitions 0, 1 and 2, each an ordered sequence from old to new; writes are appended at the new end]

46 Apache Kafka - Motivation
LinkedIn's motivation for Kafka was: "A unified platform for handling all the real-time data feeds a large company might have."
Must haves:
- High throughput to support high-volume event feeds
- Support real-time processing of these feeds to create new, derived feeds
- Support large data backlogs to handle periodic ingestion from offline systems
- Support low-latency delivery to handle more traditional messaging use cases
- Guarantee fault tolerance in the presence of machine failures

47 Apache Kafka - Performance
Kafka at LinkedIn
- 10+ billion writes per day
- 172k messages per second (average)
- 55+ billion messages per day to real-time consumers
Up to 2 million writes/sec on 3 cheap machines
- Using 3 producers on 3 different machines

48 Apache Kafka - Partition offsets
- Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
- Consumers track their pointers via (offset, partition, topic) tuples
[Diagram: consumer group C1 reading from the partitions of a topic]
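Per-partition offsets and consumer-side position tracking can be sketched as follows. This is a toy model in plain Python, not the Kafka client API: the `Topic` and `GroupOffsets` classes are hypothetical, and the CRC32-modulo partitioner is an assumption standing in for whatever partitioner a real producer uses.

```python
import zlib


class Topic:
    """A topic made of partitions; each partition is its own ordered log."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Assumed partitioner: messages with the same key share a partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1  # sequential per partition
        return partition, offset


class GroupOffsets:
    """Read positions of one consumer group, keyed by (topic, partition)."""

    def __init__(self):
        self.positions = {}

    def fetch(self, topic, partition):
        key = (topic.name, partition)
        start = self.positions.get(key, 0)
        records = topic.partitions[partition][start:]
        self.positions[key] = start + len(records)  # advance the pointer
        return records
```

Offsets are only meaningful within one partition, which is why the consumer's pointer has to name the topic and the partition alongside the offset.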

49 Apache Kafka - Two options for log cleanup
Retaining a window of data
- Ideal for event data
- Window can be defined in time (days) or space (GBs); defaults to 1 week
Retaining a complete log (log compaction)
- Ideal for keyed data
- Keeps a space-efficient complete log of changes
- Log compaction runs in the background
- Ensures that at least the last known value for each message key within the log of data is always retained
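Both cleanup strategies can be expressed as small functions over a list of records. This is an in-memory sketch under assumptions, not how Kafka implements cleanup: records are hypothetical `(timestamp, key, value)` triples and the function names are made up for this example.

```python
def retain_window(log, now, max_age):
    """Time-based retention: keep only records not older than max_age."""
    return [record for record in log if now - record[0] <= max_age]


def compact(log):
    """Log compaction: keep the most recent record for every key,
    preserving the original order of the surviving records."""
    last_index = {}
    for index, (_timestamp, key, _value) in enumerate(log):
        last_index[key] = index
    return [record for index, record in enumerate(log)
            if last_index[record[1]] == index]


# Hypothetical keyed records: (timestamp, key, value).
LOG = [
    (1, "meter-a", 10),
    (2, "meter-b", 20),
    (3, "meter-a", 15),
]
```

Retention by window discards old events regardless of key, while compaction keeps the latest state of every key regardless of age. That is why the first suits event data and the second suits keyed data.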

50 Data Flow Graphs using Unified Log
- Stream processing allows for computing feeds off of other feeds
- Derived feeds are no different from the original feeds they are computed off
- Single deployment of the Unified Log, but logically different feeds
[Diagram: a collector writes meter readings to the feed "Raw Meter Readings". An enrich/transform step joins it with customer data into "Meter with Customer"; aggregate-by-minute steps derive "Meter by Minute" and "Meter by Customer by Minute"; persist steps store the raw and the aggregated feeds]

51 Agenda: 1. Introduction, 2. Apache Storm, 3. Apache Spark (Streaming), 4. Unified Log, 5. Stream Processing Architectures, 6. Summary

52 Architectural Pattern: Standalone Event Stream Processing
[Diagram: social media streams and the Internet of Things form the event cloud; events enter through the Enterprise Event Bus (ingress) into Event Processing (ESP / CEP), which uses rules from a Business Rule Management System and a State Store / Event Store, and writes to a Result Store; results reach analytical applications, databases, the Enterprise Event Bus and the Enterprise Service Bus]

53 Architectural Pattern: Event Stream Processing as part of Lambda Architecture
[Diagram: events from social media streams and the Internet of Things enter through the Enterprise Event Bus (ingress). One path goes to the Hadoop Big Data infrastructure (HDFS, Map/Reduce) and its Result Store; the other goes to Event Processing (ESP / CEP) with its State Store / Event Store and Result Store. Both result stores serve analytical applications, databases, the Enterprise Event Bus and the Enterprise Service Bus]

54 Architectural Pattern: Event Stream Processing as part of Kappa Architecture
[Diagram: events from social media streams and the Internet of Things enter through the Enterprise Event Bus (ingress) into Event Processing (ESP / CEP) with its State Store / Event Store; events are also kept in the Hadoop Big Data infrastructure (HDFS) and can be replayed from there into the same processing path; a single Result Store serves analytical applications, databases and the Enterprise Service Bus]

55 Questions and answers...
Guido Schmutz, Technology Manager
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA

Big Data Architecture

Big Data Architecture Big Architecture Guido Schmutz BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH Guido Schmutz Working for Trivadis for more than

More information

Real-time Big Data Analytics with Storm

Real-time Big Data Analytics with Storm Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap

More information

Hadoop vs Apache Spark

Hadoop vs Apache Spark Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com Introduction Any sufficiently advanced technology is indistinguishable from magic. said Arthur C. Clark. Big data technologies

More information

Beyond Hadoop with Apache Spark and BDAS

Beyond Hadoop with Apache Spark and BDAS Beyond Hadoop with Apache Spark and BDAS Khanderao Kand Principal Technologist, Guavus 12 April GITPRO World 2014 Palo Alto, CA Credit: Some stajsjcs and content came from presentajons from publicly shared

More information

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

More information

Streaming items through a cluster with Spark Streaming

Streaming items through a cluster with Spark Streaming Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Apache Spark. Christopher Homa. October 11, Apache Spark is an open source cluster computing framework.

Apache Spark. Christopher Homa. October 11, Apache Spark is an open source cluster computing framework. Apache Spark Christopher Homa October 11, 2016 Overview Apache Spark is an open source cluster computing framework. Initially developed at UC Berkeley s AMPLab in 2009, Spark was donated to Apache and

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform

More information

Architectures for massive data management

Architectures for massive data management Architectures for massive data management Apache Spark Albert Bifet albert.bifet@telecom-paristech.fr October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache

More information

SPARK USE CASE IN TELCO. Apache Spark Night 9-2-2014! Chance Coble!

SPARK USE CASE IN TELCO. Apache Spark Night 9-2-2014! Chance Coble! SPARK USE CASE IN TELCO Apache Spark Night 9-2-2014! Chance Coble! Use Case Profile Telecommunications company Shared business problems/pain Scalable analytics infrastructure is a problem Pushing infrastructure

More information

Real Time Data Processing using Spark Streaming

Real Time Data Processing using Spark Streaming Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island Big Data Principles and best practices of scalable real-time data systems NATHAN MARZ JAMES WARREN II MANNING Shelter Island contents preface xiii acknowledgments xv about this book xviii ~1 Anew paradigm

More information

Spark: Making Big Data Interactive & Real-Time

Spark: Making Big Data Interactive & Real-Time Spark: Making Big Data Interactive & Real-Time Matei Zaharia UC Berkeley / MIT www.spark-project.org What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop Improves efficiency

More information
