Future Internet Technologies

Future Internet Technologies: Big (?) Data Processing
Dr. Dennis Pfisterer, Institut für Telematik, Universität zu Lübeck

FIT Until Now
- Architectures: Client/Server, P2P, HTTP, REST, CoAP, SPDY, Big Data Processing
- Networking: Design Principles, SCTP, IPv4, IPv6, LISP, 6LoWPAN, Frame Relay, ATM, GSM, HIP, DTN, Mobile IP, MPLS

Security - 04 Cryptology

Traditional Application-Level Architectures
- Client/Server
  - Remote Procedure Call (RPC)
  - REST (HTTP, CoAP, SPDY)
  - Replication and Load Balancing
- Distributed Applications
  - N-Tier Architectures
  - Service-Oriented Architecture (SOA)
- Peer-to-Peer
  - Distributed Hash Tables (DHT), ...

Client/Server Interaction Model
- No server-initiated data transfer
  - Often due to the statelessness of the server
- Results in unnecessary polling to transfer data from server to client
  - Bandwidth, processing, and memory overhead
[Figure: the client sends a request to the server; the server returns a response]

3- and 4-Tier Architectures: 3-Tier
- Tier 1: Presentation (client)
- Tier 2: Business Logic (server)
- Tier 3: Data (DB server)

3- and 4-Tier Architectures: 4-Tier
- Tier 1: Presentation (client)
- Tier 2: Web Server
- Tier 3: Application Server (App Server 1, App Server 2)
- Tier 4: Data (DB servers)

Replication and Load Balancing
- Application and data are replicated on many servers
- A load balancer distributes load across, e.g., LAMP hosts (Linux, Apache, MySQL, PHP)
- Bottleneck: the database
  - May be clustered (i.e., replicated and/or partitioned) to increase performance
  - Problem: consistency requirements (ACID) of SQL databases limit scalability
[Figure: a load balancer in front of servers #1 to #n, each running an application instance, with shared state behind them]

Distributed Applications
- Distribute the application (including its state) amongst different servers
  - In contrast to multiple instances on multiple servers
- Typically, the application state is partitioned (e.g., Google BigTable)
- Better scalability for a certain class of applications
  - More scalable if less strict consistency requirements are imposed (NoSQL)
  - E.g., Facebook, Twitter, ...
[Figure: one application instance spread over servers #1 to #n, with its state partitioned over servers #1 to #n]

Data Locality
- For Internet-scale applications, state is distributed globally
- Data are typically moved to where usage is expected
  - E.g., German videos are placed in Germany

Distributed Applications
- Today's applications are required to perform efficiently on an Internet scale
- Scalability without changing the application

Exemplary Application
[Figure labels: Ad Delivery, Users, User Behavior Mining, Web GUI, Friendship Graph, Application Log Analysis, Statistics]

Internet-Scale Applications
- It's not about servers anymore
  - Data and data processing services are important
  - Issue: efficient distribution, processing, and storage of data
- Move away from fixed (n-tier) layering models
  - Dynamic composition of processing services (at runtime; continuous deployment)
  - Communication using messages
- Required: concepts for loosely coupled distributed applications

Message Queuing & Publish/Subscribe

Message Queuing
- Concept of message queues: processing services communicate asynchronously using message queues
- Decouples data producers and consumers
[Figure: a message queue holding messages M1 and M2 between producer and consumer]

Message Queuing: Mode of Operation
- Producers dispatch messages into a queue
- Consumers take messages from a queue
- Multiple queues can be created for different purposes
  - E.g., one for dispatching computational tasks, one for logging, ...
- Producers, consumers, and queues may run on different systems
[Figure: producer on host #1, message queue (M1, M2) on host #q, consumer on host #n]

Message Queuing: Properties
- Distributed realization of the producer/consumer problem
  - I.e., thread synchronization
- Producer and consumer are only aware of the queue
  - The association can be changed (dynamically) without changing the application
- Producer and consumer don't need to be available at the same time
  - The queue provides time decoupling
  - Allows asynchronous operation
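The time decoupling described for message queues can be sketched with Python's standard `queue` module. This is only a single-process illustration, not a distributed queue: the function names and the `None` end-of-stream marker are assumptions made for this example.

```python
import queue
import threading


def producer(q, messages):
    """Dispatch messages into the queue without knowing who consumes them."""
    for message in messages:
        q.put(message)
    q.put(None)  # hypothetical end-of-stream marker, used only in this demo


def consumer(q, received):
    """Take messages from the queue until the end-of-stream marker arrives."""
    while True:
        message = q.get()
        if message is None:
            break
        received.append(message)


def run_demo():
    q = queue.Queue()
    received = []
    # Time decoupling: the producer has already finished
    # before the consumer is even started.
    p = threading.Thread(target=producer, args=(q, ["M1", "M2"]))
    p.start()
    p.join()
    c = threading.Thread(target=consumer, args=(q, received))
    c.start()
    c.join()
    return received
```

Producer and consumer share only the queue object, so either side could be replaced without touching the other.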

Message Queuing: Properties
- Queues can have multiple consumers
  - E.g., to distribute messages (e.g., tasks) to a set of consumers
  - Sometimes called work queues
- Different message dispatching strategies are possible (e.g., round robin, fairness, load-based, ...)
[Figure: a producer fills a message queue with tasks T1 to T4, which are distributed to workers #1 to #n]

Delivery Semantics
- Best-effort
  - Messages are lost if no consumer is listening
- Guaranteed (acknowledged) delivery
  - On application layer (cf. DTN custody transfer)
- Persistent
  - Messages are persisted until consumed
- Transient (not persistent)
  - Messages are lost if the queue crashes
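A minimal sketch of two ideas from the slides above: round-robin dispatching in a work queue, and acknowledged delivery where a message stays with the queue until the consumer confirms it. The class and function names are hypothetical and do not mirror any particular broker's API.

```python
from collections import deque
from itertools import cycle


def dispatch_round_robin(tasks, workers):
    """Assign tasks to workers in turn (one possible dispatching strategy)."""
    assignment = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        assignment[worker].append(task)
    return assignment


class AckQueue:
    """In-memory queue that keeps delivered messages until they are acknowledged."""

    def __init__(self):
        self._pending = deque()
        self._unacked = {}
        self._next_tag = 0

    def publish(self, message):
        self._pending.append(message)

    def get(self):
        """Deliver the next message together with a tag used for the ack."""
        message = self._pending.popleft()
        tag = self._next_tag
        self._next_tag += 1
        self._unacked[tag] = message
        return tag, message

    def ack(self, tag):
        """The consumer confirms processing; the message can be forgotten."""
        del self._unacked[tag]

    def requeue_unacked(self):
        """Simulate a consumer crash: unacknowledged messages are delivered again."""
        for tag in sorted(self._unacked, reverse=True):
            self._pending.appendleft(self._unacked.pop(tag))
```

With best-effort delivery the `_unacked` bookkeeping would simply be absent, and a crashed consumer would lose the message.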

Message-Based RPC
- RPC is mapped to request and response messages
- Requests and responses are correlated by IDs
- Inherits delivery semantics from message queuing
[Figure: the client puts a request with id = 123 into the request queue; the server puts the response with id = 123 into the response queue]

Publish/Subscribe Pattern (Pub/Sub)
- Builds on top of individual message queues
  - Publishers publish messages to an exchange
  - Consumers subscribe to an exchange
  - Exchanges route messages to queues based on subscriptions
  - A broker manages a set of exchanges and queues
- Distribution of published messages to subscribers
  - Broadcast, topic-based, or content-based
[Figure: within a broker, a producer publishes M1 and M2 to an exchange, which broadcasts them to the queues of consumers #1 to #n]
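The ID-based correlation used by message-based RPC can be sketched as follows. The message layout (`id`, `method`, `args`, `result`) and all function names are assumptions for illustration; real systems define their own formats.

```python
import queue
import uuid


def send_request(request_queue, method, args):
    """Client side: enqueue a request and remember its correlation ID."""
    request_id = str(uuid.uuid4())
    request_queue.put({"id": request_id, "method": method, "args": args})
    return request_id


def serve_one(request_queue, response_queue, handlers):
    """Server side: process one request and copy its ID into the response."""
    request = request_queue.get()
    result = handlers[request["method"]](*request["args"])
    response_queue.put({"id": request["id"], "result": result})


def collect_responses(response_queue):
    """Client side: index the responses that have arrived by correlation ID."""
    responses = {}
    while not response_queue.empty():
        response = response_queue.get()
        responses[response["id"]] = response["result"]
    return responses


def demo():
    request_queue, response_queue = queue.Queue(), queue.Queue()
    handlers = {"add": lambda a, b: a + b}
    first = send_request(request_queue, "add", (1, 2))
    second = send_request(request_queue, "add", (10, 20))
    serve_one(request_queue, response_queue, handlers)
    serve_one(request_queue, response_queue, handlers)
    responses = collect_responses(response_queue)
    return responses[first], responses[second]
```

Because the client matches responses by ID rather than by arrival order, responses may arrive out of order or from different server instances.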

Topic-Based Publish/Subscribe
- Consumers subscribe to an exchange with a certain topic
- Messages are annotated with a topic
  - Topics are arbitrary strings
- The exchange matches subscriptions against message topics and routes messages to the corresponding queues
[Figure: a producer publishes M1 (topic news.germany) and M2 (topic news.greece); the exchange routes them to the queues of consumers #1 to #n according to their subscriptions]

Content-Based Publish/Subscribe
- Consumers subscribe to an exchange with content-based filters
- The exchange routes messages based on a message's contents
  - Typically structured data (e.g., JSON, XML)
- Most production systems use topic-based pub/sub
[Figure: a producer publishes M1: { company: apple, price: 123$ }; consumer filters are "company = apple && 110 < price < ..." and "company = apple && price > 120"]
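Both routing styles can be sketched with one toy exchange that stores a predicate per subscription. The `Exchange` class is hypothetical, and the wildcard rule (`*` matches exactly one dot-separated word) is an assumption borrowed from common topic exchanges, not something the slides define.

```python
def topic_matches(pattern, topic):
    """Match dot-separated topics; '*' stands for exactly one word (assumption)."""
    pattern_words = pattern.split(".")
    topic_words = topic.split(".")
    return len(pattern_words) == len(topic_words) and all(
        p == "*" or p == t for p, t in zip(pattern_words, topic_words)
    )


class Exchange:
    """Routes each published message to every queue whose filter accepts it."""

    def __init__(self):
        self._subscriptions = []  # list of (predicate, queue) pairs

    def subscribe(self, predicate):
        subscriber_queue = []
        self._subscriptions.append((predicate, subscriber_queue))
        return subscriber_queue

    def publish(self, message):
        for predicate, subscriber_queue in self._subscriptions:
            if predicate(message):
                subscriber_queue.append(message)


# Topic-based: the filter looks only at the topic annotation.
topic_exchange = Exchange()
germany = topic_exchange.subscribe(lambda m: topic_matches("news.germany", m["topic"]))
all_news = topic_exchange.subscribe(lambda m: topic_matches("news.*", m["topic"]))
topic_exchange.publish({"topic": "news.germany", "body": "M1"})
topic_exchange.publish({"topic": "news.greece", "body": "M2"})

# Content-based: the filter inspects the structured message body.
content_exchange = Exchange()
expensive_apple = content_exchange.subscribe(
    lambda m: m["company"] == "apple" and m["price"] > 120
)
content_exchange.publish({"company": "apple", "price": 123})
content_exchange.publish({"company": "apple", "price": 100})
```

Content-based filters are more expressive but must be evaluated against every message body, which is one plausible reason topic-based systems dominate in production.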

Internet-Scale Applications and Pub/Sub
- Centralized brokers are a single point of failure and a bottleneck for scalable applications
- Brokers can be clustered
  - To increase availability and throughput
  - Requires inter-broker message routing and a communication protocol
  - Creates distributed (replicated, partitioned) subscription state and all associated issues
[Figure: a single broker with one exchange next to a cluster of brokers]

Broker Clustering
- Producers and consumers may attach to any broker of the cluster
- How are messages routed?
- Important goal: efficiency (throughput, bandwidth, latency, consistency, processing cost)
[Figure: producers and consumers attached to different brokers of a cluster; the route between brokers is marked with a question mark]

Topic-Based Pub/Sub and Clustering
- Brokers compute a routing tree
  - Based on upstream consumer subscriptions
  - Subscription filters are aggregated along the path
[Figure: consumers #1 to #n subscribe to news.eu.germany and news.eu.greece; filters are aggregated upstream from broker to broker (e.g., to news.eu.*) towards the producer, which publishes news.eu.germany]

Implementations & Co.
- Pub/Sub implementations
  - RabbitMQ, Apache ActiveMQ, Amazon Simple Queue Service (Amazon SQS), Apache Qpid, IBM WebSphere MQ, Apache Kafka, Kestrel, ...
- Protocols
  - Advanced Message Queuing Protocol (AMQP), ZeroMQ Message Transport Protocol (ZMTP), Streaming Text Oriented Messaging Protocol (STOMP), XMPP publish-subscribe extension, ...
- Others
  - Java Message Service (JMS), ...
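Upstream filter aggregation in a broker routing tree might look like the sketch below: a broker merges the filters of its downstream subscribers into one filter that it announces upstream. The merge rule (keep words shared by all filters, replace differing words with `*`, require equal length) is an assumption for illustration; the slides do not prescribe a specific algorithm.

```python
def aggregate_filters(filters):
    """Merge topic filters of equal length into one covering upstream filter."""
    split_filters = [f.split(".") for f in filters]
    if len({len(words) for words in split_filters}) != 1:
        raise ValueError("this sketch only merges filters with equally many words")
    merged = []
    for words_at_position in zip(*split_filters):
        # Keep the word if every filter agrees on it, otherwise widen to '*'.
        if len(set(words_at_position)) == 1:
            merged.append(words_at_position[0])
        else:
            merged.append("*")
    return ".".join(merged)
```

The aggregated filter may let through messages that no downstream consumer wants (e.g., news.eu.france), so each broker still has to match messages against the exact subscriptions it holds.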

Pub/Sub: Conclusion
- Advantages
  - Scalability / clustering / federation
  - Loose coupling: publisher and subscriber are unaware of each other
  - Flexibility and extensibility (add new components at runtime, ...)
  - Disconnected operation
- Disadvantages
  - Loose coupling: hard to give guarantees
    - What happens if a message is stuck in a queue forever, e.g., because there is no consumer?
  - Brokers are a centralized bottleneck
    - What about efficient and Internet-scale distributed brokers?

Map / Reduce

Map / Reduce
- Goal: distributed processing of big data sets
  - Typically too large to be processed on a single machine
  - Moving all data would be too costly (time/bandwidth)
- Map / Reduce: framework for processing embarrassingly parallel problems on a large number of computers
  - Initially developed by Google (2004)
  - Well-known open-source implementation: Apache Hadoop, ...

Parallels to Parallel Programming
- Programming paradigm to
  - Split up a program into several tasks
  - Process these tasks on different machines in parallel
  - Requires that the algorithm is parallelizable
- Realization in Google's Map/Reduce: the program is split onto n machines
  - 1 master
    - Splits up the full task into subtasks
    - Assigns subtasks to workers
    - Receives results from workers
  - n-1 workers
    - Receive jobs and data from the master
    - Return results to the master

Example: Approximating π (1)
- Area of the square: $A_s = (2r)^2 = 4r^2$
- Area of the circle: $A_c = \pi r^2$
- Hence $\frac{A_c}{A_s} = \frac{\pi}{4}$, i.e., $\pi = \frac{4 A_c}{A_s}$
- Approximation steps
  1. Randomly generate $k_s$ points in the square
  2. Count the number $k_c$ of points located in the circle
  3. Approximate $\pi \approx \frac{4 k_c}{k_s}$
[Figure: a circle of radius r inscribed in a square]

Example: Approximating π (2)
- Accuracy increases with larger $k_s$
- Step 2 (check whether a point is in the circle) is parallelizable
- Execution
  - The master generates $k_s$ random points in the square
  - Each of the n workers $w_i$ ($1 \le i \le n$) is assigned a list of points
  - Each worker counts the number of points in its list that are located in the circle ($k_{i,c}$ for worker $w_i$) and gives the result to the master
  - The master computes $\pi \approx \frac{4}{k_s} \sum_{i=1}^{n} k_{i,c}$
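The master/worker execution for approximating π can be sketched as below. For simplicity the workers run one after another in a single process; in a real deployment each `count_in_circle` call would run on a different machine. The function names, the unit radius, and the fixed seed are assumptions for this example.

```python
import random


def generate_points(k_s, rng):
    """Master: generate k_s random points in the square [-1, 1] x [-1, 1] (r = 1)."""
    return [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(k_s)]


def count_in_circle(points):
    """Worker: count the points of its list that lie inside the unit circle."""
    return sum(1 for x, y in points if x * x + y * y <= 1.0)


def approximate_pi(k_s, n_workers, seed=0):
    rng = random.Random(seed)
    points = generate_points(k_s, rng)
    # Master: assign every worker w_i a list of points.
    chunks = [points[i::n_workers] for i in range(n_workers)]
    # Workers: compute k_{i,c}; these calls are independent of each other.
    counts = [count_in_circle(chunk) for chunk in chunks]
    # Master: pi is approximately 4 / k_s times the sum of all k_{i,c}.
    return 4.0 * sum(counts) / k_s
```

Since the workers only count, the result does not depend on how the points are split; only the wall-clock time would change with the number of workers.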

Parallelizing Tasks (1)
- It is an unsolved problem whether all algorithms solving problems in the classes P or NP are parallelizable
- There are many such sequential algorithms without a known parallel equivalent
- Many scientists assume that there is no general parallelizability property

Parallelizing Tasks (2)
- Obviously parallelizable tasks
  - Brute-force attack on encrypted documents
  - Counting word occurrences in several documents
- Not (obviously) parallelizable tasks
  - Computing the Fibonacci function: $f_{k+2} = f_{k+1} + f_k$
  - In general: tasks where the next step depends on the results of the previous step(s)
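The contrast between the two kinds of tasks can be made concrete with a small sketch. Counting a word in several documents has no dependencies between documents, so the per-document counts can be computed concurrently; the iterative Fibonacci loop needs the two previous values in every step. The function names and the start values $f_0 = 0$, $f_1 = 1$ are assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_word(document, word):
    """Count the occurrences of one word in one document."""
    return document.lower().split().count(word)


def count_word_parallel(documents, word):
    """Each document is processed independently; only the final sum is shared."""
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda document: count_word(document, word), documents))


def fibonacci(k):
    """Each iteration depends on the results of the previous two iterations."""
    previous, current = 0, 1
    for _ in range(k):
        previous, current = current, previous + current
    return previous
```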

Map / Reduce
- Two or three steps
  - Map
  - Combine (optional)
  - Reduce
- Mode of operation
  - Input data is either distributed (e.g., server log files) or split into multiple parts (e.g., a single file)
  - Map preprocesses input data to an intermediate format
  - The framework dispatches intermediate results to reducers
  - Reduce combines intermediate results to final results

Example (word count)
- Input data: "The lecturer is the best. And this is written by research assistants. Yada, yada, yada, ..."
- One split: "The lecturer is the best"
- Map output: [ (the, 1), (lecturer, 1), (is, 1), (the, 1), (best, 1) ]
- Input of one reducer: [ (the, [1, 1]) ]
- Output of that reducer: [ (the, [2]) ]
- Final result: [ (the, 2), (lecturer, 1), (is, 1), (best, 1) ]
[Figure: input data (file, callback, ...) is divided into splits 1 to 3; each split is processed by a map task that writes intermediate results into partitions 1 to n; partition i of every map task is sent to reducer i]

- Map: K × V → (L × W)*
  - K and L: sets containing keys
  - V and W: sets containing values
  - All elements in a set are of the same data type (e.g., String)
  - Maps a key/value pair to a list of key/value pairs
- Intermediate results
  - Partitioned into n partitions
    - E.g., partition = hash(key) mod n
  - Transforms (L × W)* → L × W*
    - E.g., (the, 1), (the, 1) → (the, [1, 1])
  - Handled by the framework
- Reduce: L × W* → W*
  - E.g., (the, [1, 1]) → 2
  - The framework saves the result as (the, 2)

Example Map and Reduce Functions
Input document: "The lecturer is the best"

    /* key: document name, value: document content */
    map(String key, String value) {
        // Lower-case the text so that "The" and "the" count as the same word
        for (String w : value.toLowerCase().split(" ")) {
            EmitIntermediate(w, 1);
        }
    }

    /* key: a word, values: list of intermediate results */
    reduce(String key, Iterable<Integer> values) {
        int result = 0;
        for (int value : values) {
            result += value;
        }
        Emit(result);
    }

- Map: [(the, 1), (lecturer, 1), (is, 1), (the, 1), (best, 1)]
- Intermediate: [(the, [1, 1]), (lecturer, [1]), (is, [1]), (best, [1])]
- Reduce: [(the, [2]), (lecturer, [1]), (is, [1]), (best, [1])]
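The three stages defined above (map, framework-side partitioning and grouping, reduce) can be simulated in a single process. This is only a sketch of the data flow, not how Hadoop or Google's implementation works internally; the function names are hypothetical, and `zlib.crc32` stands in for hash(key) because Python's built-in string hash differs between runs.

```python
import re
import zlib
from collections import defaultdict


def map_fn(key, value):
    """Map: (document name, content) -> list of (word, 1) pairs."""
    return [(word, 1) for word in re.findall(r"[a-z]+", value.lower())]


def partition(intermediate, n):
    """Framework: group values by key inside partition hash(key) mod n."""
    partitions = [defaultdict(list) for _ in range(n)]
    for key, value in intermediate:
        index = zlib.crc32(key.encode("utf-8")) % n
        partitions[index][key].append(value)
    return partitions


def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> total count."""
    return sum(values)


def map_reduce(documents, n_reducers=2):
    intermediate = []
    for name, content in documents.items():
        intermediate.extend(map_fn(name, content))
    result = {}
    # Each partition would be handled by its own reducer.
    for part in partition(intermediate, n_reducers):
        for key, values in part.items():
            result[key] = reduce_fn(key, values)
    return result
```

Because all pairs with the same key land in the same partition, every reducer sees the complete value list for its keys and no coordination between reducers is needed.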

Distributed Stream Processing

Distributed Stream Processing
- Pub/Sub
  - Moving messages between data processing services efficiently
- Map/Reduce
  - Move data processing services to the data
  - Limitation: batch processing, only for sufficiently parallelizable jobs
- Distributed real-time stream processing
  - Mixture of Pub/Sub and Map/Reduce
  - Guaranteed data processing
  - No intermediate message brokers
  - Goal: create graphs of data sources and data processors
[Figure: continuous streams]

Distributed Stream Processing
- Continuous real-time processing of data streams
  - Batch processing in map/reduce
    - Eventually emits a final result
  - Continuous calculation of results
    - Runs forever or until stopped
    - Produces new results over time
- Scalability / parallelism
  - Applications are composed of parallelizable data processing services
  - No centralized component (brokers); components exchange data directly
  - Applications can be distributed globally (Internet scale)
[Figure: continuous streams]

Example: Storm Framework [storm]
- Implementation of a distributed stream processing framework
  - Open-source project from Twitter
  - Used by many companies (e.g., Twitter, Groupon, etc.) to run their business logic
- Set of data sources and data processing services
  - Connections are defined by a graph in which streams are edges
[Figure: sources feeding a graph of processing services]

Storm: Streams and Tuples
- Stream: unbounded sequence of tuples
- Tuple: named list of values
[Figure: a stream drawn as a sequence of tuples]

Storm: Spouts
- A spout is a source of a stream
- E.g., a spout can read data and emit it as tuples from a
  - File
  - Message queue
  - Web service
  - RSS feed
[Figure: a spout emitting two streams of tuples]

Storm: Bolts
- Bolts process input streams and produce new streams
- Bolts can implement arbitrary logic, e.g.,
  - Transformation functions
  - Filters
  - Aggregation functions
  - Joins
  - Database access
[Figure: a bolt consuming streams of tuples and emitting a new stream]

Storm: Topology
- A topology is a network of spouts and bolts
- Spouts and bolts are implemented in a parallelizable manner
  - I.e., multiple instances can be spawned
  - The framework executes them in parallel on a Storm cluster as tasks
  - Tasks communicate using message queues

Storm Example: Word Count
- Count words in a continuous stream of data and emit a new word count for each word as soon as it changes
- Topology
  - A spout reads sentences from some data source and emits them
  - One bolt splits sentences into individual words
  - Another bolt keeps track of the number of occurrences of individual words and emits the word and count on change
[Figure: source (e.g., message queue with M1 to M3) → "sentences" spout emits (sentence, "The lecturer is the") → "split" bolt emits (word, "the"), (word, "lecturer"), (word, "is"), (word, "the") → "count" bolt emits (the, 1), (lecturer, 1), (is, 1), (the, 2)]

Storm Example: Word Count
- The framework runs multiple instances of spout and bolt implementations in parallel
- To guarantee unambiguous results, groupings can be defined
- Groupings
  - Shuffle: randomly picks a task
  - Field: makes sure one individual word always goes to the same task
[Figure: source (e.g., message queue with M1 to M3) → "sentences" spout → shuffle grouping → "split" tasks emit (word, "lecturer"), (word, "the") and (word, "car"), (word, "drives") → field grouping on "word" → "count" tasks emit (the, 1), (lecturer, 1), (the, 2), (is, 1) and (car, 1), (drives, 1)]
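The effect of the two groupings on the word count topology can be simulated in plain Python. This sketch deliberately does not use the Storm API; `CountBolt`, `run_topology`, and the grouping functions are hypothetical names, and `zlib.crc32` is an assumed stand-in for whatever hash the framework applies to the grouping field.

```python
import random
import zlib
from collections import Counter


def split_bolt(sentence):
    """Split bolt: one sentence tuple in, one tuple per word out."""
    return sentence.lower().split()


class CountBolt:
    """Count bolt task: keeps its own local word counts."""

    def __init__(self):
        self.counts = Counter()

    def process(self, word):
        self.counts[word] += 1
        return (word, self.counts[word])  # emitted whenever the count changes


def shuffle_grouping(n_tasks, rng):
    """Shuffle grouping: pick a random task."""
    return rng.randrange(n_tasks)


def fields_grouping(word, n_tasks):
    """Field grouping: the same word is always sent to the same task."""
    return zlib.crc32(word.encode("utf-8")) % n_tasks


def run_topology(sentences, n_count_tasks=3, grouping="fields", seed=0):
    rng = random.Random(seed)
    tasks = [CountBolt() for _ in range(n_count_tasks)]
    emitted = []
    for sentence in sentences:
        for word in split_bolt(sentence):
            if grouping == "fields":
                index = fields_grouping(word, n_count_tasks)
            else:
                index = shuffle_grouping(n_count_tasks, rng)
            emitted.append(tasks[index].process(word))
    return emitted, tasks
```

With field grouping every word has exactly one counter, so the emitted counts are unambiguous. With shuffle grouping the occurrences of a word may be spread over several tasks, each emitting only a partial count.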

Storm: Stream Groupings
- Stream groupings define to which task a tuple is sent when it is emitted
- Examples
  - Shuffle & Field (see the word count example)
  - All: replicates to all tasks
  - Global: task with the lowest ID
  - None: don't care
  - Direct: the emitter defines the receiver

Conclusion
- Humans create exponentially growing amounts of data
  - Big data challenge
- Expressing data processing algorithms using functional primitives allows efficient scaling and distribution of computations on large (distributed) data sets
  - Precondition: the algorithm must be sufficiently parallelizable
- New paradigm to design applications
  - Not only for big but also for small data
  - Scale from start-up to global player

Literature
- [mapreduce] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters.
- [storm] Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
- [akka] Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.


GigaSpaces Real-Time Analytics for Big Data GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and

More information

Chapter 1 - Web Server Management and Cluster Topology

Chapter 1 - Web Server Management and Cluster Topology Objectives At the end of this chapter, participants will be able to understand: Web server management options provided by Network Deployment Clustered Application Servers Cluster creation and management

More information

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Apache HBase. Crazy dances on the elephant back

Apache HBase. Crazy dances on the elephant back Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage

More information

Wisdom from Crowds of Machines

Wisdom from Crowds of Machines Wisdom from Crowds of Machines Analytics and Big Data Summit September 19, 2013 Chetan Conikee Irfan Ahmad About Us CloudPhysics' mission is to discover the underlying principles that govern systems behavior

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! [email protected]

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! [email protected] 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Exploring Oracle E-Business Suite Load Balancing Options. Venkat Perumal IT Convergence

Exploring Oracle E-Business Suite Load Balancing Options. Venkat Perumal IT Convergence Exploring Oracle E-Business Suite Load Balancing Options Venkat Perumal IT Convergence Objectives Overview of 11i load balancing techniques Load balancing architecture Scenarios to implement Load Balancing

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Database Scalability and Oracle 12c

Database Scalability and Oracle 12c Database Scalability and Oracle 12c Marcelle Kratochvil CTO Piction ACE Director All Data/Any Data [email protected] Warning I will be covering topics and saying things that will cause a rethink in

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

Event-based middleware services

Event-based middleware services 3 Event-based middleware services The term event service has different definitions. In general, an event service connects producers of information and interested consumers. The service acquires events

More information

System Models for Distributed and Cloud Computing

System Models for Distributed and Cloud Computing System Models for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Classification of Distributed Computing Systems

More information

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA Processing billions of events every day Neha Narkhede Co-founder and Head of Engineering @ Stealth Startup Prior to this Lead, Streams Infrastructure

More information

Kafka & Redis for Big Data Solutions

Kafka & Redis for Big Data Solutions Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)

More information

Next Generation Open Source Messaging with Apache Apollo

Next Generation Open Source Messaging with Apache Apollo Next Generation Open Source Messaging with Apache Apollo Hiram Chirino Red Hat Engineer Blog: http://hiramchirino.com/blog/ Twitter: @hiramchirino GitHub: https://github.com/chirino 1 About me Hiram Chirino

More information

Domain driven design, NoSQL and multi-model databases

Domain driven design, NoSQL and multi-model databases Domain driven design, NoSQL and multi-model databases Java Meetup New York, 10 November 2014 Max Neunhöffer www.arangodb.com Max Neunhöffer I am a mathematician Earlier life : Research in Computer Algebra

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

FIVE SIGNS YOU NEED HTML5 WEBSOCKETS

FIVE SIGNS YOU NEED HTML5 WEBSOCKETS FIVE SIGNS YOU NEED HTML5 WEBSOCKETS A KAAZING WHITEPAPER Copyright 2011 Kaazing Corporation. All rights reserved. FIVE SIGNS YOU NEED HTML5 WEBSOCKETS A KAAZING WHITEPAPER HTML5 Web Sockets is an important

More information

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

CSE-E5430 Scalable Cloud Computing Lecture 11

CSE-E5430 Scalable Cloud Computing Lecture 11 CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Limitations of Packet Measurement

Limitations of Packet Measurement Limitations of Packet Measurement Collect and process less information: Only collect packet headers, not payload Ignore single packets (aggregate) Ignore some packets (sampling) Make collection and processing

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Advanced Data Management Technologies

Advanced Data Management Technologies ADMT 2015/16 Unit 15 J. Gamper 1/53 Advanced Data Management Technologies Unit 15 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information

More information

Distributed File Systems

Distributed File Systems Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Scalable Architecture on Amazon AWS Cloud

Scalable Architecture on Amazon AWS Cloud Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies [email protected] 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect

More information

Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus. 2010 IBM Corporation

Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus. 2010 IBM Corporation Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus Agenda BPM Follow-up SOA and ESB Introduction Key SOA Terms SOA Traps ESB Core functions Products and Standards Mediation Modules

More information

INTRODUCING APACHE IGNITE An Apache Incubator Project

INTRODUCING APACHE IGNITE An Apache Incubator Project WHITE PAPER BY GRIDGAIN SYSTEMS FEBRUARY 2015 INTRODUCING APACHE IGNITE An Apache Incubator Project COPYRIGHT AND TRADEMARK INFORMATION 2015 GridGain Systems. All rights reserved. This document is provided

More information

Lecture 10 - Functional programming: Hadoop and MapReduce

Lecture 10 - Functional programming: Hadoop and MapReduce Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion

More information

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional

More information

Big Data Storage Architecture Design in Cloud Computing

Big Data Storage Architecture Design in Cloud Computing Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,

More information

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing

More information

NOT IN KANSAS ANY MORE

NOT IN KANSAS ANY MORE NOT IN KANSAS ANY MORE How we moved into Big Data Dan Taylor - JDSU Dan Taylor Dan Taylor: An Engineering Manager, Software Developer, data enthusiast and advocate of all things Agile. I m currently lucky

More information

Apache Kafka Your Event Stream Processing Solution

Apache Kafka Your Event Stream Processing Solution 01 0110 0001 01101 Apache Kafka Your Event Stream Processing Solution White Paper www.htcinc.com Contents 1. Introduction... 2 1.1 What are Business Events?... 2 1.2 What is a Business Data Feed?... 2

More information

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

A standards-based approach to application integration

A standards-based approach to application integration A standards-based approach to application integration An introduction to IBM s WebSphere ESB product Jim MacNair Senior Consulting IT Specialist [email protected] Copyright IBM Corporation 2005. All rights

More information