Future Internet Technologies: Big (?) Data Processing
Dr. Dennis Pfisterer, Institut für Telematik, Universität zu Lübeck
http://www.itm.uni-luebeck.de/people/pfisterer

FIT Until Now (topics covered so far): Client/Server, SPDY, Big Data Processing, HTTP, REST, CoAP, P2P Networking, Design Principles, SCTP, IPv4, LISP, IPv6, 6LoWPAN, Frame Relay, ATM, GSM, HIP, DTN, Mobile IP, MPLS
Traditional Application-Level Architectures
- Client/Server: Remote Procedure Call (RPC), REST (HTTP, CoAP, SPDY), replication and load balancing, distributed applications, n-tier architectures, Service-Oriented Architecture (SOA)
- Peer-to-Peer: Distributed Hash Tables (DHT), ...

Client/Server Interaction Model
- No server-initiated data transfer, often due to the statelessness of the server
- Results in unnecessary polling to transfer data from server to client
- Causes bandwidth, processing, and memory overhead
[Figure: client sends a request, server returns a response]
3- and 4-Tier Architectures: 3-Tier
- Tier 1: Presentation
- Tier 2: Business Logic (server)
- Tier 3: Data (DB servers)

3- and 4-Tier Architectures: 4-Tier
- Tier 1: Presentation
- Tier 2: Web Server
- Tier 3: Application Server (e.g., App Server 1, App Server 2)
- Tier 4: Data (DB servers)
Replication and Load Balancing
- Application and data are replicated on many servers
- A load balancer distributes load across, e.g., LAMP hosts (Linux, Apache, MySQL, PHP)
- Bottleneck: the database
  - May be clustered (i.e., replicated and/or partitioned) to increase performance
  - Problem: consistency requirements (ACID) of SQL databases limit scalability
[Figure: load balancer in front of application instances on servers #1..#n, all sharing one state store]

Distributed Applications
- Distribute the application (including its state) amongst different servers, in contrast to running multiple instances on multiple servers
- Typically, the application state is partitioned (e.g., Google BigTable)
- Better scalability for a certain class of applications
- More scalable if less strict consistency requirements are imposed (NoSQL)
- E.g., Facebook, Twitter, ...
Locality
- For Internet-scale applications, state is distributed globally
- Data are typically moved to where usage is expected
- E.g., German videos are hosted in Germany

Distributed Applications
- Requirement on today's applications: perform efficiently on an Internet scale
- Scalability without changing the application
Exemplary Application
[Figure: interconnected services for ad delivery, users, user behavior mining, a web GUI, a friendship graph, application log analysis, and statistics]

Internet-Scale Applications
- It's not about servers anymore: data and data-processing services are what matter
- Issue: efficient distribution, processing, and storage of data
- Move away from fixed (n-tier) layering models
- Dynamic composition of processing services (at runtime, continuous deployment)
- Communication using messages
- Required: concepts for loosely coupled distributed applications
Message Queuing & Publish/Subscribe

Message Queuing
- Concept of message queues: data-processing services communicate asynchronously using message queues
- Decouples data producers and consumers
[Figure: producer puts messages M1, M2 into a message queue; a consumer takes them out]
Message Queuing: Mode of Operation (see the sketch after the next slide)
- Producers dispatch messages into a queue
- Consumers take messages from a queue
- Multiple queues for different purposes can be created
  - E.g., one for dispatching computational tasks, one for logging, ...
- Producers, consumers, and queues may run on different systems

Message Queuing: Properties
- Distributed realization of the producer/consumer problem (cf. thread synchronization)
- Producer and consumer are only aware of the queue
  - The association can be changed (dynamically) without changing the application
- Producer and consumer don't need to be available at the same time
  - The queue provides time decoupling and allows asynchronous operation
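The following is a minimal, JVM-local sketch of this producer/consumer decoupling, using java.util.concurrent.BlockingQueue as a stand-in for a real (possibly distributed) message queue; class and message names are illustrative only.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSketch {
    public static void main(String[] args) {
        // The queue is the only thing producer and consumer share.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) {
                    queue.put("M" + i); // dispatch; blocks only if the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) {
                    System.out.println("consumed " + queue.take()); // blocks if empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}

Note that the producer never references the consumer (and vice versa): swapping either side, or adding further consumers, requires no change to the other.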
Message Queuing: Properties (cont.)
- Queues can have multiple consumers
  - E.g., to distribute messages (e.g., tasks T1..T4) to a set of consumers
  - Sometimes called work queues
  - Different message-dispatching strategies are possible (e.g., round robin, fairness, load-based, ...)

Delivery Semantics
- Best-effort: messages are lost if no consumer is listening
- Guaranteed (acknowledged) delivery: on the application layer (cf. DTN custody transfer)
- Persistent: messages are persisted until consumed
- Transient (not persistent): messages are lost if the queue crashes
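To make persistent, acknowledged delivery concrete, here is a hedged sketch using the RabbitMQ Java client (com.rabbitmq.client); the broker host ("localhost") and queue name ("tasks") are assumptions for the example, not part of the slides.

import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;
import com.rabbitmq.client.MessageProperties;

public class AckedDelivery {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location
        Connection conn = factory.newConnection();
        Channel ch = conn.createChannel();

        // durable = true: the queue definition survives a broker restart
        ch.queueDeclare("tasks", true, false, false, null);

        // PERSISTENT_TEXT_PLAIN: the message itself is written to disk
        ch.basicPublish("", "tasks",
                MessageProperties.PERSISTENT_TEXT_PLAIN, "T1".getBytes("UTF-8"));

        // autoAck = false: the broker keeps (and redelivers) the message
        // until the consumer explicitly acknowledges it
        ch.basicConsume("tasks", false, new DefaultConsumer(ch) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                    AMQP.BasicProperties properties, byte[] body)
                    throws java.io.IOException {
                System.out.println("consumed " + new String(body, "UTF-8"));
                ch.basicAck(envelope.getDeliveryTag(), false);
            }
        });
        // connection left open on purpose: a real worker runs until stopped
    }
}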
Message-Based RPC
- RPC mapped to request and response messages
- Correlation of request and response is based on IDs
- Inherits its delivery semantics from message queuing
[Figure: client sends a request (id = 123) via a request queue to the server; the server replies with a response (id = 123) via a response queue]
(A correlation sketch follows after the next slide.)

Publish/Subscribe Pattern (Pub/Sub)
- Builds on top of individual message queues
- Publishers publish messages to an exchange
- Consumers subscribe to an exchange
- Exchanges route messages to queues based on subscriptions
- A broker manages a set of exchanges and queues
- Distribution of published messages to subscribers: broadcast, topic-based, or content-based (next slides)
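Returning to message-based RPC: a minimal, transport-agnostic sketch of the ID-based correlation, with plain BlockingQueues standing in for the request and response queues (all names are illustrative; requires Java 16+ for records).

import java.util.Map;
import java.util.concurrent.*;

public class MessageRpc {
    record Message(long id, String payload) {}

    static BlockingQueue<Message> requestQueue = new LinkedBlockingQueue<>();
    static BlockingQueue<Message> responseQueue = new LinkedBlockingQueue<>();
    static Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        // Server: processes requests, copying the correlation ID.
        Thread server = new Thread(() -> {
            try {
                while (true) {
                    Message req = requestQueue.take();
                    responseQueue.put(new Message(req.id(), "echo: " + req.payload()));
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        server.setDaemon(true);
        server.start();

        // Client-side dispatcher: matches responses to pending requests by ID.
        Thread dispatcher = new Thread(() -> {
            try {
                while (true) {
                    Message resp = responseQueue.take();
                    pending.remove(resp.id()).complete(resp.payload());
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        dispatcher.setDaemon(true);
        dispatcher.start();

        // Issue a request with ID 123 and wait for the matching response.
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(123L, future);
        requestQueue.put(new Message(123L, "hello"));
        System.out.println(future.get()); // prints "echo: hello"
    }
}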
Topic-Based Publish/Subscribe
- Consumers subscribe to an exchange with a certain topic
- Messages are annotated with a topic; topics are arbitrary strings
- The exchange matches subscriptions against message topics and routes messages to the corresponding queues
- Example: M1 has topic news.germany, M2 has topic news.greece; a consumer subscribed to news.germany receives only M1

Content-Based Publish/Subscribe
- Consumers subscribe to an exchange with content-based filters
- The exchange routes messages based on a message's contents
  - Typically structured data (e.g., JSON, XML)
- Most production systems use topic-based pub/sub
- Example: M1 = { company: "apple", price: 123$ }; a consumer with filter company = "apple" && price > 120 receives M1, while one with company = "apple" && 110 < price < 120 does not
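A toy, in-process topic exchange to make the routing concrete. The "*" wildcard matching exactly one topic segment is an assumption borrowed from AMQP/RabbitMQ-style topic exchanges; everything else is plain standard-library Java.

import java.util.*;

public class TopicExchange {
    private final Map<String, List<Deque<String>>> subs = new HashMap<>();

    // Subscribe a queue with a topic pattern, e.g. "news.germany" or "news.*".
    void subscribe(String pattern, Deque<String> queue) {
        subs.computeIfAbsent(pattern, p -> new ArrayList<>()).add(queue);
    }

    // Route a message to every queue whose subscription pattern matches the topic.
    void publish(String topic, String message) {
        subs.forEach((pattern, queues) -> {
            // translate pattern to regex: '.' is literal, '*' matches one segment
            String regex = pattern.replace(".", "\\.").replace("*", "[^.]+");
            if (topic.matches(regex)) {
                queues.forEach(q -> q.add(message));
            }
        });
    }

    public static void main(String[] args) {
        TopicExchange exchange = new TopicExchange();
        Deque<String> germany = new ArrayDeque<>(), allNews = new ArrayDeque<>();
        exchange.subscribe("news.germany", germany);
        exchange.subscribe("news.*", allNews);
        exchange.publish("news.germany", "M1");
        exchange.publish("news.greece", "M2");
        System.out.println(germany); // [M1]
        System.out.println(allNews); // [M1, M2]
    }
}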
Internet-Scale Applications and Pub/Sub
- Centralized brokers are a single point of failure and a bottleneck for scalable applications
- Brokers can be clustered to increase availability and throughput
  - Requires inter-broker message routing and a communication protocol
  - Creates distributed (replicated, partitioned) subscription state and all associated issues

Broker Clustering
- Producers and consumers may attach to any broker of the cluster
- How are messages routed?
- Important goal: efficiency (throughput, bandwidth, latency, consistency, processing cost)
[Figure: producers and consumers attached to different brokers within a cluster of brokers]
Topic-Based Pub/Sub and Clustering
- Brokers compute a routing tree based on upstream consumer subscriptions
- Subscription filters are aggregated along the path (upstream filter aggregation)
- Example: consumer subscriptions news.eu.germany and news.eu.greece are aggregated to news.eu.* at an upstream broker; a message published to news.eu.germany then follows the tree only towards matching consumers

Implementations & Co.
Pub/Sub implementations:
- RabbitMQ, http://www.rabbitmq.com
- Apache ActiveMQ, http://activemq.apache.org
- Amazon Simple Queue Service (Amazon SQS), http://aws.amazon.com/sqs
- Apache Qpid, http://qpid.apache.org
- IBM WebSphere MQ, http://www-01.ibm.com/software/integration/wmq
- Apache Kafka, http://incubator.apache.org/kafka
- Kestrel, http://robey.github.com/kestrel
Protocols:
- Advanced Message Queuing Protocol (AMQP), http://www.amqp.org
- ZeroMQ Message Transport Protocol (ZMTP), http://www.zeromq.org
- Streaming Text Oriented Messaging Protocol (STOMP), http://stomp.github.com
- XMPP publish-subscribe extension, http://xmpp.org/extensions/xep-0060.html
Others:
- Java Message Service (JMS), http://java.sun.com/developer/technicalarticles/ecommerce/jms
Pub/Sub: Conclusion
Advantages:
- Scalability / clustering / federation
- Loose coupling: publisher and subscriber are unaware of each other
- Flexibility and extensibility (add new components at runtime, ...)
- Disconnected operation
Disadvantages:
- Loose coupling makes it hard to give guarantees
  - What happens if a message is stuck in a queue forever, e.g., because there is no consumer?
- Brokers are a centralized bottleneck
- What about efficient and Internet-scale distributed brokers?

Map / Reduce
Map / Reduce
- Goal: distributed processing of big data sets
  - Typically too large to be processed on a single machine
  - Moving all data would be too costly (time/bandwidth)
- Map/Reduce: a framework for processing embarrassingly parallel problems on a large number of computers
- Initially developed by Google (2004)
- Well-known open-source implementation: Apache Hadoop, http://hadoop.apache.org

Parallels to Parallel Programming
- Programming paradigm: split a program into several tasks and process these tasks on different machines in parallel
- Requires that the algorithm is parallelizable
- Realization in Google's Map/Reduce: the program is split onto n machines
  - 1 master: splits the full task into subtasks, assigns subtasks to workers, receives results from workers
  - n-1 workers: receive jobs and data from the master, return results to the master
Example: Approximating π (1)
- Consider a circle of radius r inscribed in a square:
  - Area of the square: A_s = (2r)² = 4r²
  - Area of the circle: A_c = πr²
  - Hence A_s / A_c = 4 / π, i.e., π = 4 · A_c / A_s
- Approximation steps:
  1. Randomly generate k_s points in the square
  2. Count the number of points k_c located inside the circle
  3. Approximate π ≈ 4 · k_c / k_s

Example: Approximating π (2)
- Accuracy increases with larger k_s
- Step 2 (check whether a point is in the circle) is parallelizable
- Execution:
  - The master generates k_s random points in the square
  - Each of the n workers w_i (1 ≤ i ≤ n) is assigned a list of points
  - Each worker counts the number of points in its list that are located inside the circle (k_i,c for worker w_i) and returns the result to the master
  - The master computes π ≈ (4 / k_s) · Σ_{i=1..n} k_i,c
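A compact, runnable sketch of this computation in plain Java. The parallel stream plays the role of the n workers, each chunk contributing its own k_i,c; points are drawn from the unit square with the quarter circle x² + y² ≤ 1, which yields the same ratio as the full circle in the full square.

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

public class MonteCarloPi {
    public static void main(String[] args) {
        long ks = 10_000_000L; // number of random points k_s
        long kc = LongStream.range(0, ks).parallel().filter(i -> {
            double x = ThreadLocalRandom.current().nextDouble();
            double y = ThreadLocalRandom.current().nextDouble();
            return x * x + y * y <= 1.0; // is the point inside the circle?
        }).count();
        System.out.println("pi ~ " + 4.0 * kc / ks);
    }
}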
Parallelizing Tasks (1)
- It is an open problem whether all algorithms for problems in the classes P or NP are parallelizable
- There are many such sequential algorithms without a known parallel equivalent
- Many scientists assume that there is no general parallelizability property
Map / Reduce Two or three steps Map Combine (optional) Reduce Dispatch Mode of operation Input data is either distributed (e.g., server log files) or split into multiple parts (e.g., single file) Map preprocesses input data to an intermediate format Framework dispatches intermediate results to reducers (see next slide) Reduce combines intermediate results to final results Figure source: http://wikis.gm.fh-koeln.de/wiki_db/uploads/datenbanken/mapreduce/mapreduce.png Security - 04 Cryptology #35 The lecturer is the best. And this is written by research assistants. Yada, yada, yada, The lecturer is the best [ (the, 1), (lecturer, 1), (is, 1), (the, 1), (best, 1) ] [ (the, [1, 1]) ] [ (the, [2]) ] [ (the, 2), (lecturer, 1), (is, 1) (best, 1) ] Partition 1 Split 1 Map Partition 2 2 Partition n Red 1 Input data Split 2 Map Partition 1 Partition 2 Red 2 Red File, Callback, Partition n Split 3 Map Partition 1 Partition 2 Red n Partition n Intermediate results (partitioned) Security - 04 Cryptology #36 18
Map / Reduce: Formal View
- Map: K × V → (L × W)*
  - K and L: sets containing keys; V and W: sets containing values
  - All elements in a set are of the same data type (e.g., String)
  - Maps a key/value pair to a list of key/value pairs
- Intermediate results
  - Partitioned into n partitions, e.g., partition = hash(key) mod n
  - Grouping transforms (L × W)* → L × W*, e.g., (the, 1), (the, 1) → (the, [1, 1])
  - Handled by the framework
- Reduce: L × W* → W*
  - E.g., (the, [1, 1]) → 2; the framework saves the result as (the, 2)

Example Map and Reduce Functions
Input document: "The lecturer is the best"

/* key: document name, value: document content */
map(String key, String value) {
  for (String w : value.split(" ")) {
    EmitIntermediate(w, 1);
  }
}

/* key: a word, values: list of intermediate results */
reduce(String key, Iterable<Integer> values) {
  int result = 0;
  for (int value : values) {
    result += value;
  }
  Emit(result);
}

Map → [(the, 1), (lecturer, 1), (is, 1), (the, 1), (best, 1)]
Intermediate → [(the, [1, 1]), (lecturer, [1]), (is, [1]), (best, [1])]
Reduce → [(the, 2), (lecturer, 1), (is, 1), (best, 1)]
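For comparison, a sketch of the same word count against Apache Hadoop's Java API (org.apache.hadoop.mapreduce); the job/driver setup is omitted, and the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: (offset, line) -> list of (word, 1)
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                word.set(w);
                context.write(word, ONE); // emit intermediate (word, 1)
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, count)
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // combine the grouped intermediate counts
            }
            context.write(key, new IntWritable(sum)); // final (word, count)
        }
    }
}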
Distributed Stream Processing

Distributed Stream Processing
- Pub/Sub: moves messages between data-processing services efficiently
- Map/Reduce: moves data-processing services to the data
  - Limitation: batch processing, and only for sufficiently parallelizable jobs
- Distributed real-time stream processing:
  - A mixture of Pub/Sub and Map/Reduce
  - Guaranteed data processing
  - No intermediate message brokers
  - Goal: create graphs of data sources and data processors over continuous streams
Distributed Stream Processing
- Continuous real-time processing of data streams
  - Batch processing in Map/Reduce eventually emits a final result
  - Stream processing computes results continuously: it runs forever (or until stopped) and produces new results over time
- Scalability / parallelism:
  - Applications are composed of parallelizable data-processing services
  - No centralized components (brokers); components exchange data directly
  - Applications can be distributed globally (Internet-scale)

Example: Storm Framework [storm]
- Implementation of a distributed stream processing framework
- Open-source project from Twitter
- Used by many companies (e.g., Twitter, Groupon, etc.) to run their business logic
- A set of data sources and data-processing services; connections are defined by a graph in which streams are the edges
[Figure: sources feeding into a graph of processing services]
Storm: Streams and Tuples
- Stream: an unbounded sequence of tuples
- Tuple: a named list of values

Storm: Spouts
- A spout is a source of a stream
- E.g., a spout can read data and emit it as tuples from a file, a message queue, a web service, an RSS feed, ...
Figure source: http://storm-project.net/images/topology.png
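A minimal spout sketch, assuming Apache Storm's Java API (recent releases use org.apache.storm packages; the original Twitter-era releases used backtype.storm). The hard-coded sentence stands in for a real source such as a message queue.

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = { "The lecturer is the best" };
    private int index = 0;

    @Override // Storm 2.x signature; 1.x uses a raw Map
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // called repeatedly by the framework; emit one tuple per call
        collector.emit(new Values(sentences[index]));
        index = (index + 1) % sentences.length;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence")); // tuples have one named field
    }
}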
Storm: Bolts
- Bolts process input streams and produce new streams
- Bolts can implement arbitrary logic, e.g., transformation functions, filters, aggregation functions, joins, database access
Figure source: http://storm-project.net/images/topology.png

Storm: Topology
- A topology is a network of spouts and bolts
- Spouts and bolts are implemented in a parallelizable manner, i.e., multiple instances can be spawned
- The framework executes them in parallel on a Storm cluster as tasks
- Tasks communicate using message queues
Figure source: https://github.com/nathanmarz/storm/wiki/tutorial
Storm Example: Word Count
- Count words in a continuous stream of data and emit the new word count for each word as soon as it changes
- Topology:
  - A spout reads sentences from some data source (e.g., a message queue) and emits them
  - One bolt splits sentences into individual words
  - Another bolt keeps track of the number of occurrences of individual words and emits the word and its count on every change
- Example flow: (sentence, "The lecturer is the ...") → (word, "the"), (word, "lecturer"), (word, "is"), (word, "the") → (the, 1), (lecturer, 1), (is, 1), (the, 2)
Figure source: https://github.com/nathanmarz/storm/wiki/tutorial

Storm Example: Word Count (cont.)
- The framework runs multiple instances of the spout and bolt implementations in parallel
- To guarantee unambiguous results, groupings can be defined:
  - Shuffle grouping randomly picks a task (used between the spout and the split bolt)
  - Field grouping makes sure one individual word always goes to the same task (used on "word" between the split and the count bolt)
Figure source: https://github.com/nathanmarz/storm/wiki/tutorial
(A sketch of the two bolts and the topology wiring follows below.)
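A sketch of the two bolts and the topology wiring, again assuming Apache Storm's Java API and reusing the SentenceSpout sketched earlier; the component names ("sentences", "split", "count") and parallelism hints are illustrative.

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountTopology {
    // Splits every incoming sentence tuple into one tuple per word.
    public static class SplitSentence extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split("\\s+")) {
                collector.emit(new Values(word));
            }
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Keeps a running count per word; emits (word, count) on every change.
    public static class WordCount extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<>();
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String word = input.getStringByField("word");
            int count = counts.merge(word, 1, Integer::sum);
            collector.emit(new Values(word, count));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        // shuffle grouping: any split task may process any sentence
        builder.setBolt("split", new SplitSentence(), 4)
               .shuffleGrouping("sentences");
        // fields grouping on "word": the same word always reaches the same task
        builder.setBolt("count", new WordCount(), 4)
               .fieldsGrouping("split", new Fields("word"));
        return builder;
    }
}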
Storm: Stream Groupings
- Stream groupings define to which task a tuple is sent when it is emitted
- Examples:
  - Shuffle & Field (previous slide)
  - All: replicates the tuple to all tasks
  - Global: sends to the task with the lowest ID
  - None: don't care
  - Direct: the emitter defines the receiver

Conclusion
- Humans create exponentially growing amounts of data: the big data challenge
- Expressing data-processing algorithms using functional primitives allows efficient scaling and distribution of computations on large (distributed) data sets
  - Precondition: the algorithm must be sufficiently parallelizable
- A new paradigm for designing applications
  - Not only for big but also for small data
  - Scale from start-up to global player
Literature
[mapreduce] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, 2004. http://research.google.com/archive/mapreduce.html
[storm] Storm: "a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing." http://storm-project.net
[akka] Akka: a toolkit and runtime for building highly concurrent, distributed, and fault-tolerant event-driven applications on the JVM. http://akka.io