Future Internet Technologies
- Egbert Ryan
- 10 years ago
Future Internet Technologies
Big (?) Data Processing
Dr. Dennis Pfisterer
Institut für Telematik, Universität zu Lübeck

FIT Until Now
[Figure: overview of the topics covered so far. Labels: Architectures, Client-Server, SPDY, Big Data Processing, HTTP, REST, CoAP, P2P, Networking, Design Principles, SCTP, IPv4, LISP, IPv6, 6LoWPAN, Frame Relay, ATM, GSM, HIP, DTN, Mobile IP, MPLS]
Security - 04 Cryptology
Traditional Application-Level Architectures
- Client/Server
- Remote Procedure Call (RPC)
- REST (HTTP, CoAP, SPDY)
- Replication and Load Balancing
- Distributed Applications
- N-Tier Architectures
- Service-Oriented Architecture (SOA)
- Peer-to-Peer
- Distributed Hash Tables (DHT), ...

Client/Server Interaction Model
- No server-initiated data transfer
  - Often due to statelessness of the server
- Results in unnecessary polling to transfer data from server to client
  - Bandwidth, processing, and memory overhead
[Figure: request sent to the server, response returned by the server]
3- and 4-Tier Architectures: 3-Tier
- Tier 1: Presentation
- Tier 2: Business Logic
- Tier 3: Data
[Figure: a server connected to two DB servers]

3- and 4-Tier Architectures: 4-Tier
- Tier 1: Presentation
- Tier 2: Web Server
- Tier 3: Application Server
- Tier 4: Data
[Figure: web server, App Server 1 and App Server 2, two DB servers]
Replication and Load Balancing
- Application and data are replicated on many servers
- Load balancer distributes load across, e.g., LAMP hosts (Linux, Apache, MySQL, PHP)
- Bottleneck: database
  - May be clustered (i.e., replicated and/or partitioned) to increase performance
  - Problem: consistency requirements (ACID) of SQL databases limit scalability
[Figure: load balancer in front of application instances on Server #1, Server #2, ..., Server #n, all sharing one state]

Distributed Applications
- Distribute the application (including its state) amongst different servers
  - In contrast to multiple instances on multiple servers
- Typically, the application state is partitioned (e.g., Google BigTable)
- Better scalability for a certain class of applications
  - More scalable if less strict consistency requirements are imposed (NoSQL)
  - E.g., Facebook, Twitter, ...
[Figure: application instances and state spread across Server #1, ..., Server #n]
Locality
- For Internet-scale applications, state is distributed globally
- Data are typically moved to where usage is expected
  - E.g., German videos -> Germany

Distributed Applications
- Today's applications are required to perform efficiently on an Internet scale
- Scalability without changing the application
Exemplary Application
[Figure: example application. Labels: Ad Delivery, Users, User Behavior Mining, Web GUI, Friendship Graph, Application, Log Analysis, Statistics]

Internet-Scale Applications
- It's not about servers anymore
  - Data and data processing services are important
  - Issue: efficient distribution, processing, and storage of data
- Move away from fixed (n-tier) layering models
  - Dynamic composition of processing services (at runtime, continuous deployment)
  - Communication using messages
- Required: concepts for loosely coupled distributed applications
Message Queuing & Publish/Subscribe

Message Queuing
- Concept of message queues
  - Data processing services communicate asynchronously using message queues
  - Decouples data producers and consumers
[Figure: messages M1 and M2 waiting in a message queue]
Message Queuing
- Mode of operation
  - Producers dispatch messages into a queue
  - Consumers take messages from a queue
- Multiple queues for different purposes can be created
  - E.g., one for dispatching computational tasks, one for logging, ...
- Producers, consumers, and queues may run on different systems
[Figure: producer on Host #1, message queue holding M1 and M2 on Host #q, consumer on Host #n]

Message Queuing: Properties
- Distributed realization of the producer/consumer problem
  - I.e., thread synchronization
- Producer and consumer are only aware of the queue
  - Association can be changed (dynamically) without changing the application
- Producer and consumer don't need to be available at the same time
  - Queue provides time-decoupling
  - Allows asynchronous operation
[Figure: producer -> message queue (M1, M2) -> consumer]
Message Queuing: Properties
- Queues can have multiple consumers
  - E.g., to distribute messages (e.g., tasks) to a set of consumers
  - Sometimes called work queues
- Different message dispatching strategies possible (e.g., round robin, fairness, load-based, ...)
[Figure: producer -> message queue holding tasks T1 to T4 -> Worker #1, ..., Worker #n]

Delivery Semantics
- Best-effort
  - Messages are lost if no consumer is listening
- Guaranteed (acknowledged) delivery
  - On application layer (cf. DTN custody transfer)
- Persistent
  - Messages are persisted until consumed
- Transient (not persistent)
  - Messages are lost if the queue crashes
[Figure: producer -> message queue (M1, M2) -> consumer]
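The work-queue pattern from the "Message Queuing: Properties" slides can be illustrated with a small in-process sketch. It is not taken from the lecture: it uses Python's standard `queue.Queue` in place of a real message broker, and the names `run_work_queue`, `STOP`, and the squaring "task" are hypothetical. Tasks are enqueued before any worker exists, which mimics the time-decoupling property.

```python
import queue
import threading


def run_work_queue(tasks, n_workers=3):
    """Distribute tasks from one queue to several workers (a work queue)."""
    q = queue.Queue()
    results = {i: [] for i in range(n_workers)}
    STOP = object()  # hypothetical sentinel that tells a worker to finish

    def worker(worker_id):
        while True:
            task = q.get()
            if task is STOP:
                break
            # "Process" the task; squaring stands in for real work.
            results[worker_id].append(task * task)

    # The producer enqueues everything first: consumers need not be running yet.
    for task in tasks:
        q.put(task)
    for _ in range(n_workers):
        q.put(STOP)  # one sentinel per worker

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_work_queue([1, 2, 3, 4, 5, 6])
all_results = sorted(r for per_worker in results.values() for r in per_worker)
```

Which worker receives which task depends on thread scheduling, so only the combined result set is deterministic. A real broker would apply one of the dispatching strategies listed above instead.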
Message-Based RPC
- RPC mapped to request and response messages
- Correlation of request and response based on IDs
- Inherits delivery semantics from message queuing
[Figure: request with id = 123 -> request queue -> server -> response queue -> response with id = 123]

Publish/Subscribe Pattern (Pub/Sub)
- Builds on top of individual message queues
  - Publishers publish messages to an exchange
  - Consumers subscribe to an exchange
  - Exchanges route messages to queues based on subscriptions
  - A broker manages a set of exchanges and queues
- Distribution of published messages to subscribers
  - Broadcast (see figure), topic-based, or content-based (next slides)
[Figure: producer -> exchange inside a broker -> one queue per consumer, each receiving M1 and M2 -> Consumer #1, ..., Consumer #n]
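The ID-based correlation used by message-based RPC can be sketched as follows. This is an illustration under assumptions, not the lecture's implementation: two in-process `queue.Queue` objects stand in for the request and response queues, and `call_add` and the message fields `a`, `b`, and `result` are hypothetical names.

```python
import itertools
import queue
import threading

request_queue = queue.Queue()
response_queue = queue.Queue()
_ids = itertools.count(123)  # request IDs, starting at 123 as in the slide


def server():
    """Take requests from the request queue and answer on the response queue."""
    while True:
        msg = request_queue.get()
        if msg is None:  # hypothetical shutdown signal
            break
        # The response carries the same ID as the request.
        response_queue.put({"id": msg["id"], "result": msg["a"] + msg["b"]})


def call_add(a, b):
    """Send a request message and return its correlation ID."""
    req_id = next(_ids)
    request_queue.put({"id": req_id, "a": a, "b": b})
    return req_id


threading.Thread(target=server, daemon=True).start()
id_first = call_add(1, 2)
id_second = call_add(10, 20)

# The caller matches each response to its request via the ID.
responses = {}
for _ in range(2):
    msg = response_queue.get(timeout=5)
    responses[msg["id"]] = msg["result"]
request_queue.put(None)
```

Because the caller correlates by ID rather than by arrival order, the scheme still works when several servers consume from the same request queue and responses arrive out of order.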
Topic-Based Publish/Subscribe
- Consumers subscribe to an exchange with a certain topic
- Messages are annotated with a topic
  - Topics are arbitrary strings
- Exchange matches subscriptions and message topics and routes the messages to the corresponding queues
[Figure: producer publishes M1 with topic news.germany and M2 with topic news.greece; the exchange routes each message to the queues of the consumers (Consumer #1, ..., Consumer #n) subscribed to its topic]

Content-Based Publish/Subscribe
- Consumers subscribe to an exchange with content-based filters
- Exchange routes messages based on a message's contents
  - Typically structured data (e.g., JSON, XML)
- Most production systems use topic-based pub/sub
[Figure: producer publishes M1: { company: apple, price: 123$ }. One consumer subscribes with the filter "company = apple && 110 < price <" (upper bound missing in the source), Consumer #n with the filter "company = apple && price > 120"]
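Both routing styles described above can be sketched with one toy exchange. The `Exchange` class, its method names, and the use of plain lists as queues are assumptions made for illustration; real brokers expose different APIs. The filter `price > 120` comes from the slide, while the `price > 200` filter is a hypothetical addition to show a non-matching subscription.

```python
from collections import defaultdict


class Exchange:
    """Toy exchange that routes messages by exact topic or by content filter."""

    def __init__(self):
        self.topic_subs = defaultdict(list)  # topic -> list of consumer queues
        self.content_subs = []               # (predicate, consumer queue) pairs

    def subscribe_topic(self, topic, consumer_queue):
        self.topic_subs[topic].append(consumer_queue)

    def subscribe_content(self, predicate, consumer_queue):
        self.content_subs.append((predicate, consumer_queue))

    def publish(self, message, topic=None):
        # Topic-based routing: exact string match on the topic annotation.
        for consumer_queue in self.topic_subs.get(topic, []):
            consumer_queue.append(message)
        # Content-based routing: evaluate each filter on the message body.
        for predicate, consumer_queue in self.content_subs:
            if predicate(message):
                consumer_queue.append(message)


exchange = Exchange()
germany_queue, greece_queue = [], []
exchange.subscribe_topic("news.germany", germany_queue)
exchange.subscribe_topic("news.greece", greece_queue)
exchange.publish({"text": "M1"}, topic="news.germany")
exchange.publish({"text": "M2"}, topic="news.greece")

cheap_queue, pricey_queue = [], []
exchange.subscribe_content(
    lambda m: m.get("company") == "apple" and m.get("price", 0) > 120, cheap_queue)
exchange.subscribe_content(
    lambda m: m.get("company") == "apple" and m.get("price", 0) > 200, pricey_queue)
exchange.publish({"company": "apple", "price": 123})
```

Content-based routing has to evaluate every filter against every message, whereas topic routing is a dictionary lookup. That cost difference is one plausible reason for the slide's remark that most production systems use topic-based pub/sub.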
Internet-Scale Applications and Pub/Sub
- Centralized brokers are a single point of failure and a bottleneck for scalable applications
- Brokers can be clustered
  - To increase availability and throughput
  - Requires inter-broker message routing and a communication protocol
  - Creates distributed (replicated, partitioned) subscription state and all associated issues
[Figure: a single broker with one exchange and its queues, replaced by a cluster of brokers]

Broker Clustering
- Producers and consumers may attach to any broker of the cluster
- How are messages routed?
- Important goal: efficiency (throughput, bandwidth, latency, consistency, processing cost)
[Figure: cluster of brokers with producers and consumers attached to different brokers; the route between brokers is marked "?"]
Topic-Based Pub/Sub and Clustering
- Brokers compute a routing tree
  - Based on upstream consumer subscriptions
  - Subscription filters are aggregated along the path
[Figure: upstream filter aggregation. Consumer #1, Consumer #2, ..., Consumer #n subscribe to news.eu.greece and news.eu.germany at their brokers; a broker whose consumers want only news.eu.germany forwards that filter upstream, while a broker with both subscriptions forwards the aggregate news.eu.*; the producer publishes news.eu.germany]

Implementations & Co.
- Pub/Sub implementations
  - RabbitMQ, Apache ActiveMQ, Amazon Simple Queue Service (Amazon SQS), Apache Qpid, IBM WebSphere MQ, Apache Kafka, Kestrel, ...
- Protocols
  - Advanced Message Queuing Protocol (AMQP), ZeroMQ Message Transport Protocol (ZMTP), Streaming Text Oriented Messaging Protocol (STOMP), XMPP publish-subscribe extension, ...
- Others
  - Java Message Service (JMS), ...
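The upstream filter aggregation from the clustering slide can be sketched in a few lines. The semantics are an assumption, since the slides do not define them: `*` is taken to match exactly one dot-separated segment, and filters that differ only in their last segment are merged into one wildcard filter. The helper names `matches` and `aggregate` are hypothetical.

```python
def matches(pattern, topic):
    """Return True if topic matches pattern; '*' matches exactly one segment."""
    p, t = pattern.split("."), topic.split(".")
    return len(p) == len(t) and all(a == "*" or a == b for a, b in zip(p, t))


def aggregate(filters):
    """Merge filters sharing a prefix into a single '<prefix>.*' filter."""
    by_prefix = {}
    for f in set(filters):
        prefix, _, last = f.rpartition(".")
        by_prefix.setdefault(prefix, set()).add(last)
    merged = []
    for prefix, lasts in by_prefix.items():
        # Keep a lone filter as-is; replace several siblings by a wildcard.
        last = next(iter(lasts)) if len(lasts) == 1 else "*"
        merged.append(f"{prefix}.{last}" if prefix else last)
    return sorted(merged)


# A broker whose consumers all want the same topic forwards it unchanged.
upstream_a = aggregate(["news.eu.germany", "news.eu.germany"])
# A broker with mixed subscriptions forwards one aggregated filter.
upstream_b = aggregate(["news.eu.germany", "news.eu.greece"])
```

Aggregation trades precision for smaller routing state: after merging, a broker may also receive topics under `news.eu` that none of its consumers subscribed to and must filter them locally.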
Pub/Sub: Conclusion
- Advantages
  - Scalability / clustering / federation
  - Loose coupling
    - Publisher and subscriber are unaware of each other
    - Flexibility and extensibility (add new components at runtime, ...)
    - Disconnected operation
- Disadvantages
  - Loose coupling: hard to give guarantees
    - What happens if a message is stuck in a queue forever?
    - E.g., because there is no consumer
  - Brokers are a centralized bottleneck
    - What about efficient and Internet-scale distributed brokers?

Map / Reduce
Map / Reduce
- Goal: distributed processing of big data sets
  - Typically too large to be processed on a single machine
  - Moving all data would be too costly (time/bandwidth)
- Map / Reduce: framework for processing embarrassingly parallel problems on a large number of computers
  - Initially developed by Google (2004)
  - Well-known open-source implementation: Apache Hadoop, ...

Parallels to Parallel Programming
- Programming paradigm to
  - Split up a program into several tasks
  - Process these tasks on different machines in parallel
  - Requires that the algorithm is parallelizable
- Realization in Google's Map/Reduce
  - Program is split onto n machines
  - 1 master
    - Splits up the full task into subtasks
    - Assigns subtasks to workers
    - Receives results from workers
  - n-1 workers
    - Receive jobs and data from the master
    - Return results to the master
Example: Approximating π (1)
- Area of the square: $A_s = (2r)^2 = 4r^2$
- Area of the circle: $A_c = \pi r^2$
- Ratio: $\frac{A_s}{A_c} = \frac{4}{\pi}$, hence $\pi = \frac{4 A_c}{A_s}$
- Approximation steps
  1. Randomly generate $k_s$ points in the square
  2. Count the number $k_c$ of points located in the circle
  3. Approximate $\pi \approx \frac{4 k_c}{k_s}$
[Figure: circle of radius r inscribed in a square]

Example: Approximating π (2)
- Accuracy increases with larger $k_s$
- Step 2 (check whether a point is in the circle) is parallelizable
- Execution
  - Master generates $k_s$ random points in the square
  - Each of the n workers $w_i$ ($1 \le i \le n$) is assigned a list of points
  - Each worker counts the number of points in the received list that are located in the circle ($k_{i,c}$ for worker $w_i$) and gives the result to the master
  - Master computes $\pi \approx \frac{4}{k_s} \sum_{i=1}^{n} k_{i,c}$
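The master/worker execution above can be sketched as follows. This is a minimal single-process illustration: the workers are simulated by a list comprehension rather than run on separate machines, the radius is assumed to be r = 1 with the square centred at the origin, and all function names are hypothetical.

```python
import math
import random


def generate_points(k_s, rng):
    """Master: generate k_s random points in the square [-1, 1] x [-1, 1]."""
    return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(k_s)]


def count_in_circle(points):
    """Worker: count the points that lie inside the unit circle (k_{i,c})."""
    return sum(1 for x, y in points if x * x + y * y <= 1.0)


def approximate_pi(k_s, n_workers, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    points = generate_points(k_s, rng)
    # Master assigns every worker w_i its own list of points.
    chunks = [points[i::n_workers] for i in range(n_workers)]
    # Each call is independent of the others, so it could run in parallel.
    counts = [count_in_circle(chunk) for chunk in chunks]
    # Master combines the partial counts: pi ~ (4 / k_s) * sum_i k_{i,c}.
    return 4 * sum(counts) / k_s


pi_estimate = approximate_pi(200_000, n_workers=4)
error = abs(pi_estimate - math.pi)
```

To distribute the counting for real, the list comprehension over `chunks` could be replaced by something like `multiprocessing.Pool.map`; the combination step in the master stays the same.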
Parallelizing Tasks (1)
- It is an open problem whether all algorithms solving problems in the classes P or NP are parallelizable
- There are many such sequential algorithms without a known parallel equivalent
- Many scientists assume that there is no general parallelizability property

Parallelizing Tasks (2)
- Obviously parallelizable tasks
  - Brute-force attack on encrypted documents
  - Counting word occurrences in several documents
- Not (obviously) parallelizable tasks
  - Computing the Fibonacci function: $f_{k+2} = f_{k+1} + f_k$
  - In general: tasks where the next step depends on the results of the previous step(s)
Map / Reduce
- Two or three steps
  - Map
  - Combine (optional)
  - Reduce
- Mode of operation
  - Input data is either distributed (e.g., server log files) or split into multiple parts (e.g., single file)
  - Map preprocesses input data to an intermediate format
  - Framework dispatches intermediate results to reducers (see next slide)
  - Reduce combines intermediate results to final results
[Figure: Map, Dispatch, and Reduce stages]

[Figure: data flow of a word count]
- Input data (file, callback, ...): "The lecturer is the best. And this is written by research assistants. Yada, yada, yada, ..."
- Input split: "The lecturer is the best"
- Map output: [ (the, 1), (lecturer, 1), (is, 1), (the, 1), (best, 1) ]
- Intermediate results (partitioned): [ (the, [1, 1]) ]
- Reduce output: [ (the, [2]) ]
- Final result: [ (the, 2), (lecturer, 1), (is, 1), (best, 1) ]
- Layout: the input data is divided into Split 1, Split 2, and Split 3; each split is processed by a Map task that writes Partition 1, Partition 2, ..., Partition n; reducer Red i (Red 1, Red 2, ..., Red n) reads partition i from every Map task
- Map: $K \times V \to (L \times W)^*$
  - K and L: sets containing keys
  - V and W: sets containing values
  - All elements in a set are of the same data type (e.g., String)
  - Maps a key/value pair to a list of key/value pairs
- Intermediate results
  - Partitioned into n partitions
    - E.g., partition = hash(key) mod n
  - Transforms $(L \times W)^* \to L \times W^*$
    - E.g., (the, 1), (the, 1) -> (the, [1, 1])
  - Handled by the framework
- Reduce: $L \times W^* \to W^*$
  - E.g., (the, [1, 1]) -> 2
  - Framework saves the result as (the, 2)
[Figure: input data (file, callback, ...) divided into Split 1 to Split 3, each mapped into Partition 1 to Partition n, which are read by reducers Red 1 to Red n]

Example Map and Reduce Functions
Input document: "The lecturer is the best"

    /* key: document name, value: document content */
    map(String key, String value) {
        // Lower-case first so that "The" and "the" count as the same word.
        for (String w : value.toLowerCase().split(" ")) {
            EmitIntermediate(w, 1);
        }
    }

    /* key: a word, values: list of intermediate results */
    reduce(String key, Iterable<Integer> values) {
        int result = 0;
        for (int value : values) {
            result += value;
        }
        Emit(result);
    }

- Map: [(the, 1), (lecturer, 1), (is, 1), (the, 1), (best, 1)]
- Intermediate: [(the, [1, 1]), (lecturer, [1]), (is, [1]), (best, [1])]
- Reduce: [(the, [2]), (lecturer, [1]), (is, [1]), (best, [1])]
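The three stages formalised above (map, framework-side partitioning and grouping, reduce) can be put together in a small runnable sketch. It is a single-process illustration, not Hadoop's or Google's API: `map_fn`, `shuffle`, `reduce_fn`, and `map_reduce` are hypothetical names, and `zlib.crc32` is assumed as the hash function so that partition numbers are stable between runs.

```python
import zlib
from collections import defaultdict


def map_fn(key, value):
    """Map: (document name, content) -> list of (word, 1) pairs."""
    return [(word, 1) for word in value.lower().split()]


def partition(key, n):
    """partition = hash(key) mod n, with CRC32 as a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % n


def shuffle(pairs, n):
    """Framework step: (L x W)* -> L x W*, grouped per partition."""
    partitions = [defaultdict(list) for _ in range(n)]
    for key, value in pairs:
        partitions[partition(key, n)][key].append(value)
    return partitions


def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> total count."""
    return sum(values)


def map_reduce(documents, n_reducers=2):
    intermediate = []
    for name, content in documents.items():
        intermediate.extend(map_fn(name, content))
    result = {}
    # Each partition would be handled by its own reducer in a real cluster.
    for part in shuffle(intermediate, n_reducers):
        for key, values in part.items():
            result[key] = reduce_fn(key, values)
    return result


counts = map_reduce({"doc1": "The lecturer is the best"})
grouped = shuffle(map_fn("doc1", "The lecturer is the best"), 1)[0]
```

Because every occurrence of a key hashes to the same partition, each reducer sees the complete value list for its keys and never needs to coordinate with other reducers.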
Distributed Stream Processing

Distributed Stream Processing
- Pub/Sub
  - Moving messages between data processing services efficiently
- Map/Reduce
  - Move data processing services to the data
  - Limitation: batch processing, only for sufficiently parallelizable jobs
- Distributed Real-Time Stream Processing
  - Mixture of Pub/Sub and Map/Reduce
  - Guaranteed data processing
  - No intermediate message brokers
  - Goal: create graphs of data sources and data processors
[Figure: continuous streams]
Distributed Stream Processing
- Continuous real-time processing of data streams
  - Batch processing in map/reduce
    - Eventually emits a final result
  - Continuous calculation of results
    - Runs forever or until stopped
    - Produces new results over time
- Scalability / Parallelism
  - Applications consist of parallelizable data processing services
  - No centralized component (brokers); components exchange data directly
  - Applications can be distributed globally (Internet-scale)
[Figure: continuous streams]

Example: Storm Framework [storm]
- Implementation of a distributed stream processing framework
  - Open-source project from Twitter
  - Used by many companies (e.g., Twitter, Groupon) to run their business logic
- Set of data sources and data processing services
  - Connections defined by a graph in which streams are edges
[Figure: graph of data sources feeding data processing services]
Storm Streams and Tuples
- Stream: unbounded sequence of tuples
- Tuple: named list of values
[Figure: a stream drawn as a sequence of tuples]

Storm Spouts
- A spout is a source of a stream
- E.g., a spout can read data and emit it as tuples from a
  - File
  - Message queue
  - Web service
  - RSS feed
[Figure: a spout emitting two streams of tuples]
Storm Bolt
- Bolts process input streams and produce new streams
- Bolts can implement arbitrary logic, e.g.,
  - Transformation functions
  - Filters
  - Aggregation functions
  - Joins
  - Database access
[Figure: a bolt consuming streams of tuples and emitting a new stream]

Storm Topology
- A topology is a network of spouts and bolts
- Spouts and bolts are implemented in a parallelizable manner
  - I.e., multiple instances can be spawned
  - The framework executes them in parallel on a Storm cluster as tasks
  - [...]s communicate using message queues
[Figure: a topology of spouts and bolts]
Storm Example: Word Count
- Count words in a continuous stream of data and emit a new word count for each word as soon as it changes
- Topology
  - A spout reads sentences from some data source and emits them
  - One bolt splits sentences into individual words
  - Another bolt keeps track of the number of occurrences of individual words and emits the word and count on change
[Figure: source (e.g., message queue) holding M1, M2, M3 -> spout "sentences" emits (sentence, "The lecturer is the") -> bolt "split" emits (word, "the"), (word, "lecturer"), (word, "is"), (word, "the") -> bolt "count" emits (the, 1), (lecturer, 1), (is, 1), (the, 2)]

Storm Example: Word Count
- Framework runs multiple instances of spout and bolt implementations in parallel
- To guarantee unambiguous results, groupings can be defined
- Groupings
  - Shuffle: randomly picks a task
  - Field: makes sure one individual word always goes to the same task
[Figure: source (e.g., message queue) holding M1, M2, M3 -> "sentences" spouts -> shuffle grouping -> "split" bolts emitting (word, "lecturer"), (word, "the") and (word, "car"), (word, "drives") -> field grouping on word -> "count" bolts emitting (the, 1), (lecturer, 1), (the, 2), (is, 1) and (car, 1), (drives, 1)]
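The word-count topology above can be imitated with chained Python generators. This is a sketch of the data flow only: it does not use Storm's API, it runs a single instance of each component, and the names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical.

```python
def sentence_spout(sentences):
    """Spout: emit one (sentence, text) tuple per input sentence."""
    for text in sentences:
        yield ("sentence", text)


def split_bolt(stream):
    """Bolt: split each sentence into (word, w) tuples."""
    for _, text in stream:
        for word in text.lower().split():
            yield ("word", word)


def count_bolt(stream):
    """Bolt: keep running counts and emit (word, count) on every change."""
    counts = {}
    for _, word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


# Wire the components together like the edges of a topology.
topology = count_bolt(split_bolt(sentence_spout(["The lecturer is the"])))
emitted = list(topology)
```

Generators process one tuple at a time and never need the whole input, which mirrors the "unbounded sequence of tuples" definition of a stream. With an endless spout the pipeline would keep emitting updated counts.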
Storm Stream Groupings
- Stream groupings define to which task a tuple is sent when it is emitted
- Examples
  - Shuffle & Field (previous slide)
  - All: replicates to all tasks
  - Global: task with lowest ID
  - None: don't care
  - Direct: emitter defines receiver

Conclusion
- Humans create exponentially growing amounts of data
  - Big data challenge
- Expressing data processing algorithms using functional primitives allows efficient scaling and distribution of computations on large (distributed) data sets
  - Precondition: algorithm must be sufficiently parallelizable
- New paradigm to design applications
  - Not only for big but also for small data
  - Scale from start-up to global player
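The stream groupings listed above can each be expressed as a function that maps a tuple to the task indices receiving it. This is an illustrative sketch, not Storm's implementation: tuples are modelled as dictionaries, tasks as indices 0 to n-1, and `zlib.crc32` is an assumed stand-in for whatever hash a real framework uses.

```python
import random
import zlib


def shuffle_grouping(tup, n_tasks, rng):
    """Shuffle: pick a random task."""
    return [rng.randrange(n_tasks)]


def field_grouping(tup, n_tasks, field):
    """Field: equal field values always map to the same task."""
    value = str(tup[field]).encode("utf-8")
    return [zlib.crc32(value) % n_tasks]


def all_grouping(tup, n_tasks):
    """All: replicate the tuple to every task."""
    return list(range(n_tasks))


def global_grouping(tup, n_tasks):
    """Global: always the task with the lowest ID."""
    return [0]


rng = random.Random(42)
n_tasks = 3
first = field_grouping({"word": "the"}, n_tasks, "word")
second = field_grouping({"word": "the"}, n_tasks, "word")
shuffled = shuffle_grouping({"word": "the"}, n_tasks, rng)
```

Field grouping is what keeps the word count correct under parallelism: if "the" could reach two different count tasks, each would hold only a partial count.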
Literature
- [mapreduce] Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters.
- [storm] Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
- [akka] Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.
Domain driven design, NoSQL and multi-model databases
Domain driven design, NoSQL and multi-model databases Java Meetup New York, 10 November 2014 Max Neunhöffer www.arangodb.com Max Neunhöffer I am a mathematician Earlier life : Research in Computer Algebra
Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
NoSQL and Hadoop Technologies On Oracle Cloud
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Hadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
FIVE SIGNS YOU NEED HTML5 WEBSOCKETS
FIVE SIGNS YOU NEED HTML5 WEBSOCKETS A KAAZING WHITEPAPER Copyright 2011 Kaazing Corporation. All rights reserved. FIVE SIGNS YOU NEED HTML5 WEBSOCKETS A KAAZING WHITEPAPER HTML5 Web Sockets is an important
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
L1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
CSE-E5430 Scalable Cloud Computing Lecture 11
CSE-E5430 Scalable Cloud Computing Lecture 11 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 30.11-2015 1/24 Distributed Coordination Systems Consensus
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Limitations of Packet Measurement
Limitations of Packet Measurement Collect and process less information: Only collect packet headers, not payload Ignore single packets (aggregate) Ignore some packets (sampling) Make collection and processing
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Advanced Data Management Technologies
ADMT 2015/16 Unit 15 J. Gamper 1/53 Advanced Data Management Technologies Unit 15 MapReduce J. Gamper Free University of Bozen-Bolzano Faculty of Computer Science IDSE Acknowledgements: Much of the information
Distributed File Systems
Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.
Parallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai [email protected] MapReduce is a parallel programming model
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
Big Systems, Big Data
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
Scalable Architecture on Amazon AWS Cloud
Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies [email protected] 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect
Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus. 2010 IBM Corporation
Tomáš Müller IT Architekt 21/04/2010 ČVUT FEL: SOA & Enterprise Service Bus Agenda BPM Follow-up SOA and ESB Introduction Key SOA Terms SOA Traps ESB Core functions Products and Standards Mediation Modules
INTRODUCING APACHE IGNITE An Apache Incubator Project
WHITE PAPER BY GRIDGAIN SYSTEMS FEBRUARY 2015 INTRODUCING APACHE IGNITE An Apache Incubator Project COPYRIGHT AND TRADEMARK INFORMATION 2015 GridGain Systems. All rights reserved. This document is provided
Lecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
Data-Intensive Computing with Map-Reduce and Hadoop
Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan [email protected] Abstract Every day, we create 2.5 quintillion
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Big Data Storage Architecture Design in Cloud Computing
Big Data Storage Architecture Design in Cloud Computing Xuebin Chen 1, Shi Wang 1( ), Yanyan Dong 1, and Xu Wang 2 1 College of Science, North China University of Science and Technology, Tangshan, Hebei,
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON
BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON Overview * Introduction * Multiple faces of Big Data * Challenges of Big Data * Cloud Computing
NOT IN KANSAS ANY MORE
NOT IN KANSAS ANY MORE How we moved into Big Data Dan Taylor - JDSU Dan Taylor Dan Taylor: An Engineering Manager, Software Developer, data enthusiast and advocate of all things Agile. I m currently lucky
Apache Kafka Your Event Stream Processing Solution
01 0110 0001 01101 Apache Kafka Your Event Stream Processing Solution White Paper www.htcinc.com Contents 1. Introduction... 2 1.1 What are Business Events?... 2 1.2 What is a Business Data Feed?... 2
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
A standards-based approach to application integration
A standards-based approach to application integration An introduction to IBM s WebSphere ESB product Jim MacNair Senior Consulting IT Specialist [email protected] Copyright IBM Corporation 2005. All rights
