Storm. Distributed and fault-tolerant realtime computation. Nathan Marz Twitter
1 Storm Distributed and fault-tolerant realtime computation Nathan Marz Twitter
2 Basic info: Open sourced September 19th; implementation is 15,000 lines of code; used by over 25 companies; >2400 watchers on GitHub (most watched JVM project); very active mailing list (>1800 messages, >560 members)
3 Before Storm Queues Workers
4 Example (simplified)
5 Example Workers schemify tweets and append to Hadoop
6 Example Workers update statistics on URLs by incrementing counters in Cassandra
7 Scaling Deploy Reconfigure/redeploy
8 Problems Scaling is painful Poor fault-tolerance Coding is tedious
9 What we want Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing Just works
10 Storm Guaranteed data processing Horizontal scalability Fault-tolerance No intermediate message brokers! Higher level abstraction than message passing Just works
11 Use cases Stream processing Distributed RPC Continuous computation
12 Storm Cluster
13 Storm Cluster Master node (similar to Hadoop JobTracker)
14 Storm Cluster Used for cluster coordination
15 Storm Cluster Run worker processes
16 Starting a topology
17 Killing a topology
18 Concepts Streams Spouts Bolts Topologies
19 Streams Unbounded sequence of tuples
20 Spouts Source of streams
21 Spout examples Read from Kestrel queue Read from Twitter streaming API
22 Bolts Processes input streams and produces new streams
23 Bolts Functions Filters Aggregation Joins Talk to databases
24 Topology Network of spouts and bolts
25 Tasks Spouts and bolts execute as many tasks across the cluster
26 Task execution Tasks are spread across the cluster
27 Task execution Tasks are spread across the cluster
28 Stream grouping When a tuple is emitted, which task does it go to?
29 Stream grouping Shuffle grouping: pick a random task; Fields grouping: mod hashing on a subset of tuple fields; All grouping: send to all tasks; Global grouping: pick task with lowest id
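The four groupings on this slide can be sketched as small task-selection functions. This is a minimal illustration in plain Python, not Storm's implementation: the function names, the dict-shaped tuple, and the use of `zlib.crc32` as the hash are all assumptions made for the example.

```python
import random
import zlib


def shuffle_grouping(num_tasks, rng=random):
    """Shuffle grouping: pick one target task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, fields, num_tasks):
    """Fields grouping: hash the chosen fields, then take the hash mod the task count."""
    key = "|".join(str(tup[f]) for f in fields)
    # crc32 is an assumed stand-in hash; Python's built-in hash() is salted
    # per process for strings, so it would not be stable across runs.
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(num_tasks):
    """All grouping: every task receives a copy of the tuple."""
    return list(range(num_tasks))


def global_grouping(task_ids):
    """Global grouping: the task with the lowest id receives every tuple."""
    return [min(task_ids)]
```

With fields grouping, two tuples that agree on the selected fields always land on the same task, which is what makes per-key state such as a word count possible.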
30 Topology (diagram; edge labels: shuffle, [ id1, id2 ], shuffle, [ url ], shuffle, all)
31 Streaming word count TopologyBuilder is used to construct topologies in Java
32 Streaming word count Define a spout in the topology with parallelism of 5 tasks
33 Streaming word count Split sentences into words with parallelism of 8 tasks
34 Streaming word count Consumer decides what data it receives and how it gets grouped. Split sentences into words with parallelism of 8 tasks.
35 Streaming word count Create a word count stream
36 Streaming word count splitsentence.py
37 Streaming word count
38 Streaming word count Submitting topology to a cluster
39 Streaming word count Running topology in local mode
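The code shown on the word count slides is not part of this transcript. As a rough stand-in, the data flow they describe (split sentences into words, then route each word to a counting task by a fields grouping on the word) can be simulated in plain Python. Nothing here uses the Storm API; the function names, the `crc32` hash, and the count-task parallelism of 4 are assumptions for illustration.

```python
import zlib
from collections import Counter


def split_sentence(sentence):
    """Split step: emit one word per whitespace-separated token."""
    return sentence.split()


def task_for(word, num_tasks):
    """Fields grouping on the word: the same word always maps to the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_word_count(sentences, num_count_tasks=4):
    """Feed sentences through split and count; return one Counter per count task."""
    counters = [Counter() for _ in range(num_count_tasks)]
    for sentence in sentences:
        for word in split_sentence(sentence):
            counters[task_for(word, num_count_tasks)][word] += 1
    return counters


def merged(counters):
    """Combine the per-task counts into a single view, for inspection only."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total
```

Because the grouping is on the word, each word's running count lives in exactly one task, so no cross-task coordination is needed to keep it correct.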
40 Demo
41 Distributed RPC Data flow for Distributed RPC
42 DRPC Example Computing reach of a URL on the fly
43 Reach Reach is the number of unique people exposed to a URL on Twitter
44 Computing reach (diagram: URL → Tweeters → Followers → Distinct followers → Count → Reach)
45 Reach topology
46 Reach topology
47 Reach topology
48 Reach topology Keep set of followers for each request id in memory
49 Reach topology Update the followers set when a new follower is received
50 Reach topology Emit partial count after receiving all followers for a request id
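The reach computation described on these slides can be sketched for a single request as follows. This is a plain Python simulation under stated assumptions, not the topology code from the talk: the in-memory `TWEETERS` and `FOLLOWERS` dicts are hypothetical stand-ins for database lookups, and grouping followers by a `crc32` hash is an assumed stand-in for a fields grouping on the follower.

```python
import zlib
from collections import defaultdict

# Hypothetical sample data standing in for the tweeter and follower databases.
TWEETERS = {"url-a": ["sally", "bob"]}
FOLLOWERS = {"sally": ["tim", "alice", "ann"], "bob": ["alice", "carl"]}


def compute_reach(url, tweeters_db, followers_db, num_tasks=3):
    """Reach of a URL: the number of distinct followers of everyone who tweeted it."""
    # One follower set per task. A real topology would key these sets by
    # request id as well; this sketch handles a single request.
    partial_sets = defaultdict(set)
    for tweeter in tweeters_db.get(url, []):
        for follower in followers_db.get(tweeter, []):
            task = zlib.crc32(follower.encode("utf-8")) % num_tasks
            partial_sets[task].add(follower)  # update the set on each new follower
    # Each task emits a partial count once it has seen all of its followers.
    partial_counts = [len(followers) for followers in partial_sets.values()]
    # The final step sums the partial counts.
    return sum(partial_counts)
```

Summing the partial counts gives the number of distinct followers because a given follower always hashes to the same task, so the per-task sets never overlap.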
51 Demo
52 Guaranteeing message processing Tuple tree
53 Guaranteeing message processing A spout tuple is not fully processed until all tuples in the tree have been completed
54 Guaranteeing message processing If the tuple tree is not completed within a specified timeout, the spout tuple is replayed
55 Guaranteeing message processing Reliability API
56 Guaranteeing message processing Anchoring creates a new edge in the tuple tree
57 Guaranteeing message processing Marks a single node in the tree as complete
58 Guaranteeing message processing Storm tracks tuple trees for you in an extremely efficient way
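The slide does not say how the tracking is done. One constant-space technique, which Storm's public documentation describes for its acker tasks, is to keep a single checksum per spout tuple and XOR every tuple id into it once when the tuple is emitted and once when it is acked. The class below is a minimal sketch of that idea only; the class and method names are hypothetical.

```python
class AckTracker:
    """Tracks one tuple tree with a single integer instead of storing the tree."""

    def __init__(self):
        self.ack_val = 0

    def emitted(self, tuple_id):
        # Anchoring a new tuple into the tree XORs its id in.
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        # Acking the same tuple XORs its id in again, cancelling it out.
        self.ack_val ^= tuple_id

    def is_complete(self):
        # Every id has been seen exactly twice, so everything has cancelled.
        return self.ack_val == 0
```

For example, after emitting ids 1, 2, and 4 and acking only 1, the checksum is 6 and the tree is still pending; acking 2 and 4 brings it back to 0. With random 64-bit ids, the chance of the checksum reaching 0 before the tree is really finished is very small, and memory per spout tuple stays constant regardless of tree size.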
59 Transactional topologies How do you do idempotent counting with an at-least-once delivery guarantee?
60 Transactional topologies Won't you overcount?
61 Transactional topologies Transactional topologies solve this problem
62 Transactional topologies Built completely on top of Storm's primitives of streams, spouts, and bolts
63 Transactional topologies Batch 1 Batch 2 Batch 3 Process small batches of tuples
64 Transactional topologies Batch 1 Batch 2 Batch 3 If a batch fails, replay the whole batch
65 Transactional topologies Batch 1 Batch 2 Batch 3 Once a batch is completed, commit the batch
66 Transactional topologies Batch 1 Batch 2 Batch 3 Bolts can optionally be committers
67 Transactional topologies Commit 1 Commit 1 Commit 2 Commit 3 Commit 4 Commit 4 Commits are ordered. If there's a failure during commit, the whole batch + commit is retried
68 Example
69 Example New instance of this object for every transaction attempt
70 Example Aggregate the count for this batch
71 Example Only update database if transaction ids differ
72 Example This enables idempotency since commits are ordered
73 Example (Credit goes to Kafka devs for this trick)
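The example code on these slides is not in the transcript. The trick they describe (store the transaction id next to the count, and skip the update when the ids match) can be sketched in plain Python as below. `CountStore` is a hypothetical in-memory stand-in for the database row, not an API from Storm or Kafka.

```python
class CountStore:
    """Stand-in for a database row holding the count and the last committed txid."""

    def __init__(self):
        self.count = 0
        self.txid = None

    def commit(self, txid, batch_count):
        """Apply a batch's count unless this transaction id was already applied."""
        if self.txid == txid:
            return False  # replayed commit of an applied batch: do nothing
        self.count += batch_count
        self.txid = txid
        return True


def batch_count(batch):
    """Aggregate step: the count contributed by one batch of tuples."""
    return len(batch)
```

Comparing against only the most recent txid is enough because commits are ordered: a retried commit can only ever be a retry of the latest one. It also depends on a replayed batch containing exactly the same tuples, so the same txid always carries the same count.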
74 Transactional topologies Multiple batches can be processed in parallel, but commits are guaranteed to be ordered
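One way to picture "processed in parallel, committed in order" is a small buffer that holds finished batches until every earlier transaction has committed. This is an illustrative sketch only, not the coordinator from the talk; the class name and its attributes are hypothetical.

```python
class OrderedCommitter:
    """Buffers batches that finish out of order and commits them in txid order."""

    def __init__(self, first_txid=1):
        self.next_txid = first_txid  # the only transaction allowed to commit next
        self.ready = {}              # finished batches waiting for their turn
        self.committed = []          # (txid, result) pairs in commit order

    def batch_finished(self, txid, result):
        """Record a finished batch, then commit as many in-order batches as possible."""
        self.ready[txid] = result
        while self.next_txid in self.ready:
            self.committed.append((self.next_txid, self.ready.pop(self.next_txid)))
            self.next_txid += 1
```

If batches 2 and 3 finish before batch 1, nothing is committed until batch 1 arrives, at which point all three commit in order.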
75 Transactional topologies Will be available in the next version of Storm (0.7.0). Requires a source queue that can replay identical batches of messages. storm-kafka has a transactional spout implementation for Kafka.
76 Storm UI
77 Storm on EC2 One-click deploy tool
78 Starter code Example topologies
79 Documentation
80 Ecosystem: Scala, JRuby, and Clojure DSLs; Kestrel, AMQP, JMS, and other spout adapters; serializers; multilang adapters; Cassandra, MongoDB integration
81 Questions?
82 Future work State spout Storm on Mesos Swapping Auto-scaling Higher level abstractions
83 Implementation KafkaTransactionalSpout
84 Implementation (diagram; stream edges labeled "all")
85 Implementation TransactionalSpout is a subtopology consisting of a spout and a bolt
86 Implementation The spout consists of one task that coordinates the transactions
87 Implementation The bolt emits the batches of tuples
88 Implementation The coordinator emits a batch stream and a commit stream
89 Implementation Batch stream
90 Implementation Commit stream
91 Implementation Coordinator reuses tuple tree framework to detect success or failure of batches or commits and replays appropriately