Chapter 5: Stream Processing. Big Data Management and Analytics 193
Today's Lesson
- Data Streams & Data Stream Management System
- Data Stream Models
    - Insert-Only
    - Insert-Delete
    - Additive
- Streaming Methods
    - Sliding Windows & Ageing
    - Data Synopsis
- Concepts & Tools
    - Micro-Batching with Apache Spark Streaming
    - Real-time with Apache Storm

Data Streams
Definition: A data stream can be seen as a continuous and potentially infinite stochastic process in which events occur independently of one another.
- Huge amount of data, so the data objects cannot all be stored
- Each element can be read only once (single scan)

Data Streams: Key Characteristics
- The data elements in the stream arrive online.
- The system has no control over the order in which data elements arrive (either within a data stream or across multiple data streams).
- Data streams are potentially unbounded in size.
- Once an element has been processed, it is discarded or archived.

Data Stream Management System
[Figure: data streams arrive over time at a stream processor, which answers standing queries and ad-hoc queries, emits output streams, and uses limited working storage plus archival storage.]

Data Stream Models: Insert-Only Model
- Once an element is seen, it cannot be changed.
[Figure: stream processor shown at two points in time]

Data Stream Models: Insert-Delete Model
- Elements can be deleted or updated.
[Figure: stream processor shown at two points in time]

Data Stream Models: Additive Model
- Each element is an increment to the previous version of the given data object.
[Figure: stream processor shown at two points in time]
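
The three stream models can be contrasted with a small sketch. The update formats used here (plain items, `(op, key, value)` triples, and `(key, delta)` pairs) and all function names are assumptions made for illustration; the slides do not prescribe a representation.

```python
from collections import defaultdict


def apply_insert_only(state, updates):
    """Insert-only model: every element is appended and never changed."""
    for item in updates:
        state.append(item)
    return state


def apply_insert_delete(state, updates):
    """Insert-delete model: elements may be set (inserted/updated) or deleted.

    Each update is a hypothetical (op, key, value) triple, op in {"set", "del"}.
    """
    for op, key, value in updates:
        if op == "set":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state


def apply_additive(state, updates):
    """Additive model: each (key, delta) is an increment to the stored value."""
    for key, delta in updates:
        state[key] += delta
    return state


log = apply_insert_only([], [3, 2])
table = apply_insert_delete(
    {}, [("set", "a", 3), ("set", "b", 2), ("set", "a", 4), ("del", "b", None)]
)
totals = apply_additive(defaultdict(int), [("a", 2), ("a", 3)])
```

In the additive case only the running value per object is kept, which is why this model combines well with compact summaries.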

Streaming Methods
Huge amount of data vs. limited resources in space: it is impractical to store all data.
Solutions:
- Storing summaries of previously seen data
- Forgetting stale data
But: there is a trade-off between storage space and the ability to provide precise query answers.

Streaming Methods: Sliding Windows (timestamp-based)
Idea: Keep the most recent stream elements in main memory and discard older ones.
Timestamp-based: the window length and the sliding interval are measured in time units.
[Figure: data stream with window length and sliding interval marked]

Streaming Methods: Sliding Windows (sequence-based)
Idea: Keep the most recent stream elements in main memory and discard older ones.
Sequence-based: the window length and the sliding interval are measured in number of elements.
[Figure: data stream with window length and sliding interval marked]
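
A minimal sketch of both window types follows. The class names are hypothetical, and the timestamp-based variant assumes that elements arrive in non-decreasing timestamp order. The sliding interval (how often a result is emitted) is left out to keep the example short.

```python
from collections import deque


class SequenceWindow:
    """Sequence-based window: keeps the last `length` elements."""

    def __init__(self, length):
        # A bounded deque drops the oldest element automatically.
        self.items = deque(maxlen=length)

    def add(self, item):
        self.items.append(item)

    def contents(self):
        return list(self.items)


class TimestampWindow:
    """Timestamp-based window: keeps elements newer than `now - length`."""

    def __init__(self, length):
        self.length = length
        self.items = deque()

    def add(self, timestamp, item):
        self.items.append((timestamp, item))
        # Evict everything that fell out of the window (timestamps ascending).
        while self.items and self.items[0][0] <= timestamp - self.length:
            self.items.popleft()

    def contents(self):
        return [item for _, item in self.items]


seq = SequenceWindow(3)
for x in [1, 2, 3, 4, 5]:
    seq.add(x)

ts = TimestampWindow(10)
for t, x in [(0, "a"), (4, "b"), (9, "c"), (12, "d")]:
    ts.add(t, x)
```

The sequence-based window has a fixed memory bound, whereas the memory used by the timestamp-based window depends on the arrival rate.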

Streaming Methods: Ageing
Idea: Keep only the summary in main memory and discard objects as soon as they are processed.
- Multiply the summary by a decay factor after each time epoch, or after a certain number of elements have occurred.
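
As an illustration of ageing, the sketch below keeps a per-key count as the summary and multiplies it by a decay factor at the end of every epoch. The class name, the choice of a count as the summary, and the decay value are all assumptions for this example.

```python
class DecayedCounter:
    """Per-key counts that are aged by a constant decay factor per epoch."""

    def __init__(self, decay):
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.counts = {}

    def observe(self, key, weight=1.0):
        # The element itself is not stored; only the summary is updated.
        self.counts[key] = self.counts.get(key, 0.0) + weight

    def end_epoch(self):
        # Older contributions lose weight geometrically with every epoch.
        for key in self.counts:
            self.counts[key] *= self.decay


counter = DecayedCounter(decay=0.5)
counter.observe("a")
counter.observe("a")
counter.end_epoch()      # a: 2.0 -> 1.0
counter.observe("a")     # a: 2.0
counter.end_epoch()      # a: 1.0
```

An element seen `e` epochs ago contributes `decay ** e` to the summary, so stale data fades out without ever being stored.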

Streaming Methods
High velocity of incoming data vs. limited resources in time: it is impossible to process all data.
Solutions:
- Data reduction
- Data approximation
But: there is a trade-off between processing speed and the ability to provide precise query answers.

Streaming Methods: Sampling
- Select a subset of the data to reduce the amount of data to process.
- Difficulty: obtaining a representative sample
- Simplest form: random sampling
    - Reservoir Sampling
    - Min-Wise Sampling

Reservoir Sampling Algorithm
input: Stream S, size of reservoir k
begin
    Insert the first k objects into the reservoir;
    foreach subsequent object v in S do
        Let i be the position of v;
        M := random integer in range 1..i;
        if M <= k then
            Insert v into the reservoir;
            Delete an instance from the reservoir at random;

Load Shedding: Discard some fraction of the data if the arrival rate of the stream might overload the system.
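
The reservoir sampling pseudocode can be sketched in Python as follows. The function name and the optional `rng` parameter (used only to make runs reproducible) are assumptions of this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k objects fill the reservoir.
            reservoir.append(item)
        else:
            # The i-th object is accepted with probability k / i ...
            m = rng.randint(1, i)
            if m <= k:
                # ... and replaces a randomly chosen reservoir entry.
                reservoir[rng.randrange(k)] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, rng=random.Random(42))
```

Only the reservoir is held in memory, and the stream is read in a single scan, which matches the constraints stated for data streams.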

Streaming Methods: Data Synopsis & Histograms
- Summaries of data objects are often used to reduce the amount of data, e.g. microclusters that describe groups of similar objects.
- Histograms are used to approximate the frequency distribution of element values.
- Histograms are commonly used by query optimizers (e.g. for range queries).
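
To illustrate how a histogram can answer range queries approximately, here is a sketch of an equi-width histogram that is updated one element at a time. The equi-width layout, the class name, and the assumption that values are spread uniformly inside each bucket are choices made for this example, not something the slides specify.

```python
class EquiWidthHistogram:
    """Fixed-range histogram with equally wide buckets."""

    def __init__(self, low, high, buckets):
        self.low = low
        self.high = high
        self.width = (high - low) / buckets
        self.counts = [0] * buckets

    def add(self, value):
        if not self.low <= value < self.high:
            return  # out-of-range values are ignored in this sketch
        index = int((value - self.low) / self.width)
        # Guard against rounding that could push the index past the end.
        self.counts[min(index, len(self.counts) - 1)] += 1

    def estimate_range(self, a, b):
        """Estimate how many seen values fall into [a, b)."""
        total = 0.0
        for index, count in enumerate(self.counts):
            left = self.low + index * self.width
            right = left + self.width
            overlap = max(0.0, min(b, right) - max(a, left))
            # Assume values are uniformly distributed within a bucket.
            total += count * overlap / self.width
        return total


hist = EquiWidthHistogram(0, 100, 10)
for value in range(100):
    hist.add(value)
estimate = hist.estimate_range(25, 35)
```

The histogram stores ten counters instead of one hundred values; the price is that queries which cut through a bucket are only estimated.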
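
Overview of techniques to build a summary (reduced representation) of a sequence of numeric attributes:
- DFT (Discrete Fourier Transform)
- DWT (Discrete Wavelet Transform)
- SVD (Singular Value Decomposition)
- APCA (Adaptive Piecewise Constant Approximation)
- PAA (Piecewise Aggregate Approximation)
- PLA (Piecewise Linear Approximation)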

Discrete Wavelet Transformation (DWT)
Idea:
- The sequence is represented as a linear combination of basic wavelet functions.
- The wavelet transformation decomposes a signal into several groups of coefficients at different scales.
- Small coefficients can be eliminated, which causes only small errors when reconstructing the signal.
- Take only the first coefficients.
- Often, Haar wavelets are used (easy to implement).
[Figure: sequence X, its approximation X', and the basis functions Haar 0 to Haar 7]

Example: Step-wise transformation of the sequence (stream) X = <8, 4, 1, 3> into the Haar wavelet representation H = [4, 2, 2, -1]
X = {8, 4, 1, 3}
h_1 = 4 = mean(8, 4, 1, 3)
h_2 = 2 = mean(8, 4) - h_1
h_3 = 2 = (8 - 4)/2
h_4 = -1 = (1 - 3)/2
(Lossless) reconstruction of the original sequence (stream) from the Haar wavelet representation:
h_1 = 4, h_2 = 2, h_3 = 2, h_4 = -1 gives X = {8, 4, 1, 3}
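
The lossless reconstruction in the example can be checked with a short sketch. It assumes the coefficient layout used above (overall mean first, then detail coefficients from coarse to fine) and a sequence length that is a power of two; the function name is hypothetical.

```python
def haar_inverse(coefficients):
    """Rebuild the original sequence from Haar coefficients.

    Layout assumed: [mean, coarsest detail, ..., finest details].
    """
    approx = [coefficients[0]]
    pos = 1
    while pos < len(coefficients):
        details = coefficients[pos:pos + len(approx)]
        expanded = []
        for a, d in zip(approx, details):
            # Each (average, difference) pair yields two finer values.
            expanded.extend([a + d, a - d])
        approx = expanded
        pos += len(details)
    return approx


reconstructed = haar_inverse([4, 2, 2, -1])
```

For H = [4, 2, 2, -1] the first level gives the pair means 4 + 2 = 6 and 4 - 2 = 2, and the second level gives 6 + 2, 6 - 2, 2 + (-1), 2 - (-1), which is the original sequence.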

Haar Wavelet Transformation
Input sequence: S = <2, 5, 8, 9, 7, 4, -1, 1>

Haar Wavelet Transform Algorithm
input: Sequence S = <x_1, x_2, ..., x_n> of even length
output: Sequence of wavelet coefficients
begin
    Transform S into a sequence of two-component vectors <(s_1, d_1), ..., (s_{n/2}, d_{n/2})>, where s_i = (x_{2i-1} + x_{2i})/2 and d_i = (x_{2i-1} - x_{2i})/2;
    Separate the sequences s and d;
    Recursively transform sequence s;

Step 1: s = (2+5, 8+9, 7+4, -1+1)/2 = <3.5, 8.5, 5.5, 0>, d = (2-5, 8-9, 7-4, -1-1)/2 = <-1.5, -0.5, 1.5, -1>
Step 2: s = (3.5+8.5, 5.5+0)/2 = <6, 2.75>, d = (3.5-8.5, 5.5-0)/2 = <-2.5, 2.75>
Step 3: s = (6+2.75)/2 = 4.375, d = (6-2.75)/2 = 1.625
Wavelet coefficients: <4.375, 1.625, -2.5, 2.75, -1.5, -0.5, 1.5, -1>
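
The algorithm can be sketched iteratively in Python. Two assumptions go beyond the slide: the sequence length must be a power of two (so that every recursion level again has even length), and the differences are taken as (first - second)/2, in line with the earlier example where h_3 = (8 - 4)/2. The function name is hypothetical.

```python
def haar_transform(sequence):
    """Return Haar coefficients: [mean, coarsest detail, ..., finest details]."""
    n = len(sequence)
    if n == 0 or n & (n - 1):
        raise ValueError("sequence length must be a power of two")
    approx = [float(x) for x in sequence]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        approx = [(a + b) / 2 for a, b in pairs]
        # Coarser details are placed in front of the finer ones.
        details = level_details + details
    return approx + details


coefficients = haar_transform([2, 5, 8, 9, 7, 4, -1, 1])
small = haar_transform([8, 4, 1, 3])
```

Keeping only the first few entries of the result and treating the rest as zero yields a reduced representation of the sequence; how many to keep is a trade-off between space and reconstruction error.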

Spark Streaming
- Spark's streaming framework, built on top of Spark's Core API
- Data ingestion from several different data sources
- Stream processing can be combined with other Spark libraries (e.g. Spark MLlib)

Spark Streaming
Spark's streaming workflow:
- The streaming engine receives data from input streams.
- The data stream is divided into several micro-batches, i.e. sequences of RDDs.
- The micro-batches are processed by the Spark engine.
- The result is a data stream of batches of processed data.
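
The idea of cutting a stream into micro-batches can be illustrated without Spark. The sketch below is plain Python and only mimics the concept; the function name and the `(timestamp, value)` event format are assumptions, and nothing here is part of the Spark API.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Batch i covers the half-open interval
    [i * batch_interval, (i + 1) * batch_interval).
    """
    grouped = {}
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        grouped.setdefault(index, []).append(value)
    if not grouped:
        return []
    # Intervals without events still produce an (empty) batch.
    return [grouped.get(i, []) for i in range(max(grouped) + 1)]


batches = micro_batches(
    [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")], batch_interval=1
)
```

Each batch can then be handed to an ordinary batch computation, which is the essence of the micro-batching approach: latency is bounded from below by the batch interval.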

Spark Streaming
- DStreams (Discretized Streams) are the basic abstraction.
- Any operation applied to a DStream translates to operations on the underlying RDDs (computed by the Spark engine).
- StreamingContext objects are the starting points:

    sc = SparkContext(master, appName)
    ssc = StreamingContext(sc, 1)  # params: SparkContext, batch interval in seconds

Spark Streaming
General schedule for a Spark Streaming application:
1. Define the StreamingContext ssc.
2. Define the input sources by creating input DStreams.
3. Define the streaming computations by applying transformations and output operations to DStreams.
4. Start receiving data and processing it using ssc.start().
5. Wait for the processing to be stopped (manually or due to any error) using ssc.awaitTermination().
6. The processing can be stopped manually using ssc.stop().

Spark Streaming

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Create a local StreamingContext with two working threads and a batch
    # interval of 1 second
    sc = SparkContext("local[2]", "NetworkWordCount")
    ssc = StreamingContext(sc, 1)

    # Create a DStream that will connect to localhost:9999
    lines = ssc.socketTextStream("localhost", 9999)

    # Split each line into words
    words = lines.flatMap(lambda line: line.split(" "))

    # Count each word in each batch
    pairs = words.map(lambda word: (word, 1))
    wordCounts = pairs.reduceByKey(lambda x, y: x + y)

    # Print the first ten elements of each RDD of this DStream to the console
    wordCounts.pprint()

    # Start the computation and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()

Spark Streaming
- Support for window operations
- Two basic parameters: windowLength and slideInterval
- Support for many transformations on windowed DStreams

    # Reduce the last 30 seconds of data, every 10 seconds.
    # In PySpark the second positional argument is the inverse reduce
    # function; None means each window is recomputed from scratch.
    winWordCounts = pairs.reduceByKeyAndWindow(lambda x, y: x + y, None, 30, 10)
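
What a windowed reduce-by-key computes can be shown with a plain-Python sketch over a list of micro-batches. This only mirrors the semantics and is not the Spark API; the function name is hypothetical, and window length and slide interval are given in numbers of batches rather than seconds.

```python
def windowed_reduce_by_key(batches, reduce_func, window_length, slide_interval):
    """Reduce values per key over the last `window_length` batches,
    emitting one result every `slide_interval` batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        reduced = {}
        for batch in batches[start:end]:
            for key, value in batch:
                if key in reduced:
                    reduced[key] = reduce_func(reduced[key], value)
                else:
                    reduced[key] = value
        results.append(reduced)
    return results


pair_batches = [
    [("a", 1)],
    [("a", 1), ("b", 1)],
    [("b", 1)],
    [("a", 1)],
]
window_counts = windowed_reduce_by_key(
    pair_batches, lambda x, y: x + y, window_length=3, slide_interval=2
)
```

With a window of 3 batches and a slide of 2, consecutive windows overlap by one batch, so the second batch contributes to both results.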

Apache Storm
- Alternative to Spark Streaming
- Support for real-time processing
- Three abstractions:
    - Spouts
    - Bolts
    - Topologies

Apache Storm
Spouts:
- Source of streams
- Typically read from queuing brokers (e.g. Kafka, RabbitMQ)
- Can also generate their own data or read from external sources (e.g. Twitter)
Bolts:
- Process any number of input streams
- Produce any number of output streams
- Hold most of the logic of the computations (functions, filters, ...)

Apache Storm
Topologies:
- A network of spouts and bolts
- Each edge represents a bolt subscribing to the output stream of some other spout or bolt.
- A topology is an arbitrarily complex multi-stage stream computation.

Apache Storm
Streams:
- Core abstraction in Storm
- A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion.
- Tuples can contain standard types such as integers, floats, shorts, booleans, strings and so on.
- Custom types can be used if a custom serializer is defined.
- A stream grouping defines how a stream should be partitioned among the bolt's tasks.
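
How a stream grouping partitions tuples among a bolt's tasks can be sketched in plain Python. Two groupings are simulated: shuffle grouping (random but evenly balanced) and fields grouping (tuples with the same field value always reach the same task). Both exist in Storm, but the functions below are illustrative stand-ins, not Storm code; their names and the CRC32-based hashing are assumptions.

```python
import random
import zlib


def shuffle_grouping(tuples, num_tasks, rng=None):
    """Distribute tuples randomly so that tasks receive equal shares."""
    if rng is None:
        rng = random.Random()
    assignment = [[] for _ in range(num_tasks)]
    order = []
    for tup in tuples:
        if not order:
            # Start a new random round that visits every task once.
            order = list(range(num_tasks))
            rng.shuffle(order)
        assignment[order.pop()].append(tup)
    return assignment


def fields_grouping(tuples, num_tasks, key):
    """Send tuples with the same key to the same task."""
    assignment = [[] for _ in range(num_tasks)]
    for tup in tuples:
        # A stable hash keeps the key-to-task mapping reproducible.
        task = zlib.crc32(repr(key(tup)).encode("utf-8")) % num_tasks
        assignment[task].append(tup)
    return assignment


words = [("storm",), ("spark",), ("storm",), ("kafka",),
         ("spark",), ("storm",), ("flink",), ("kafka",)]
shuffled = shuffle_grouping(words, num_tasks=4, rng=random.Random(1))
by_word = fields_grouping(words, num_tasks=4, key=lambda t: t[0])
```

Shuffle grouping balances load, whereas fields grouping is needed when a task must see all tuples for a key, for example to count words.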

Apache Storm

    Config conf = new Config();
    conf.setNumWorkers(2); // use two worker processes

    TopologyBuilder topologyBuilder = new TopologyBuilder();

    topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

    // 4 tasks spread across 2 executors; tuples are distributed randomly
    // across the bolt's tasks so that each task gets an equal number of tuples
    topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
                   .setNumTasks(4)
                   .shuffleGrouping("blue-spout");

    topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
                   .shuffleGrouping("green-bolt");

    StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );

[Figure: the topology runs in two worker processes; each worker process hosts five executors and six tasks, with one executor running two tasks.]
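
The numbers in the figure follow from the parallelism settings in the code. The short sketch below does the arithmetic; it relies on the Storm default that a component runs one task per executor unless the number of tasks is set explicitly, and it assumes executors are spread evenly over the workers. The function and dictionary names are hypothetical.

```python
def summarize_parallelism(num_workers, components):
    """components maps a component name to (executors, tasks)."""
    total_executors = sum(executors for executors, _ in components.values())
    total_tasks = sum(tasks for _, tasks in components.values())
    return {
        "total_executors": total_executors,
        "total_tasks": total_tasks,
        "executors_per_worker": total_executors / num_workers,
        "tasks_per_worker": total_tasks / num_workers,
        "tasks_per_executor": {
            name: tasks / executors
            for name, (executors, tasks) in components.items()
        },
    }


topology = {
    "blue-spout": (2, 2),    # parallelism hint 2, default one task each
    "green-bolt": (2, 4),    # parallelism hint 2, setNumTasks(4)
    "yellow-bolt": (6, 6),   # parallelism hint 6, default one task each
}
summary = summarize_parallelism(2, topology)
```

The result is 10 executors and 12 tasks in total, i.e. 5 executors and 6 tasks per worker process, with each green-bolt executor running 2 tasks.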

Further Reading
- Joao Gama: Knowledge Discovery from Data Streams
- Jure Leskovec, Anand Rajaraman, Jeff Ullman: Mining of Massive Datasets
- Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia: Learning Spark - Lightning-Fast Big Data Analysis
Architectures for massive data management
Architectures for massive data management Apache Spark Albert Bifet [email protected] October 20, 2015 Spark Motivation Apache Spark Figure: IBM and Apache Spark What is Apache Spark Apache
Streaming items through a cluster with Spark Streaming
Streaming items through a cluster with Spark Streaming Tathagata TD Das @tathadas CME 323: Distributed Algorithms and Optimization Stanford, May 6, 2015 Who am I? > Project Management Committee (PMC) member
CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University
CS 555: DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Streaming Significance of minimum delays? Interleaving
Spark ΕΡΓΑΣΤΗΡΙΟ 10. Prepared by George Nikolaides 4/19/2015 1
Spark ΕΡΓΑΣΤΗΡΙΟ 10 Prepared by George Nikolaides 4/19/2015 1 Introduction to Apache Spark Another cluster computing framework Developed in the AMPLab at UC Berkeley Started in 2009 Open-sourced in 2010
Big Data Analytics Hadoop and Spark
Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software
Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic
Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the
Architectures for massive data management
Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet [email protected] October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with
Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase
Architectural patterns for building real time applications with Apache HBase Andrew Purtell Committer and PMC, Apache HBase Who am I? Distributed systems engineer Principal Architect in the Big Data Platform
Unified Big Data Analytics Pipeline. 连 城 [email protected]
Unified Big Data Analytics Pipeline 连 城 [email protected] What is A fast and general engine for large-scale data processing An open source implementation of Resilient Distributed Datasets (RDD) Has an
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics
E6895 Advanced Big Data Analytics Lecture 3:! Spark and Data Analytics Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and Big
Resource Aware Scheduler for Storm. Software Design Document. <[email protected]> Date: 09/18/2015
Resource Aware Scheduler for Storm Software Design Document Author: Boyang Jerry Peng Date: 09/18/2015 Table of Contents 1. INTRODUCTION 3 1.1. USING
Spark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
SPARK USE CASE IN TELCO. Apache Spark Night 9-2-2014! Chance Coble!
SPARK USE CASE IN TELCO Apache Spark Night 9-2-2014! Chance Coble! Use Case Profile Telecommunications company Shared business problems/pain Scalable analytics infrastructure is a problem Pushing infrastructure
Big Data Analytics. Lucas Rego Drumond
Big Data Analytics Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Apache Spark Apache Spark 1
Real-time Big Data Analytics with Storm
Ron Bodkin Founder & CEO, Think Big June 2013 Real-time Big Data Analytics with Storm Leading Provider of Data Science and Engineering Services Accelerating Your Time to Value IMAGINE Strategy and Roadmap
The basic data mining algorithms introduced may be enhanced in a number of ways.
DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS The basic data mining algorithms introduced may be enhanced in a number of ways. Data mining algorithms have traditionally assumed data is memory resident,
Openbus Documentation
Openbus Documentation Release 1 Produban February 17, 2014 Contents i ii An open source architecture able to process the massive amount of events that occur in a banking IT Infraestructure. Contents:
Machine- Learning Summer School - 2015
Machine- Learning Summer School - 2015 Big Data Programming David Franke Vast.com hbp://www.cs.utexas.edu/~dfranke/ Goals for Today Issues to address when you have big data Understand two popular big data
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
SoSe 2014: M-TANI: Big Data Analytics
SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering
Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman
Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important
HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet)
HUAWEI Advanced Data Science with Spark Streaming Albert Bifet (@abifet) Huawei Noah s Ark Lab Focus Intelligent Mobile Devices Data Mining & Artificial Intelligence Intelligent Telecommunication Networks
Big Data Analytics with Cassandra, Spark & MLLib
Big Data Analytics with Cassandra, Spark & MLLib Matthias Niehoff AGENDA Spark Basics In A Cluster Cassandra Spark Connector Use Cases Spark Streaming Spark SQL Spark MLLib Live Demo CQL QUERYING LANGUAGE
Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA
Real Time Fraud Detection With Sequence Mining on Big Data Platform Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May 6 2014 Santa Clara, CA Open Source Big Data Eco System Query (NOSQL) : Cassandra,
Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack
Apache Spark Document Analysis Course (Fall 2015 - Scott Sanner) Zahra Iman Some slides from (Matei Zaharia, UC Berkeley / MIT& Harold Liu) Reminder SparkConf JavaSpark RDD: Resilient Distributed Datasets
Big Data Frameworks: Scala and Spark Tutorial
Big Data Frameworks: Scala and Spark Tutorial 13.03.2015 Eemil Lagerspetz, Ella Peltonen Professor Sasu Tarkoma These slides: http://is.gd/bigdatascala www.cs.helsinki.fi Functional Programming Functional
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control
Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University
Big Data Processing. Patrick Wendell Databricks
Big Data Processing Patrick Wendell Databricks About me Committer and PMC member of Apache Spark Former PhD student at Berkeley Left Berkeley to help found Databricks Now managing open source work at Databricks
Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera
Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera About Me Software Engineer @ Cloudera Contributed to Kafka, Hive, Parquet and Sentry Used to work in HPC @singhasdev 204 Cloudera,
Big Data Analysis: Apache Storm Perspective
Big Data Analysis: Apache Storm Perspective Muhammad Hussain Iqbal 1, Tariq Rahim Soomro 2 Faculty of Computing, SZABIST Dubai Abstract the boom in the technology has resulted in emergence of new concepts
Big Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify
Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)
Rakam: Distributed Analytics API
Rakam: Distributed Analytics API Burak Emre Kabakcı May 30, 2014 Abstract Today, most of the big data applications needs to compute data in real-time since the Internet develops quite fast and the users
Unified Batch & Stream Processing Platform
Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems Current Solutions Were Built
Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
Image Analytics on Big Data In Motion Implementation of Image Analytics CCL in Apache Kafka and Storm
Image Analytics on Big Data In Motion Implementation of Image Analytics CCL in Apache Kafka and Storm Lokesh Babu Rao 1 C. Elayaraja 2 1PG Student, Dept. of ECE, Dhaanish Ahmed College of Engineering,
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com
Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC [email protected] http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came
LAB 2 SPARK / D-STREAM PROGRAMMING SCIENTIFIC APPLICATIONS FOR IOT WORKSHOP
LAB 2 SPARK / D-STREAM PROGRAMMING SCIENTIFIC APPLICATIONS FOR IOT WORKSHOP ICTP, Trieste, March 24th 2015 The objectives of this session are: Understand the Spark RDD programming model Familiarize with
Data Science in the Wild
Data Science in the Wild Lecture 4 59 Apache Spark 60 1 What is Spark? Not a modified version of Hadoop Separate, fast, MapReduce-like engine In-memory data storage for very fast iterative queries General
Apache Flink Next-gen data analysis. Kostas Tzoumas [email protected] @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas [email protected] @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs
1 Big Data Analytics Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE
Introduction to Big Data with Apache Spark UC BERKELEY
Introduction to Big Data with Apache Spark UC BERKELEY This Lecture Programming Spark Resilient Distributed Datasets (RDDs) Creating an RDD Spark Transformations and Actions Spark Programming Model Python
Big Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set
Apache Storm vs. Spark Streaming Two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming Two Stream Platforms compared DBTA Workshop on Stream Berne, 3.1.014 Guido Schmutz BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH
Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY
Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop
Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person
Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia
Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing
BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane
BIG DATA Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management Author: Sandesh Deshmane Executive Summary Growing data volumes and real time decision making requirements
Log Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, [email protected] Amruta Deshpande Department of Computer Science, [email protected]
Spark. Fast, Interactive, Language- Integrated Cluster Computing
Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
CAPTURING & PROCESSING REAL-TIME DATA ON AWS
CAPTURING & PROCESSING REAL-TIME DATA ON AWS @ 2015 Amazon.com, Inc. and Its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent
BIG DATA ANALYTICS For REAL TIME SYSTEM
BIG DATA ANALYTICS For REAL TIME SYSTEM Where does big data come from? Big Data is often boiled down to three main varieties: Transactional data these include data from invoices, payment orders, storage
Real Time Data Processing using Spark Streaming
Real Time Data Processing using Spark Streaming Hari Shreedharan, Software Engineer @ Cloudera Committer/PMC Member, Apache Flume Committer, Apache Sqoop Contributor, Apache Spark Author, Using Flume (O
Introduction to Spark
Introduction to Spark Shannon Quinn (with thanks to Paco Nathan and Databricks) Quick Demo Quick Demo API Hooks Scala / Java All Java libraries *.jar http://www.scala- lang.org Python Anaconda: https://
Mining Social Network Graphs
Mining Social Network Graphs Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014 Social Network No introduc+on required Really? We s7ll need to understand
Big Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
THEMIS: Fairness in Data Stream Processing under Overload
THEMIS: Fairness in Data Stream Processing under Overload Evangelia Kalyvianaki City University London, UK Marco Fiscato Imperial College London, UK Theodoros Salonidis IBM Research, USA Peter R. Pietzuch
White Paper. How Streaming Data Analytics Enables Real-Time Decisions
White Paper How Streaming Data Analytics Enables Real-Time Decisions Contents Introduction... 1 What Is Streaming Analytics?... 1 How Does SAS Event Stream Processing Work?... 2 Overview...2 Event Stream
Kafka & Redis for Big Data Solutions
Kafka & Redis for Big Data Solutions Christopher Curtin Head of Technical Research @ChrisCurtin About Me 25+ years in technology Head of Technical Research at Silverpop, an IBM Company (14 + years at Silverpop)
FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara
CS535 Big Data - Fall 2015 W1.B.1 CS535 Big Data - Fall 2015 W1.B.2 CS535 BIG DATA FAQs Wait list Term project topics PART 0. INTRODUCTION 2. A PARADIGM FOR BIG DATA Sangmi Lee Pallickara Computer Science,
Data Management in the Cloud
Data Management in the Cloud Ryan Stern [email protected] : Advanced Topics in Distributed Systems Department of Computer Science Colorado State University Outline Today Microsoft Cloud SQL Server
Online and Scalable Data Validation in Advanced Metering Infrastructures
Online and Scalable Data Validation in Advanced Metering Infrastructures Chalmers University of technology Agenda 1. Problem statement 2. Preliminaries Data Streaming 3. Streaming-based Data Validation
Future Internet Technologies
Future Internet Technologies Big (?) Processing Dr. Dennis Pfisterer Institut für Telematik, Universität zu Lübeck http://www.itm.uni-luebeck.de/people/pfisterer FIT Until Now Architectures -Server SPDY
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya
Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming by Dibyendu Bhattacharya Pearson : What We Do? We are building a scalable, reliable cloud-based learning platform providing services
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
GROW WITH BIG DATA Third Eye Consulting Services & Solutions LLC.
GROW WITH BIG DATA Third Eye Consulting Services & Solutions LLC. Connected Cars Driving Us to a Better Us - In Real Time What is a Connected Car? Connected Car - Definition A connected car is a car that
Brave New World: Hadoop vs. Spark
Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,
Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island
Big Data Principles and best practices of scalable real-time data systems NATHAN MARZ JAMES WARREN II MANNING Shelter Island contents preface xiii acknowledgments xv about this book xviii ~1 Anew paradigm
Hadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations
Beyond Lambda - how to get from logical to physical Artur Borycki, Director International Technology & Innovations Simplification & Efficiency Teradata believe in the principles of self-service, automation
Going Deep with Spark Streaming
Going Deep with Spark Streaming Andrew Psaltis (@itmdata) ApacheCon, April 16, 2015 Outline Introduction DStreams Thinking about time Recovery and Fault tolerance Conclusion About Me Andrew Psaltis Data
Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015
Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015 Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours
Apache Kafka Your Event Stream Processing Solution
01 0110 0001 01101 Apache Kafka Your Event Stream Processing Solution White Paper www.htcinc.com Contents 1. Introduction... 2 1.1 What are Business Events?... 2 1.2 What is a Business Data Feed?... 2
Amazon Kinesis and Apache Storm
Amazon Kinesis and Apache Storm Building a Real-Time Sliding-Window Dashboard over Streaming Data Rahul Bhartia October 2014 Contents Contents Abstract Introduction Reference Architecture Amazon Kinesis
Big Data Management and Analytics
Big Data Management and Analytics Lecture Notes Winter semester 2015 / 2016 Ludwig-Maximilians-University Munich Prof. Dr. Matthias Renz 2015 Based on lectures by Donald Kossmann (ETH Zürich), as well
