Data Stream Algorithms in Storm and R. Radek Maciaszek



Similar documents
Real-time Big Data Analytics with Storm

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Introducing Storm 1 Core Storm concepts Topology design

Predictive Analytics with Storm, Hadoop, R on AWS

NextGen Infrastructure for Big DATA Analytics.

Hunk & Elas=c MapReduce: Big Data Analy=cs on AWS

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Using RDBMS, NoSQL or Hadoop?

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Ins+tuto Superior Técnico Technical University of Lisbon. Big Data. Bruno Lopes Catarina Moreira João Pinho

Infrastructures for big data

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

The Internet of Things and Big Data: Intro

Openbus Documentation

Apache Hadoop. Alexandru Costan

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

How To Scale Out Of A Nosql Database

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

NOT IN KANSAS ANY MORE

Ibis: Scaling Python Analy=cs on Hadoop and Impala

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Introduc8on to Apache Spark

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Architectures for massive data management

Massive Cloud Auditing using Data Mining on Hadoop

Apache HBase. Crazy dances on the elephant back

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

How To Use Big Data For Telco (For A Telco)

Analyzing Big Data with AWS

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

DNS Big Data

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

The 3 questions to ask yourself about BIG DATA

Rakam: Distributed Analytics API

Hadoop and Map-Reduce. Swati Gore

Kafka & Redis for Big Data Solutions

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

ANALYTICS CENTER LEARNING PROGRAM

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

NoSQL Data Base Basics

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Cyber Security With Big Data

HBase Schema Design. NoSQL Ma4ers, Cologne, April Lars George Director EMEA Services

WSO2 Message Broker. Scalable persistent Messaging System

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Challenges for Data Driven Systems

Run$me Query Op$miza$on

Unified Batch & Stream Processing Platform

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Keywords Big Data, NoSQL, Relational Databases, Decision Making using Big Data, Hadoop

Scalable Architecture on Amazon AWS Cloud

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Apache Spark and the future of big data applica5ons. Eric Baldeschwieler

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Hadoop & Spark Using Amazon EMR

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

MapReduce with Apache Hadoop Analysing Big Data

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

The 4 Pillars of Technosoft s Big Data Practice

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

#TalendSandbox for Big Data

BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO

Big Data Analysis: Apache Storm Perspective

Project Overview. Collabora'on Mee'ng with Op'mis, Sept. 2011, Rome

BENCHMARKING V ISUALIZATION TOOL

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

HADOOP. Revised 10/19/2015

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

BIG DATA What it is and how to use?

How To Create A Data Visualization With Apache Spark And Zeppelin

Privacy- Preserving P2P Data Sharing with OneSwarm. Presented by. Adnan Malik

Retaining globally distributed high availability Art van Scheppingen Head of Database Engineering

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Transforming the Telecoms Business using Big Data and Analytics

Sentimental Analysis using Hadoop Phase 2: Week 2

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Real Time Big Data Processing

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Open Source Technologies on Microsoft Azure

SQream Technologies Ltd - Confiden7al

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

TRAINING PROGRAM ON BIGDATA/HADOOP

Distributed Computing and Big Data: Hadoop and MapReduce

Hadoop Usage At Yahoo! Milind Bhandarkar

Virtualizing Apache Hadoop. June, 2012

Transcription:

Data Stream Algorithms in Storm and R Radek Maciaszek

Who Am I? l Radek Maciaszek l l l l l l Consul9ng at DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy. Data scien9st at a hedge fund in London MSc in Bioinforma9cs MSc in Cogni9ve and Decisions Sciences BSc Computer Science During the career worked with many companies on Big Data and real 9me processing projects; indcluding Orange, Unanimis, SkimLinks, Cogni9veMatch, OpenX, ad4game, ecourier and many others.

Agenda Why streaming data? Streaming algorithms crash course Storm Storm + R Use Cases

Data Explosion Exponen9al growth of informa9on [IDC, 2012]

Data, data everywhere [Economist] In 2013, the available storage capacity could hold 33% of all data. By 2020, it will be able to store less than 15% [IDC, 2014]

Data Streams crash course Reasons to use data streams processing Data doesn t fit into available memory and/or disk Near real- 9me data processing Scalability, cloud processing Examples Network traffic Web traffic (i.e. online adver9sing) Fraud detec9on

Use Case Dynamic Sampling OpenX - open- source ad server Millions of ad views per hour Challenge Create ad samples for sta9s9cal analysis. E.g: A/B tes9ng, ANOVA, etc. The data doesn t fit into the memory. Solu9on Reservoir Sampling allows to find a sample of a constant length from a stream of unknown length of elements

Data Streaming algorithms Sampling Use sta9s9c of a sample to eskmate the stakskc of popula9on. The bigger the sample the beder the es9mate. Reservoir Sampling sample popula9ons, without knowing it s size. Algorithm: Store first n elements into the reservoir. Insert each k- th from the input stream in a random spot of the reservoir with a probability of n/k (decreasing probability) Source: Maldonado, et al; 2011

Use Case - CounKng Unique Users 100m+ daily visits Challenge - number of unique visitors - one of the most important metrics in online adver9sing Hadoop MapReduce. It worked but took long 9me and much memory. Solu9on: HyperLogLog algorithm Highly effec9ve algorithm to count dis9nct number of elements See Redis HyperLogLog data structure Many other use cases: Cardinality in DB queries op9misa9on ISP es9mates of traffic usage

Cardinality eskmakon Idea: Convert high- dimensional data to a smaller dimensional space. Use lower dimensional image to es9mate the func9on of interest. Example: count number of words in all works of Shakespeare Naïve solu9on: keep a set where you add all new words Probabilis9c coun9ng 1983. Flajolet & Mar9n. (first streaming algorithm) A beder one LogLog algorithm [Durand & Flajolet; 2003]

ProbabilisKc counkng Transform input data into i.i.d. (independent and iden9cally distributed) uniform random bits of informa9on Hash(x)= bit1 bit2 Where P(bit1)=P(bit2)=1/2 1xxx - > P = 1/2, n >= 2 11xx - > P = 1/4, n >= 4 111x - > P = 1/8, n >= 8 n >= Record biggest Flajolet (1983) es9mated the bias and p = posi9on of a first 0 unhashed hashed Source: hdp://git.io/vectc

ProbabilisKc counkng - algorithm p calculates posi9on of first zero in the bitmap Es9mate the size using: R proof- of- concept implementa9on: hdp://git.io/ve8ia Example:

HyperLogLog LogLog instead of keeping track of all 01s, keep track only of the largest 0 This will take LogLog bits, but at the cost of lost precision Example: 00000000 - > 1, 10100000 - > 4, 11100000 - > 4, 11110010 - > 8 SuperLogLog remove x% (typically 70%) of largest number before es9ma9ng, more complex analysis HyperLogLog harmonic mean of SuperLogLog es9mates Fast, cheap and 98% correct What if we want more sophis9cated analy9cs? Reference: Flajolet; Fusy et al. 2007

Moving average Example, online mean of the moving average of 9me- series at 9me t Where: M window size of the moving average. Any any 9me t predict t+1 by removing last t- M element, and adding t element. Requires last M elements to be stored in memory There are many more: mean, variance, regression, percen9le

R Open Source StaKsKcs Open Source = low cost of adop9ng. Useful in prototyping. Large global community - more than 2.5 million users ~5,000 open source free packages Extensively used for modelling and visualisa9ons Microsou Courts Data Scien9sts with Revolu9on Analy9cs Buy (WSJ) Source: Rexer Analy9cs

Use Case Real- Kme Machine Learning Gaming ad- network 150m+ ad impressions per day 10 servers Storm cluster Lambda architecture (fast and batch layers): used in parallel to Hadoop, NoSQL Challenge Make real- 9me decision on which ad to display Use sophis9cated sta9s9cal environment to A/B test ads Solu9on Use Storm + R to do real- 9me sta9s9cs Beta DistribuKon to compare two ads

Apache Storm Real- 9me calcula9ons the Hadoop of real 9me Fault tolerance Easy to scale Easy to develop - has local and distributed mode Storm mul9- lang can be used with any language, here with R Gedy Images

Apache Storm Background Open sourced in September 2011 Now part of Apache Storm Implementa9on ~15k LOC Wriden in Clojure / Java Stream processing processes messages and updates database Con9nuous computa9on query source, sends results to clients DRPC distributed RPC Guaranteed message processing no data loss Transac9onal topologies Storm Trident easy to develop using abstract API

Nimbus Storm Architecture Master - equivalent of Hadoop JobTracker Distributes workload across cluster Heartbeat, realloca9on of workers when needed Supervisor Runs the workers Communicates with Nimbus using ZK Zookeeper coordina9on, nodes discovery Source: Apache Storm

Storm Topology Graph of stream computa9ons Basic primi9ves nodes Spout source of streams (Twider API, queue, logs) Bolt consumes streams, does the work, produces streams Can integrate with third party languages and databases: Java Python Ruby Redis Hbase Cassandra Image source: Storm github wiki

Storm + R Storm Mul9- Language protocol Mul9ple Storm- R mul9- language packages provide Storm/R plumbing Recommended package: hdp://cran.r- project.org/web/packages/storm Example R code

Use Case Beta DistribuKons Simple approach: Wolphram Alpha: beta distribu9on (5, (30-5)) Source: Wolphram Alpha

Beta distribukons prototyping the R code Bootstrapping in R

Storm and R storm = Storm$new(); storm$lambda = function(s) { t = s$tuple; t$output = vector(mode="character",length=1); clicks = as.numeric(t$input[1]); views = as.numeric(t$input[2]); t$output[1] = rbeta(1, clicks, views - clicks); s$emit(t); #alternative: mark the tuple as failed. s$fail(t); } storm$run();

Storm and Java integrakon Define Spout/Bolt in any programming language Executed as subprocess JSON over stdin/stdout public static class RBolt extends ShellBolt implements IRichBolt { public RBolt() { super("rscript", script.r"); } } Source: Apache Storm

Storm + R = flexibility Integra9on with exis9ng Storm ecosystem NoSQL, Ka{a SOA framework - DRPC Scaling up your exis9ng R processes Trident Source: Apache Storm

Storm References hdps://storm.apache.org

Thank you Data Stream will save you when: Big Data problems Do not want to flip the coins for the next 10 years Make decisions in real- 9me Ques9ons and discussion hdps://uk.linkedin.com/in/radekmaciaszek hdp://www.dataminelab.com