Streamdrill: Analyzing Big Data Streams in Realtime
Mikio L. Braun, mikio@streamdrill.com, @mikiobraun
Realtime Big Data: Sources
Finance, Gaming, Monitoring, Advertising, Sensor Networks, Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
Tasks by Complexity
In order of increasing complexity:
- Counting and averages (over time windows), count distinct
- Profiles and histograms
- Trends
- Outliers and fraud detection
- Prediction (churn, failure)
Tasks by Latency
All of these need fast responses:
- Reporting
- Visualization and monitoring
- Optimizing, personalization
- Control
It is only really realtime if you can react in realtime!
What Makes Data Big?
Many events: 100 events/second means 360k per hour, 8.6M per day, 260M per month, 3.2B per year. And many objects.
http://www.flickr.com/photos/arenamontanus/269158554/
Current Approach: Scaling
Batch (MapReduce) or stream (Storm, Spark). Expensive to scale to realtime!
Scaling? Approximate!
Scaling is nice, but:
- Scaling is expensive
- Data is noisy, and not every data point is important
- Methods are noisy, too
- Exact numbers are often not necessary
Scaling vs. Approximation
Scaling:
- needs raw processing power to get fast
- may compute results you don't need
- practically requires a cluster setup
Approximation:
- approximate more to get fast
- focuses on the data you are interested in
- consumes the whole stream with a single node
Heavy Hitters (a.k.a. Top-k)
Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users), but keep only the most active elements in a fixed table of counts.
Case 1: element already in the table (e.g. paul, count 12): simply increment its count to 13.
Case 2: new element (e.g. nico): replace the element with the minimum count (alex, count 2) and give the new element that count plus one (3).
Metwally, Agrawal, El Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
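The two cases above are the Space-Saving algorithm from the Metwally et al. paper. A minimal Python sketch (for illustration only; streamdrill's actual implementation is in Scala and differs in detail):

```python
class SpaceSaving:
    """Minimal Space-Saving sketch: a fixed table of at most `capacity` counts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count

    def update(self, item):
        if item in self.counts:
            # Case 1: item already in the table -> just increment its count.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not full yet -> start tracking the item.
            self.counts[item] = 1
        else:
            # Case 2: new item, table full -> evict the minimum-count item
            # and give the new item that count plus one (an overestimate).
            min_item = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(min_item)
            self.counts[item] = min_count + 1

    def top(self, k):
        """Return the k most active items with their estimated counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

Memory stays bounded by the table capacity no matter how many distinct items the stream contains; the price is that counts of newly inserted items are overestimates.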
Wait a Minute: Only Counting?
Well, getting the most active items is already useful: web analytics, users, trending topics. And counting is statistics!
Counting is Statistics
- Empirical mean: x̄ = (1/n) Σ_i x_i
- Correlations and the covariance matrix (→ PCA): C = (1/n) Σ_i (x_i - x̄)(x_i - x̄)^T
More: Maximum Likelihood
Estimate probabilistic models based on these counts and sums, e.g. the variance estimate σ² = (1/n) Σ_i (x_i - x̄)², which is slightly biased (compared to the 1/(n-1) version), but simpler.
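The point that "counting is statistics" can be made concrete: keeping only the running sums n, Σx and Σxx^T is enough to recover the mean and the (biased, maximum-likelihood) covariance at any time. A small sketch in Python for two-dimensional data (hypothetical helper functions, not streamdrill's API):

```python
# Running sums for a 2-dimensional stream: only n, sum(x) and sum(x x^T)
# are kept; mean and covariance are derived from them on demand.
n = 0
s = [0.0, 0.0]                      # running sum of x
ss = [[0.0, 0.0], [0.0, 0.0]]       # running sum of outer products x x^T

def observe(x):
    """Fold one observation x = [x0, x1] into the running sums."""
    global n
    n += 1
    for i in range(2):
        s[i] += x[i]
        for j in range(2):
            ss[i][j] += x[i] * x[j]

def mean():
    return [si / n for si in s]

def cov():
    """ML covariance: (1/n) sum x x^T - mean mean^T (slightly biased)."""
    m = mean()
    return [[ss[i][j] / n - m[i] * m[j] for j in range(2)] for i in range(2)]
```

Because the sums are additive, they can be updated one event at a time, which is exactly what a streaming setting needs.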
Outlier Detection
Once you have a model, you can compute p-values (based on recent time frames!) and flag events that are too unlikely under the model.
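As a sketch of the p-value idea, assuming a Gaussian model fitted to the recent window (an illustrative choice; the slide does not fix a particular model):

```python
import math

def p_value(x, mean, std):
    """Two-sided Gaussian p-value: probability of a deviation from the
    mean at least as large as |x - mean|, under N(mean, std^2)."""
    z = abs(x - mean) / std
    return math.erfc(z / math.sqrt(2))
```

An observation with a p-value below some threshold (say 0.01) would then be reported as an outlier.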
Online TF-IDF
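The slide gives no details, but the idea of TF-IDF from streamed counts can be sketched as follows: maintain per-document term counts and global document frequencies online, and compute tf-idf on demand (hypothetical helpers; names and formula variant are assumptions, and decayed counters could replace the plain counts):

```python
import math

doc_terms = {}   # doc -> {term: count}
doc_freq = {}    # term -> number of documents containing the term

def add_term(doc, term):
    """Fold one (doc, term) event into the counts."""
    terms = doc_terms.setdefault(doc, {})
    if term not in terms:
        doc_freq[term] = doc_freq.get(term, 0) + 1
    terms[term] = terms.get(term, 0) + 1

def tf_idf(doc, term):
    """Raw term frequency times log inverse document frequency."""
    tf = doc_terms.get(doc, {}).get(term, 0)
    idf = math.log(len(doc_terms) / doc_freq.get(term, 1))
    return tf * idf
```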
So Much More to Do with Trends
Least Recently Used caches, sparse vectors, sparse matrices, conditional probabilities (histograms), accumulators, ...
streamdrill
Core engine: Heavy Hitters counting + exponential decay; instant counts and top-k results over time windows; modules for specific use cases.
Features:
- In-memory, with snapshots to disk
- Written in Scala
- Interface: query by REST, push data by REST or UDP
- Single-node performance: up to 20k events/s, about 1M objects per GB
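The "counting + exponential decay" combination can be illustrated with a counter whose value halves every fixed half-life, applying the decay lazily at read and update time (a common trick; parameter names are assumptions and the actual streamdrill internals may differ):

```python
import math

class DecayedCounter:
    """A counter whose value decays exponentially with a given half-life.
    Decay is applied lazily whenever the counter is read or updated."""

    def __init__(self, half_life):
        self.rate = math.log(2) / half_life
        self.value = 0.0
        self.last = 0.0  # timestamp of the last decay

    def _decay(self, now):
        self.value *= math.exp(-self.rate * (now - self.last))
        self.last = now

    def add(self, amount, now):
        self._decay(now)
        self.value += amount

    def get(self, now):
        self._decay(now)
        return self.value
```

Combined with a Heavy Hitters table, such counters give top-k results over an effective time window without storing individual events.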
Architecture Overview
streamdrill Modules
Ready-made modules on top of the streamdrill core cover core business applications: profiling, recommendation, fraud detection.
Use Case: Realtime User Profiles
Objective: track user activity in different categories in realtime.
Event: (user, category)
Output: trends for (user, *)
Use Case: Realtime Recommendations
Objective: recommend items to users based on user interests and item popularity.
Event: (user, item, categories)
Output: user profiles to find categories for item trends
Use Case: Realtime Fraud Detection and Rate Limiting
Objective: identify unusually active users/IPs, or unusually high co-occurrence.
Event: (id) or (id, device)
Output: trend for ids, or the size of (id, *) or (*, device) above a threshold
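The "activity above a threshold" check can be sketched with a per-id sliding window (an illustrative stand-in for the trend-based check; function and parameter names are assumptions, and in practice decayed counters would replace the explicit event queues):

```python
from collections import defaultdict, deque

# id -> timestamps of recent events
events = defaultdict(deque)

def suspicious(ident, now, window=60.0, threshold=100):
    """Record one event for `ident` at time `now` and report whether its
    activity within the last `window` seconds exceeds `threshold`."""
    q = events[ident]
    q.append(now)
    while q and q[0] <= now - window:
        q.popleft()
    return len(q) > threshold
```

Requests from a suspicious id could then be rate-limited or flagged for review.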
Summary
Streamdrill: Big Data through approximation. Counts are the basis for (nearly) everything. Try our demo: streamdrill.com