Streamdrill: Analyzing Big Data Streams in Realtime
Mikio L. Braun, mikio@streamdrill.com, @mikiobraun
Realtime Big Data: Sources
Finance, Gaming, Monitoring, Advertising, Sensor Networks, Social Media
Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie
Tasks by Complexity
In order of increasing complexity:
- Counting and averages (over time windows), count distinct
- Profiles and histograms
- Trends
- Outliers and fraud detection
- Prediction (churn, failure)
Tasks by Latency
All of these need fast responses:
- Reporting
- Visualization and monitoring
- Optimizing, personalization
- Control
It is only really realtime if you can react in realtime!
What Makes Data Big?
Many events: 100 events/second means 360k per hour, 8.6M per day, 260M per month, 3.2B per year. And many objects.
http://www.flickr.com/photos/arenamontanus/269158554/
Current Approach: Scaling
Batch (MapReduce) or stream (Storm, Spark). Expensive to scale to realtime!
Scaling? Approximate!
Scaling is nice, but:
- Scaling is expensive
- Data is noisy, and not every data point is important
- Methods are noisy, too
- Exact numbers are often not necessary
Scaling vs. Approximation
Scaling:
- needs raw processing power to get fast
- may compute results you don't need
- practically requires a cluster setup
Approximation:
- approximate more to get fast
- focuses on the data you are interested in
- consumes the whole stream with a single node
Heavy Hitters (a.k.a. Top-k)
Count activities over large item sets (millions or more, e.g. IP addresses, Twitter users), but keep only the most active elements in a fixed table of counts.
Case 1: element already in the table (e.g. paul, count 12): simply increment its count to 13.
Case 2: new element (e.g. nico): replace the element with the minimum count (alex, count 2) and give the new element that count plus one (3).
Metwally, Agrawal, El Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
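The two cases above are the Space-Saving algorithm from the Metwally et al. paper. A minimal Python sketch (for illustration only; streamdrill's actual implementation is in Scala and differs in detail):

```python
class SpaceSaving:
    """Minimal Space-Saving sketch: a fixed table of at most `capacity` counts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # item -> estimated count

    def update(self, item):
        if item in self.counts:
            # Case 1: item already in the table -> just increment its count.
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            # Table not full yet -> start tracking the item.
            self.counts[item] = 1
        else:
            # Case 2: new item, table full -> evict the minimum-count item
            # and give the new item that count plus one (an overestimate).
            min_item = min(self.counts, key=self.counts.get)
            min_count = self.counts.pop(min_item)
            self.counts[item] = min_count + 1

    def top(self, k):
        """Return the k most active items with their estimated counts."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:k]
```

Memory stays bounded by the table capacity no matter how many distinct items the stream contains; the price is that counts of newly inserted items are overestimates.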
Wait a Minute: Only Counting?
Well, getting the most active items is already useful: web analytics, users, trending topics. And counting is statistics!
Counting is Statistics
- Empirical mean: x̄ = (1/n) Σ_i x_i
- Correlations and the covariance matrix (→ PCA): C = (1/n) Σ_i (x_i - x̄)(x_i - x̄)^T
More: Maximum Likelihood
Estimate probabilistic models based on these counts and sums, e.g. the variance estimate σ² = (1/n) Σ_i (x_i - x̄)², which is slightly biased (compared to the 1/(n-1) version), but simpler.
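The point that "counting is statistics" can be made concrete: keeping only the running sums n, Σx and Σxx^T is enough to recover the mean and the (biased, maximum-likelihood) covariance at any time. A small sketch in Python for two-dimensional data (hypothetical helper functions, not streamdrill's API):

```python
# Running sums for a 2-dimensional stream: only n, sum(x) and sum(x x^T)
# are kept; mean and covariance are derived from them on demand.
n = 0
s = [0.0, 0.0]                      # running sum of x
ss = [[0.0, 0.0], [0.0, 0.0]]       # running sum of outer products x x^T

def observe(x):
    """Fold one observation x = [x0, x1] into the running sums."""
    global n
    n += 1
    for i in range(2):
        s[i] += x[i]
        for j in range(2):
            ss[i][j] += x[i] * x[j]

def mean():
    return [si / n for si in s]

def cov():
    """ML covariance: (1/n) sum x x^T - mean mean^T (slightly biased)."""
    m = mean()
    return [[ss[i][j] / n - m[i] * m[j] for j in range(2)] for i in range(2)]
```

Because the sums are additive, they can be updated one event at a time, which is exactly what a streaming setting needs.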
Outlier Detection
Once you have a model, you can compute p-values (based on recent time frames!) and flag events that are too unlikely under the model.
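As a sketch of the p-value idea, assuming a Gaussian model fitted to the recent window (an illustrative choice; the slide does not fix a particular model):

```python
import math

def p_value(x, mean, std):
    """Two-sided Gaussian p-value: probability of a deviation from the
    mean at least as large as |x - mean|, under N(mean, std^2)."""
    z = abs(x - mean) / std
    return math.erfc(z / math.sqrt(2))
```

An observation with a p-value below some threshold (say 0.01) would then be reported as an outlier.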
Online TF-IDF
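The slide gives no details, but the idea of TF-IDF from streamed counts can be sketched as follows: maintain per-document term counts and global document frequencies online, and compute tf-idf on demand (hypothetical helpers; names and formula variant are assumptions, and decayed counters could replace the plain counts):

```python
import math

doc_terms = {}   # doc -> {term: count}
doc_freq = {}    # term -> number of documents containing the term

def add_term(doc, term):
    """Fold one (doc, term) event into the counts."""
    terms = doc_terms.setdefault(doc, {})
    if term not in terms:
        doc_freq[term] = doc_freq.get(term, 0) + 1
    terms[term] = terms.get(term, 0) + 1

def tf_idf(doc, term):
    """Raw term frequency times log inverse document frequency."""
    tf = doc_terms.get(doc, {}).get(term, 0)
    idf = math.log(len(doc_terms) / doc_freq.get(term, 1))
    return tf * idf
```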
So Much More to Do with Trends
Least Recently Used caches, sparse vectors, sparse matrices, conditional probabilities (histograms), accumulators, ...
streamdrill
Core engine: Heavy Hitters counting + exponential decay; instant counts and top-k results over time windows; modules for specific use cases.
Features:
- In-memory, with snapshots to disk
- Written in Scala
- Interface: query by REST, push data by REST or UDP
- Single-node performance: up to 20k events/s, about 1M objects per GB
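The "counting + exponential decay" combination can be illustrated with a counter whose value halves every fixed half-life, applying the decay lazily at read and update time (a common trick; parameter names are assumptions and the actual streamdrill internals may differ):

```python
import math

class DecayedCounter:
    """A counter whose value decays exponentially with a given half-life.
    Decay is applied lazily whenever the counter is read or updated."""

    def __init__(self, half_life):
        self.rate = math.log(2) / half_life
        self.value = 0.0
        self.last = 0.0  # timestamp of the last decay

    def _decay(self, now):
        self.value *= math.exp(-self.rate * (now - self.last))
        self.last = now

    def add(self, amount, now):
        self._decay(now)
        self.value += amount

    def get(self, now):
        self._decay(now)
        return self.value
```

Combined with a Heavy Hitters table, such counters give top-k results over an effective time window without storing individual events.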
Architecture Overview
streamdrill Modules
Ready-made modules on top of the streamdrill core cover core business applications: profiling, recommendation, fraud detection.
Use Case: Realtime User Profiles
Objective: track user activity in different categories in realtime.
Event: (user, category)
Output: trends for (user, *)
Use Case: Realtime Recommendations
Objective: recommend items to users based on user interests and item popularity.
Event: (user, item, categories)
Output: user profiles to find categories for item trends
Use Case: Realtime Fraud Detection and Rate Limiting
Objective: identify unusually active users/IPs, or unusually high co-occurrence.
Event: (id) or (id, device)
Output: trend for ids, or the size of (id, *) or (*, device) above a threshold
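The "activity above a threshold" check can be sketched with a per-id sliding window (an illustrative stand-in for the trend-based check; function and parameter names are assumptions, and in practice decayed counters would replace the explicit event queues):

```python
from collections import defaultdict, deque

# id -> timestamps of recent events
events = defaultdict(deque)

def suspicious(ident, now, window=60.0, threshold=100):
    """Record one event for `ident` at time `now` and report whether its
    activity within the last `window` seconds exceeds `threshold`."""
    q = events[ident]
    q.append(now)
    while q and q[0] <= now - window:
        q.popleft()
    return len(q) > threshold
```

Requests from a suspicious id could then be rate-limited or flagged for review.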
Summary
Streamdrill: Big Data through approximation. Counts are the basis for (nearly) everything. Try our demo: streamdrill.com