BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO





ANTHONY A. KALINDE SIGMA DATA SCIENCE GROUP ASSOCIATE

"REALTIME BEHAVIOURAL DATA COLLECTION: CLICKSTREAM EXAMPLE"
WHAT IS CLICKSTREAM ANALYTICS?
QUESTIONS ANSWERED BY CLICKSTREAM ANALYSIS
THINKING BIG! APPLYING BIG DATA ANALYTICS TO CLICKSTREAM DATA
THE HADOOP ECOSYSTEM
REALTIME CLICKSTREAM DATA COLLECTION ARCHITECTURE
USE CASES AND EXAMPLES: MAPR, KAFKA, DIVOLTE

BUT FIRST, SOME DEFINITIONS, SO WE ARE ALL SPEAKING THE SAME LANGUAGE
"Some companies have built their very businesses on their ability to collect, analyze and act on data. Every company can learn from what these firms do."
Davenport, T. H. (2006). Competing on analytics. Harvard Business Review, 84, 98-107.

Descriptive analytics: using historical data to describe the business.
Predictive analytics: using data to predict trends and patterns.
Prescriptive analytics: using data to suggest the optimal solution.


MATURITY MODELS MASH UP: ORDER OF MAGNITUDE
[Figure: analytics maturity model. Axes: COMPETITIVE ADVANTAGE against DEGREE OF COMPLEXITY, moving towards "Proactive".]
Prescriptive Analytics: How can we achieve the best outcome?
Predictive Analytics: What will happen next if? What if these trends continue? What could happen?
Descriptive Analytics: What actions are needed? What exactly is the problem? How many, how often, where? What happened?

SO, WHAT IS A CLICKSTREAM? & HOW IS IT RELEVANT FOR ANALYTICS IN MEDIA?
"It is a story in data points waiting to be discovered..." - Anonymous

A clickstream is simply the electronic record of Internet usage collected by Web servers or third-party services.
DATA -> FILTER/AGGREGATE -> VISUALIZE -> STORY
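To make the data, filter/aggregate steps concrete, here is a minimal sketch in Python. The record layout (session id, timestamp, page) and the sample values are assumptions made for illustration; they do not come from the talk.

```python
from collections import Counter

# Hypothetical clickstream records: (session_id, timestamp_seconds, page).
# Field names and values are illustrative only.
events = [
    ("s1", 0, "/home"),
    ("s1", 12, "/sport/article-1"),
    ("s2", 3, "/home"),
    ("s2", 40, "/news/article-7"),
    ("s1", 55, "/news/article-7"),
    ("s3", 8, "/home"),
]


def page_views(records):
    """Aggregate step: count views per page."""
    return Counter(page for _, _, page in records)


def paths_by_session(records):
    """Group step: ordered list of pages visited in each session."""
    paths = {}
    for session, _, page in sorted(records, key=lambda r: (r[0], r[1])):
        paths.setdefault(session, []).append(page)
    return paths


views = page_views(events)
paths = paths_by_session(events)
print(views.most_common(2))
print(paths["s1"])
```

The per-page counts would feed a chart (the "visualize" step), while the per-session paths are the raw material for the "story".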

"Data science is the difference between report-reading and actionable insight" - David Booth, Founding Partner, Cardinal Path

QUESTIONS ANSWERED BY CLICKSTREAM ANALYSIS
RECOMMEND WITHOUT DATA (KNOWN AS THE COLD START PROBLEM)
Content-based recommendation is the only option. The user profile may have been provided explicitly by the user or derived from user behavior, e.g. pages visited, search terms, etc.

TACKLE THE TRADE-OFF BETWEEN POPULARITY, FRESHNESS AND NORMAL ITEM
Warm start: personalized recommendations with limited interaction data.
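One common way to balance popularity against freshness is to discount click counts by the age of the item. The sketch below assumes an exponential decay with a six-hour half-life; both the formula and the numbers are illustrative assumptions, not something stated in the talk.

```python
import math


def score(clicks, age_hours, half_life_hours=6.0):
    """Popularity discounted by age: the score halves every half_life_hours."""
    return clicks * math.pow(0.5, age_hours / half_life_hours)


# Hypothetical articles: (title, clicks so far, hours since publication).
articles = [
    ("yesterday's hit", 1000, 24),
    ("fresh story", 100, 1),
]

ranked = sorted(articles, key=lambda a: score(a[1], a[2]), reverse=True)
print([title for title, _, _ in ranked])
```

With these assumed values the fresh story outranks the older one despite having a tenth of the clicks; a longer half-life would shift the balance back towards popularity.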

RECOMMEND IN REAL-TIME, ON THE FLY
Using recent as well as historical user engagement event data.
Optionally, business logic.

THINKING BIG! APPLYING BIG DATA ANALYTICS TO CLICKSTREAM DATA WHAT DOES A 360 VIEW OF YOUR CUSTOMERS MEAN?

WHAT KINDS OF DATA CAN WE COLLECT AND HOW CAN WE LEVERAGE THIS?

WHAT ARE YOU MISSING?
95% OF USERS DO NOT CREATE A BASKET. OF THOSE THAT DO, ONLY HALF BEGIN THE CHECKOUT PROCESS, AND OF THOSE, TWO-THIRDS ACTUALLY COMPLETE A PURCHASE.
ANSWER? 98%
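The 98% figure follows from multiplying the funnel stages. A short check, using only the percentages quoted on the slide (the variable names are illustrative):

```python
from fractions import Fraction

create_basket = Fraction(5, 100)      # 95% of users do not create a basket
begin_checkout = Fraction(1, 2)       # half of those begin the checkout
complete_purchase = Fraction(2, 3)    # two-thirds of those complete a purchase

buyers = create_basket * begin_checkout * complete_purchase
non_buyers = 1 - buyers

print(f"buyers: {float(buyers):.1%}")
print(f"non-buyers: {float(non_buyers):.1%}")
```

Only 1 visitor in 60 completes a purchase, so roughly 98% of visitors leave behavioural data without ever buying.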

CAPTURING VALUE IN FAST DATA

WHERE DISTRIBUTED PROCESSING COMES IN
FAST DATA = BIG DATA GROWING UP
"The way that big data gets big is through a constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be analyzed and stored." - John Hugg, software architect at VoltDB

THE HADOOP ECOSYSTEM

REALTIME CLICKSTREAM DATA COLLECTION ARCHITECTURE

APACHE KAFKA
LOG-CENTRIC, DISTRIBUTED PUBLISH-SUBSCRIBE MESSAGING SYSTEM

Maintains feeds of messages in categories called topics.
Processes that publish messages to a topic are called producers.
Processes that subscribe to topics and process the feed are called consumers.
Runs as a cluster comprised of one or more servers, each of which is called a broker.
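The roles above can be illustrated with a toy, in-memory model. This is not the Kafka client API: `ToyBroker` and `ToyConsumer` are hypothetical names, and a real cluster adds partitions, replication and persistence. The sketch only shows the idea of an append-only log per topic that consumers read at their own pace.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: append a message and return its offset."""
        log = self.logs[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset):
        """Consumer side: return all messages from the given offset onward."""
        return self.logs[topic][offset:]


class ToyConsumer:
    """Tracks its own position in a topic, so consumers read independently."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages


broker = ToyBroker()
broker.publish("clicks", {"page": "/home"})
broker.publish("clicks", {"page": "/sport"})

dashboard = ToyConsumer(broker, "clicks")
archiver = ToyConsumer(broker, "clicks")

first_batch = dashboard.poll()             # the two messages published so far
broker.publish("clicks", {"page": "/news"})
second_batch = dashboard.poll()            # only the new message
archive_batch = archiver.poll()            # a late consumer still sees all three
```

Because the log is kept rather than emptied on read, a realtime dashboard and a batch archiver can consume the same clickstream topic without interfering with each other.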

APACHE STORM OPEN-SOURCE PROCESSING DISTRIBUTED REALTIME COMPUTATION SYSTEM

STORM BASICS
Spouts represent a streaming source and typically read from a queueing system.
A bolt is where the computation logic sits.
A topology is a network of these spouts and bolts.
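The spout, bolt and topology vocabulary can be mimicked with plain Python generators. This sketch does not use Storm itself; the function names and the "user page" line format are assumptions chosen for illustration, and a real topology would run distributed across many workers.

```python
from collections import Counter


def click_spout(raw_lines):
    """Spout: a streaming source. Here it replays a list instead of a queue."""
    for line in raw_lines:
        yield line


def parse_bolt(stream):
    """Bolt: split each 'user page' line into a (user, page) tuple."""
    for line in stream:
        user, page = line.split()
        yield user, page


def count_bolt(stream):
    """Bolt: emit a running view count per page."""
    counts = Counter()
    for _, page in stream:
        counts[page] += 1
        yield page, counts[page]


def run_topology(raw_lines):
    """Topology: wire spout -> parse bolt -> count bolt and drain the output."""
    return list(count_bolt(parse_bolt(click_spout(raw_lines))))


output = run_topology(["u1 /home", "u2 /home", "u1 /sport"])
print(output)
```

Each stage handles one event at a time and passes its result downstream, which is the property that lets a real topology keep counts up to date as clicks arrive.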

CLICKSTREAM ANALYSIS WITH STREAM
Augment online customer experience
Targeted content placement
Scalability: up to one million 100-byte messages per second per node


DIVOLTE.JS
SCALABLE CLICKSTREAM COLLECTOR FOR COLLECTING DATA IN HDFS AND ON KAFKA TOPICS

Modern click event collection
Instead of using the server-side log event, an event is generated on the client side, often called tagging.

Features
Single tag site integration
Event logging is asynchronous
Custom schema = on-the-fly parsing
Built for big data

Include Divolte Collector just before the closing body tag:

    <script src="//ec2-52-10-241-39.us-west-2.compute.amazonaws.com:8290/divolte.js" defer async></script>
    </body>

CASSANDRA
NOSQL, DISTRIBUTED DATABASE MANAGEMENT SYSTEM

Distributed processing
Decentralized peer-to-peer architecture
A single Cassandra cluster can run across geographically dispersed data centers

ELASTICSEARCH
DISTRIBUTED, MULTITENANT-CAPABLE FULL-TEXT SEARCH ENGINE

KIBANA BROWSER BASED ANALYTICS AND SEARCH DASHBOARD FOR ELASTICSEARCH

WHY MAPR?
Clickstream Analysis (predictive analysis)
Customer 360 Dashboard
Data Exploration (SQL)
Integrated
Single cluster
Real time
High performance, low latency
Large-scale analytics
Enterprise-grade HA/DR
Unified file and table administration
Mobile Application Server
Web Application Server
DB Operations
Real-time Ad Targeting
Real Time and Actionable Analytics
Product/Service Optimization and Personalization

WHY MAPR? HERE ARE 5 REASONS WHY
HIGH AVAILABILITY WORLD-RECORD PERFORMANCE EASE OF DATA INTEGRATION REAL MULTI-LATENCY OPEN SOURCE READ-WRITE FILE SYSTEM

USE CASES & BENEFITS FOR MEDIA
RECOMMENDERS & AGGREGATORS
CONVERSION
ABILITY TO REACT NOW
VISITOR RELATIONSHIP MANAGEMENT
FAST, EASY, CHEAP

CLICKSTREAM ANALYSIS EXAMPLE: Sacrifice small children and body parts to the gods of live demos. SEVERAL SMALL CHILDREN

THANK YOU FOR LISTENING!