BIG DATA FOR MEDIA SIGMA DATA SCIENCE GROUP MARCH 2ND, OSLO
ANTHONY A. KALINDE SIGMA DATA SCIENCE GROUP ASSOCIATE
"REALTIME BEHAVIOURAL DATA COLLECTION CLICKSTREAM EXAMPLE"
WHAT IS CLICKSTREAM ANALYTICS?
QUESTIONS ANSWERED BY CLICKSTREAM ANALYSIS
THINKING BIG! APPLYING BIG DATA ANALYTICS TO CLICKSTREAM DATA
THE HADOOP ECOSYSTEM
REALTIME CLICKSTREAM DATA COLLECTION ARCHITECTURE
USE CASES AND EXAMPLES: MAPR, KAFKA, DIVOLTE
BUT FIRST, SOME DEFINITIONS, SO WE ARE ALL SPEAKING THE SAME LANGUAGE
"Some companies have built their very businesses on their ability to collect, analyze and act on data. Every company can learn from what these firms do."
Davenport, T. H. (2006). Competing on analytics. Harvard Business Review, 84, 98-107.
Descriptive analytics: using historical data to describe the business.
Predictive analytics: using data to predict trends and patterns.
Prescriptive analytics: using data to suggest the optimal solution.
MATURITY MODELS MASH UP: ORDER OF MAGNITUDE
[Figure: analytics maturity chart; axis labels COMPETITIVE ADVANTAGE and DEGREE OF COMPLEXITY; annotation "Proactive"]
Prescriptive Analytics: How can we achieve the best outcome?
Predictive Analytics: What will happen next if? What if these trends continue? What could happen?
Descriptive Analytics: What actions are needed? What exactly is the problem? How many, how often, where? What happened?
SO, WHAT IS A CLICKSTREAM? & HOW IS IT RELEVANT FOR ANALYTICS IN MEDIA?
"It is a story in data points waiting to be discovered..." - Anonymous
A clickstream is simply the electronic record of Internet usage collected by web servers or third-party services.
DATA -> FILTER/AGGREGATE -> VISUALIZE -> STORY
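The path from data through filter/aggregate to something that can be visualized might be sketched as follows. This is a minimal illustration only: the record layout (user, URL, HTTP status) and the sample values are assumptions made for the example, not data from the talk.

```python
from collections import Counter

# Hypothetical click records: (user_id, url, http_status).
clicks = [
    ("u1", "/news/oslo", 200),
    ("u2", "/news/oslo", 200),
    ("u1", "/sport/ski", 200),
    ("u3", "/favicon.ico", 404),
    ("u2", "/sport/ski", 200),
    ("u1", "/news/oslo", 200),
]

def page_views(records):
    """Filter out failed requests, then aggregate views per URL."""
    ok_urls = (url for _user, url, status in records if status == 200)
    return Counter(ok_urls)

views = page_views(clicks)

# A crude text "visualization": one bar per URL, most viewed first.
for url, count in views.most_common():
    print(f"{url:12s} {'#' * count} {count}")
```

The story step is still left to a human: the bars only show which pages draw attention, not why.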
"Data science is the difference between report reading and actionable insight." - David Booth, Founding Partner, Cardinal Path
QUESTIONS ANSWERED BY CLICKSTREAM ANALYSIS
RECOMMEND WITHOUT DATA (KNOWN AS THE COLD START PROBLEM)
Content-based recommendation is the only option.
The user profile may have been provided explicitly by the user or derived from user behavior, e.g. pages visited, search terms, etc.
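One simple way to picture content-based recommendation from a user profile is term overlap between the profile and each item. The sketch below uses Jaccard similarity as an assumed scoring rule; the function names, profile terms and catalogue are hypothetical and chosen only for illustration.

```python
def content_score(profile_terms, item_terms):
    """Jaccard overlap between a user's profile terms and an item's terms."""
    profile, item = set(profile_terms), set(item_terms)
    if not profile or not item:
        return 0.0
    return len(profile & item) / len(profile | item)

def recommend(profile_terms, catalogue, top_n=2):
    """Rank catalogue titles by content score, best first."""
    ranked = sorted(
        catalogue,
        key=lambda title: content_score(profile_terms, catalogue[title]),
        reverse=True,
    )
    return ranked[:top_n]

# Profile terms could come from explicit input or from search terms
# and pages visited (assumed values).
profile = ["skiing", "oslo", "weather"]
catalogue = {
    "Holmenkollen ski results": ["skiing", "oslo", "sport"],
    "Stock market update": ["finance", "markets"],
    "Weekend weather in Oslo": ["weather", "oslo"],
}

top = recommend(profile, catalogue)
print(top)
```

No interaction history is needed here, which is why this style of scoring still works in the cold start case.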
TACKLE THE TRADE-OFF BETWEEN POPULARITY, FRESHNESS AND NORMAL ITEM
Warm start: personalized recommendations with limited interaction data.
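The popularity versus freshness trade-off can be illustrated with a score that discounts click counts by age. Exponential decay with a half-life is an assumption made for this sketch, as are the item names and numbers; the talk does not prescribe a formula.

```python
def freshness_weighted_score(clicks, age_hours, half_life_hours=6.0):
    """Popularity (click count) halved for every half-life of age."""
    decay = 0.5 ** (age_hours / half_life_hours)
    return clicks * decay

# Hypothetical items: name -> (clicks so far, age in hours).
items = {
    "old_hit": (1000, 24.0),
    "fresh_story": (200, 1.0),
    "normal_item": (150, 6.0),
}

ranking = sorted(
    items,
    key=lambda name: freshness_weighted_score(*items[name]),
    reverse=True,
)
print(ranking)
```

With a six-hour half-life the fresh story outranks the day-old hit despite having a fifth of the clicks; a longer half-life would shift the balance back toward popularity.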
RECOMMEND IN REAL TIME, ON THE FLY
Using recent as well as historical user engagement event data
Optionally, business logic
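Combining recent and historical engagement, with business logic applied afterwards, might look like the sketch below. The weighted blend, the 0.7 weight and the blocked-category rule are all assumptions chosen to keep the example small.

```python
def blended_scores(historical, recent, recent_weight=0.7):
    """Weighted blend of historical and recent per-category engagement counts."""
    categories = set(historical) | set(recent)
    return {
        c: (1 - recent_weight) * historical.get(c, 0)
        + recent_weight * recent.get(c, 0)
        for c in categories
    }

def recommend_now(historical, recent, blocked=frozenset()):
    """Pick the best-scoring category that business rules allow."""
    scores = blended_scores(historical, recent)
    allowed = {c: s for c, s in scores.items() if c not in blocked}
    return max(allowed, key=allowed.get)

# Hypothetical engagement counts for one user.
historical = {"sport": 40, "politics": 10}
recent = {"politics": 30, "culture": 5}

print(recommend_now(historical, recent))
print(recommend_now(historical, recent, blocked={"politics"}))
```

The first call follows the user's current session toward politics; the second shows a business rule overriding that choice and falling back to the long-term favourite.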
THINKING BIG! APPLYING BIG DATA ANALYTICS TO CLICKSTREAM DATA WHAT DOES A 360 VIEW OF YOUR CUSTOMERS MEAN?
WHAT KINDS OF DATA CAN WE COLLECT AND HOW CAN WE LEVERAGE THIS?
WHAT ARE YOU MISSING?
95% OF USERS DO NOT CREATE A BASKET. OF THOSE THAT DO, ONLY HALF BEGIN THE CHECKOUT PROCESS, AND OF THOSE, TWO-THIRDS ACTUALLY COMPLETE A PURCHASE.
ANSWER? 98%
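The 98% figure follows from multiplying the funnel stages together. A short check using the percentages on the slide (the variable names are only for illustration):

```python
from fractions import Fraction

create_basket = 1 - Fraction(95, 100)   # 5% of users create a basket
begin_checkout = Fraction(1, 2)         # half of those begin checkout
complete_purchase = Fraction(2, 3)      # two-thirds of those buy

converted = create_basket * begin_checkout * complete_purchase
lost = 1 - converted

print(f"converted: {float(converted):.2%}, lost: {float(lost):.2%}")
# prints "converted: 1.67%, lost: 98.33%"
```

Only 1 visitor in 60 completes a purchase, so roughly 98% of visitors leave without buying.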
CAPTURING VALUE IN FAST DATA
WHERE DISTRIBUTED PROCESSING COMES IN
FAST DATA = BIG DATA GROWING UP
"The way that big data gets big is through a constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be analyzed and stored." - John Hugg, software architect at VoltDB
THE HADOOP ECOSYSTEM
REALTIME CLICKSTREAM DATA COLLECTION ARCHITECTURE
APACHE KAFKA LOG-CENTRIC, DISTRIBUTED PUBLISH-SUBSCRIBE MESSAGING SYSTEM
APACHE KAFKA
Maintains feeds of messages in categories called topics.
Processes that publish messages to a topic are called producers.
Processes that subscribe to topics and process the feed are called consumers.
Runs as a cluster comprised of one or more servers, each of which is called a broker.
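The relationship between topics, producers, consumers and a broker can be pictured with a small in-memory model. This is a toy sketch and not the Kafka client API: the class names `ToyBroker` and `ToyConsumer` and their methods are hypothetical, and a single process stands in for the cluster.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: append to the topic's log and return the offset."""
        self.logs[topic].append(message)
        return len(self.logs[topic]) - 1

    def read(self, topic, offset):
        """Consumer side: return all messages at or after the given offset."""
        return self.logs[topic][offset:]

class ToyConsumer:
    """Tracks its own read position, so consumers progress independently."""

    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages

broker = ToyBroker()
broker.publish("clicks", {"user": "u1", "url": "/news"})
broker.publish("clicks", {"user": "u2", "url": "/sport"})

dashboard = ToyConsumer(broker, "clicks")
archiver = ToyConsumer(broker, "clicks")

first_batch = dashboard.poll()     # the two messages published so far
broker.publish("clicks", {"user": "u1", "url": "/sport"})
second_batch = dashboard.poll()    # only the newly published message
archive_batch = archiver.poll()    # all three; the log is not consumed away
```

Because messages stay in the log and each consumer keeps its own offset, a slow archiver and a fast dashboard can read the same topic without interfering with each other.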
APACHE STORM OPEN-SOURCE DISTRIBUTED REALTIME COMPUTATION SYSTEM
APACHE STORM STORM BASICS
Spouts represent a streaming source and typically read from a queueing system.
A bolt is where the computation logic sits.
A topology is a network of these spouts and bolts.
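The spout, bolt and topology roles can be mimicked with plain Python generators. This is a conceptual sketch only and does not use Storm's API; the function names and the event layout are assumptions for illustration.

```python
from collections import Counter

def click_spout(queue):
    """Spout: emits events from a streaming source (here, a plain list)."""
    for event in queue:
        yield event

def parse_bolt(stream):
    """Bolt: keeps only page-view events and extracts the URL."""
    for event in stream:
        if event["type"] == "pageview":
            yield event["url"]

def count_bolt(stream):
    """Bolt: maintains a running count per URL."""
    counts = Counter()
    for url in stream:
        counts[url] += 1
    return counts

# Hypothetical events standing in for a queueing system.
queue = [
    {"type": "pageview", "url": "/news"},
    {"type": "scroll", "url": "/news"},
    {"type": "pageview", "url": "/sport"},
    {"type": "pageview", "url": "/news"},
]

# Topology: spout -> parse bolt -> count bolt.
counts = count_bolt(parse_bolt(click_spout(queue)))
print(counts)
```

In a real deployment each stage would run in parallel across machines; here the wiring of the three functions is what plays the part of the topology.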
APACHE STORM CLICKSTREAM ANALYSIS WITH STREAM
Augment online customer experience
Targeted content placement
Scalability: up to one million 100-byte messages per second per node
DIVOLTE.JS SCALABLE CLICKSTREAM COLLECTOR FOR COLLECTING DATA IN HDFS AND ON KAFKA TOPICS
DIVOLTE.JS Modern click event collection
Instead of using the server-side log event, an event is generated on the client side, often called tagging.
DIVOLTE.JS Features
Single tag site integration
Event logging is asynchronous
Custom schema = On the fly parsing
Built for Big data
Include Divolte Collector just before the closing body tag...
<script src="//ec2-52-10-241-39.us-west-2.compute.amazonaws.com:8290/divolte.js" defer async></script>
</body>
CASSANDRA NOSQL, DISTRIBUTED DATABASE MANAGEMENT SYSTEM
CASSANDRA
Distributed processing
Decentralized peer-to-peer architecture
A single Cassandra cluster can run across geographically dispersed data centers
ELASTICSEARCH DISTRIBUTED, MULTITENANT-CAPABLE FULL-TEXT SEARCH ENGINE
KIBANA BROWSER-BASED ANALYTICS AND SEARCH DASHBOARD FOR ELASTICSEARCH
WHY MAPR?
[Figure: MapR architecture diagram; recoverable labels listed below]
Clickstream Analysis (predictive analysis)
Customer 360 Dashboard
Data Exploration (SQL)
Integrated, single cluster
Real time: high performance, low latency
Large-scale analytics
Enterprise-grade HA/DR
Unified file and table administration
Mobile Application Server
Web Application Server
DB
Operations
Real-time Ad Targeting
Real Time and Actionable Analytics
Product/Service Optimization and Personalization
WHY MAPR? HERE ARE 5 REASONS WHY
HIGH AVAILABILITY WORLD-RECORD PERFORMANCE EASE OF DATA INTEGRATION REAL MULTI-LATENCY OPEN SOURCE READ-WRITE FILE SYSTEM
USE CASES & BENEFITS FOR MEDIA RECOMMENDERS & AGGREGATORS CONVERSION ABILITY TO REACT NOW VISITOR RELATIONSHIP MANAGEMENT FAST, EASY, CHEAP
CLICKSTREAM ANALYSIS EXAMPLE: Sacrifice small children and body parts to the gods of live demos
SEVERAL SMALL CHILDREN
THANK YOU FOR LISTENING!