Large Data Methods: Keeping it Simple on the Path to Big Data



Similar documents
Real World Big Data Architecture - Splunk, Hadoop, RDBMS

Big Data Unlock the mystery and see what the future holds. Philip Sow SE Manager, SEA

Big Data & Cloud Computing. Faysal Shaarani

Rise of the Machines: An Internet-Wide Analysis of Web Bots in 2014

Architectural Considerations for Hadoop Applications

THE KEY ADVANTAGES OF BUSINESS INTELLIGENCE AND ANALYTICS

Installation Guide. Step-by-Step Guide for clustering Hyper-V virtual machines with Sanbolic s Kayo FS. Table of Contents

Webtrends for SharePoint 2010 A Microsoft Preferred Analytics Solution for SharePoint

How, What, and Where of Data Warehouses for MySQL

Chapter 6. E-commerce Marketing Concepts. Copyright 2009 Pearson Education, Inc. Slide 6-1

Data Warehousing and Data Mining

SKoolAide Privacy Policy

Cloud Computing at Google. Architecture

Analytic Applications With PHP and a Columnar Database

Leveraging Machine Data to Deliver New Insights for Business Analytics

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

web analytics ...and beyond Not just for beginners, We are interested in your thoughts:

Direct Marketing Analytics with R

Big Data & Netflix. Paul Ellwood February 9th, 2015

Big Data Analytics Nokia

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Marketing Automation and Analytics. Are You Getting the Most out of your Website? Asheville Chamber of Commerce Nov 11, 2014

PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor

Maksym Iaroshenko Co-Founder and Senior Software Engineer at Eltrino. Magento non-mysql implementations

Data Driven Success. Comparing Log Analytics Tools: Flowerfire s Sawmill vs. Google Analytics (GA)

IS466 Decision Support Systems. SQL Server Business Intelligence Development Studio 2008 User Guide

This document describes the capabilities of NEXT Analytics v5.1 to retrieve data from Google Analytics directly into your spreadsheet file.

UNIFY YOUR (BIG) DATA

Transforming the Telecoms Business using Big Data and Analytics

ER/Studio Enterprise Portal User Guide

Teradata Unified Big Data Architecture

Aaron Werman.

SQL Performance for a Big Data 22 Billion row data warehouse

10 Essential Google Analytics Reports And How They Matter to B2B Executives

Google Analytics Guide. for BUSINESS OWNERS. By David Weichel & Chris Pezzoli. Presented By

Google Analytics: Tracking Where a Visitor Goes on Your Web Site

SQL Server 2012 Business Intelligence Boot Camp

Tiber Solutions. Understanding the Current & Future Landscape of BI and Data Storage. Jim Hadley

Oracle Data Miner (Extension of SQL Developer 4.0)

10231B: Designing a Microsoft SharePoint 2010 Infrastructure

We may collect the following types of information during your visit on our Site:

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Web Analytics Definitions Approved August 16, 2007

Microsoft Data Warehouse in Depth

Data Mining in the Swamp

QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering

Apache Kylin Introduction Dec 8,

Implementing Data Models and Reports with Microsoft SQL Server 2012 MOC 10778

Sisense. Product Highlights.

Data warehousing with PostgreSQL

Practical Web Analytics for User Experience

Product: DQ Order Manager Release Notes

A Look at Self Service BI with SAP Lumira Natasha Kishinevsky Dunn Solutions Group SESSION CODE: 1405

Data Is Integral To Our Culture

Customer Analytics. Turn Big Data into Big Value

POLAR IT SERVICES. Business Intelligence Project Methodology

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Google Analytics and Google Analytics Premium: limits and quotas

Implementing Data Models and Reports with Microsoft SQL Server 20466C; 5 Days

Big Data; Old News or New Hype? Marcel den Hartog, June 2012

Client Requirement. Why SharePoint

Thank you for visiting this website, which is owned by Essendant Co.

Introduction. Chapter 1 Why Understanding Your Web Traffic Is Important to Your Business 3

Course Outline: Course 6317: Upgrading Your SQL Server 2000 Database Administration (DBA) Skills to SQL Server 2008 DBA Skills

Hadoop & its Usage at Facebook

Free Trial - BIRT Analytics - IAAs

DBMS / Business Intelligence, SQL Server

SQL Access to OpenEdge Apps

Google Analytics 101

Establish and maintain Center of Excellence (CoE) around Data Architecture

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

Performance Analytics with TDSz and TCR

iservdb The database closest to you IDEAS Institute

WEBSITE ANALYSIS OVERVIEW

Real-time Big Data Analytics with Storm

LEARNING SOLUTIONS website milner.com/learning phone

SQL Server 2005 Features Comparison

MOC 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Implementing a Data Warehouse with Microsoft SQL Server 2012

the missing log collector Treasure Data, Inc. Muga Nishizawa

Transcription:

Large Data Methods: Keeping it Simple on the Path to Big Data Big Data Business Forum, San Francisco, November 13, 2012 Jim Porzak, Sr. Dir. Business Intelligence, Minted 1

What we will cover: 1. Who is Minted? 2. Our large data challenges. 3. Large data solutions. 4. Migration to big data. 5. Discussion 2

About Minted A social commerce site. Crowd-sourcing graphic designs and art from a global design community. Selling those as printed paper products. Initially focused on the $10 billion stationery and $48 billion art markets. Combining community with commerce. Built on stellar technology, operations, and customer service. 3

Minted.com Architecture ~ Classic LAMP Integrated back office: MBO MySQL holds site & MBO data 4

Minted BI Architecture On Amazon EC2: Replicated MySQL site DB PostgreSQL BI DB Tableau Server 5

In support of marketing: Our job is to understand the customer, the whole customer, and nothing but the customer. 6

CIA 7

C ustomer I nitiated A ctions 8

CIA s are: Ordering Other Minted.com actions Responding to Minted outreach Email opens, reads, clicks Contacting Minted Social Behavior And more! 9

Ordering: How do you sell bread? 1. You tell me what you want. 2. I give you the bread. 3. I tell you how much it costs. 4. You give me the money. 10

11

Other Site CIA s 1. Google Analytics Not for individual visitor! f(tagging(t)) 2. App-request logs CIA s & some 2 nd level Visitor (cookie) & user ID s 12 months of history 12

Responses to email: 1. Sends Transactional: order ack, ship, Marketing: retention, offers Targeted & personalized 2. CIA s Bounce, open, click, buy Opt-outs, opt-ins 13

Other data sources: Convertro Survey Tools Demographic Appends 14

Roadmap to Big Data Large Data: PostgreSQL Big Data: Some columnar DB Bigger Data: Some map-reduce platform 15

Why PostgreSQL? Open source (~free) SQL for analytics Window functions, etc. Known to scale Foundation for many of the columnar DB products 16

BI in PostgreSQL philosophy Data structures as if columnar Big & wide; not star Focus on high impact first: Orders Order Detail Customers Site Sessions 17

Primary wide tables 18

Table Details Customer: 35 columns ID, source, acquisition date/source, purchase profiles, site profiles, demo profiles Order Summary: 72 columns ID s, date, order seq #, gap, $ s, # s, flags, top, prior, first product code/group/class, to-date $ s by class, geo, source. Order Product Detail: 27 columns ID s, timestamp, SKU, $, #, promo, details of options Site Sessions: 27 columns ID s, seq #, # events, duration, timestamps, gap, funnel flags, products, actions, sources, media, campaigns, Site Clean Events: 21 columns ID s, timestamp, ip, seq #, interval to prior, entry/exit actions, source/medium/campaign, sku, 19

Performance Queries off of these tables very fast; typically sub-minute for even the most complex. Tableau server has internal columnar engine for interactive analytics performance. Nightly refresh & updates in under three hours with no attempt at tuning. 20

Next Steps: Finalize logical design in PostgreSQL based on needs of our business partners over next few months. Tune PostgreSQL platform test limits of scale. Estimate when we will need to move to next level. In meantime: POC s on a couple of columnar DBs. Migrate to final columnar DB. 21

Questions? Comments? Now would be the time! 22

APPENDIX Application request logs deep dive 23

What s an app-request log? {"time_start":1313620339.85,"time_end":1313620340.01,"request":{"headers":[["host","localhost:8888"],["connection", "keep-alive"],["cache-control","max-age=0"],["user-agent","mozilla\/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit\/534.30 (KHTML, like Gecko) Chrome\/12.0.742.112 Safari\/534.30"],["Accept","text\/html,application\/xhtml+xml,application\/xml;q=0.9,*\/*;q=0.8"],["Accept- Encoding","gzip,deflate,sdch"],["Accept-Language","en-US,en;q=0.8"],["Accept-Charset","ISO-8859-1,utf- 8;q=0.7,*;q=0.3"]],"method":"GET","remote_addr":"127.0.0.1","protocol":"HTTP\/1.1","uri":"\/register"},"response":{"stat us":null,"headers":[["expires","sun, 19 Nov 1978 05:00:00 GMT"],["Last-Modified","Wed, 17 Aug 2011 22:32:19 GMT"],["Cache-Control","store, no-cache, must-revalidate"],["cache-control","post-check=0, pre-check=0"],["content- Type","text\/html; charset=utf-8"],["x-powered-by","php\/5.3.2; Qcodo\/0.3.24 (Qcodo Beta 3)"],["Set- Cookie","minted_tr=UQ%BDn%830%10%7E%17%EF%04%8C%9D41S%D4%A1%EA%90%8C%5D%91%03%26% B5%0Aq%E5%3B%90h%94w%EF%9D%0BR3%FA%BE%DF%3B%5B%23%B5%B9%83QF4%16EeMa%EE%8F% F4%9E%5C%14%957%B2%02%B3Oh%0D%7D%40%1E%95%5BEC%29%8D%E8%7C%04%AC%27%0F%3E%01 %B2%D4%3B%BD%D7%07%A9%19%2F%8C%E8%ED%13%AC%A4%DA%95%85%D2%05%C3z%95G%D7%B9 %189%0D%8C%A6%E0%8BC%AB6%83%BF%A1k7M%18%F2f%04%0C%83%FFq1%87%AF1%3F%BD%9F%B3 %E3%DBkv%FA8%B2D%AA%25%E7%BF%0Fe%27%AC%5CC+%8C%B1q%A2%3A%2F%CD%93%E2iH%D4%D 44%C4%1A%E7o%27%96%BDS%F6%9F%29%27%AD%94%86%0C%80%EFU%F2%F9%B6%04%04d%D6e%3D X%EB%C0_o%EB%88NJ%FEC%CF%0A%C9%8A%17ftv%EC%D3e%0A%26%1C%D6%AE%E9%27H%40.C%C0 O%17%21k%ED%9C%5D%7D%87%90%01Z%F4%81%8C%E7%A5a%DAd%91pC%B0%93%CBHE%1A%A4- %1E%BF; expires=mon, 13-Feb-2012 22:32:20 GMT; path=\/"]]}} 24

Making app-requests useful 1. Parse to.csv (Python) 2. Load to BI PostgreSQL DB 3. Clean & more parsing 4. Sessionize 5. Analyze 25

Sessionization in PostgreSQL Part 1 -- get event sequence #s & seconds after prior event CREATE TABLE v_sessions1 AS SELECT *, ROW_NUMBER() OVER(Members) AS event_seq_number, event_at - LAG(event_at) OVER(Members) AS interval_to_prior FROM v_events WINDOW Members AS (PARTITION BY member_id -- unique member ID ORDER BY event_at -- timestamp of event ) ; 26

Sessionization in PostgreSQL Part 2 -- update with session sequence # CREATE TABLE v_sessions2 AS SELECT *, COUNT(CASE WHEN interval_to_prior IS NULL OR interval_to_prior > '30 minutes' THEN 1 ELSE NULL END) OVER(Members) AS session_seq_number FROM v_sessions1 WINDOW Members AS (PARTITION BY member_id -- unique member ID ORDER BY event_seq_number -- Session # ) ; 27

Sessionization in PostgreSQL Part 3 -- now roll up into sessions getting session start, total time in session, -- site areas explored, other site specific rollups CREATE TABLE v_sessions AS SELECT member_id, ; session_seq_number, MIN(event_at) AS session_start_at, COUNT(*) AS num_events_in_session, SUM(CASE WHEN interval_to_prior > '30 minutes' THEN NULL ELSE interval_to_prior END) AS session_duration, STRING_AGG(DISCRETE site_area, ', ') AS site_areas_visited, <other site specific aggregations> FROM v_sessions2 GROUP BY member_id, session_seq_number ORDER BY member_id, session_seq_number 28

How mining app-request logs helps us understand the customer: 1. Sales funnel analysis by product & YOY. 2. Customer s individual interests, preferences, 3. Customer s evolution in relation w/ Minted 4. Usage based customer segments 5. And we will discover more! 29