Case study: d60 Raptor smartadvisor
Jan Neerbek, Alexandra Institute





Agenda
- d60: A cloud/data mining case
- Cloud
- Data mining
- Market basket analysis
- Large data sets
- Our solution

Alexandra Institute
The Alexandra Institute is a non-profit company that works with application-oriented IT research. Our focus is pervasive computing, and we activate the business potential of our members and customers through research-based, user-driven innovation.

The case: d60
- Danish company
- A similar-products recommendation engine
- d60 was outgrowing their servers (late 2010)
- They saw a potential in moving to Azure

The setup
[Figure: webshops connected over the Internet to the product recommendation service]
- Product recommendations
- Log shopping patterns
- Do data mining

The cloud potential
- Elasticity
- No upfront server cost
- Cheaper licenses
- Faster calculations

Challenges
- No SQL Server Analysis Services (SSAS)
- Small compute nodes
- Partitioned database (50 GB)
- SQL Server ingress/egress access is slow

The cloud

The cloud and services
- Data layer service
- Messaging service

Data layer service
- Application specific (schema/layout)
- SQL, table or other
- Easily becomes a bottleneck
- Can be difficult to scale

Messaging service
- Task queues
- Standard data structure
- Built-in ordering (FIFO)
- Can be scaled
- Good for asynchronous messages
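The task-queue idea can be sketched with Python's standard library. This is only an in-process stand-in for a cloud messaging service: the worker function, the number of workers and the squaring "work" are illustrative assumptions, not part of the d60 system.

```python
import queue
import threading

def worker(tasks, results):
    """Take tasks off the shared queue until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:
            break
        results.put((task, task * task))  # placeholder for real work

tasks = queue.Queue()    # FIFO: tasks are handed out in insertion order
results = queue.Queue()  # workers report back asynchronously

workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):
    tasks.put(n)
for _ in workers:
    tasks.put(None)      # one stop signal per worker
for w in workers:
    w.join()

squares = dict(results.get() for _ in range(10))
```

Adding workers requires no change to the producer, which is the sense in which a queue "can be scaled".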


Data mining
- Data mining is the use of automated data analysis techniques to uncover relationships among data items.
- Market basket analysis is a data mining technique that discovers co-occurrence relationships among activities performed by specific individuals.
[about.com/wikipedia.org]


Market basket analysis
[Figure: baskets of Customer1, Customer2, Customer3 and Customer4; item labels: Beef, Cereal, Lemons, Butter, Beef, Potatoes, Chips]
- Itemset (, ) occurs in 50% of the baskets
- Frequency threshold parameter
- Find as many frequent itemsets as possible
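The role of the frequency threshold can be illustrated with a brute-force sketch. The item names come from the slide, but the assignment of items to customers below is hypothetical, since it did not survive in the transcript. The function name `frequent_itemsets` is also an assumption. Enumerating every candidate like this is only practical for tiny examples.

```python
from itertools import combinations

# Hypothetical baskets: item names are from the slide, the assignment is not.
baskets = {
    "Customer1": {"Butter", "Potatoes"},
    "Customer2": {"Beef", "Butter", "Potatoes"},
    "Customer3": {"Beef", "Cereal", "Chips"},
    "Customer4": {"Lemons", "Beef"},
}

def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for itemsets with support >= min_support."""
    n = len(baskets)
    items = sorted(set().union(*baskets.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = fraction of baskets that contain every item of the candidate.
            hits = sum(1 for b in baskets.values() if set(candidate) <= b)
            if hits / n >= min_support:
                result[candidate] = hits / n
    return result

found = frequent_itemsets(baskets, min_support=0.5)
```

With these baskets and a 50% threshold, the only frequent pair is (Butter, Potatoes), which occurs in two of the four baskets.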

Market basket analysis
- Popular, effective algorithm: FP-growth
- Based on the FP-tree data structure
- Requires all data in near-memory
- Most research in distributed models has been for cluster setups

Building the FP-tree (extends the prefix-tree structure)
[Figure: Customer1's transaction inserted into the tree; node labels: Butter, Potatoes]

Building the FP-tree
[Figure: Customer2's transaction inserted into the tree; node labels: Butter, Potatoes]


Building the FP-tree
[Figure: the finished tree; node labels: Beef, Butter, Chips, Cereal, Potatoes, Lemon]
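A minimal sketch of FP-tree construction in Python, following the usual two-scan scheme (count items, then insert each transaction with its items ordered by descending frequency). The `Node` class, the `build_fp_tree` name, the tie-breaking by item name and the sample transactions are all illustrative assumptions. The header table and node links of a full FP-tree are left out.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, how many transactions pass through it, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First scan: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None)
    # Second scan: insert frequent items, most frequent first, sharing common prefixes.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

# Hypothetical transactions using the item names from the slides.
transactions = [
    ["Butter", "Potatoes"],
    ["Beef", "Butter", "Potatoes"],
    ["Beef", "Cereal", "Chips"],
    ["Lemons", "Beef"],
]
tree = build_fp_tree(transactions, min_count=2)
```

Here the three transactions containing Beef share a single Beef node under the root, which is where the compression of the prefix-tree structure comes from.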

FP-growth
Grows the frequent itemsets, recursively

    FP-growth(FP-tree tree) {
      for-each (item in tree) {
        count = CountOccur(tree, item);
        if (IsFrequent(count)) {
          OutputSet(item);
          sub = tree.gettree(tree, item);
          FP-growth(sub);
        }
      }
    }
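The recursion can be made concrete in Python. To keep the sketch short it grows patterns over plain projected transaction lists instead of a compressed FP-tree, so it shows the shape of the recursion rather than the memory savings of the real algorithm. The function name `pattern_growth`, the alphabetical item order and the sample transactions are assumptions.

```python
from collections import Counter

def pattern_growth(transactions, min_count, postfix=()):
    """Yield (itemset, count) for every frequent itemset that ends with postfix."""
    counts = Counter(item for t in transactions for item in t)
    for item in sorted(counts):
        if counts[item] < min_count:
            continue
        itemset = (item,) + postfix
        yield itemset, counts[item]
        # Sub-problem: transactions containing item, restricted to items that
        # sort before it, so every itemset is generated exactly once.
        projected = [[i for i in t if i < item] for t in transactions if item in t]
        yield from pattern_growth(projected, min_count, itemset)

# Hypothetical transactions (each item at most once per transaction).
transactions = [
    ["Butter", "Potatoes"],
    ["Beef", "Butter", "Potatoes"],
    ["Beef", "Cereal", "Chips"],
    ["Lemons", "Beef"],
]
frequent = dict(pattern_growth(transactions, min_count=2))
```

Each recursive call plays the role of `FP-growth(sub)` in the pseudocode, with `projected` standing in for the sub-tree.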

FP-growth algorithm
Divide and conquer:
- Traverse tree
- Generate sub-trees
- Call recursively
[Figure: FP-tree with node labels Beef, Butter, Chips, Cereal, Butter, Potatoes, Lemon]

FP-growth algorithm: memory usage
The FP-tree does not fit in local memory; what to do?
- Emulate distributed shared memory

Distributed shared memory?
[Figure: five nodes, each with its own CPU and memory, connected by a network and presented as one shared memory]
- To add nodes is to add memory
- Works best in tightly coupled setups with low-latency, high-speed networks

FP-growth algorithm: memory usage
The FP-tree does not fit in local memory; what to do?
- Emulate distributed shared memory
- Optimize your data structures
- Buy more RAM
- Get a good idea

Get a good idea
- Database scans are serial and can be distributed
- The list of items used in the recursive calls uniquely determines what part of the data we are looking at
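Both observations can be checked with a small Python sketch: given only a list of items (the postfix), each partition of the data can be scanned independently, and the partial counts add up to what a single full scan would give. The `count_in_bucket` name, the two-way split and the sample transactions are illustrative assumptions.

```python
from collections import Counter

def count_in_bucket(bucket, postfix):
    """Scan one partition: count the items that co-occur with every item of postfix."""
    needed = set(postfix)
    counts = Counter()
    for transaction in bucket:
        items = set(transaction)
        if needed <= items:
            counts.update(items - needed)
    return counts

# Hypothetical data, split into two partitions that could live on different nodes.
buckets = [
    [["Butter", "Potatoes"], ["Beef", "Butter", "Potatoes"]],
    [["Beef", "Cereal", "Chips"], ["Lemons", "Beef"]],
]
postfix = ("Butter",)

# Each partition is scanned on its own; only the small count tables are combined.
total = sum((count_in_bucket(b, postfix) for b in buckets), Counter())

# The same scan over the undivided data, for comparison.
single_scan = count_in_bucket([t for b in buckets for t in b], postfix)
```

Because the postfix alone identifies the relevant slice of data, no sub-tree has to be shipped between nodes.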

Get a good idea
[Figure: FP-tree with node labels Beef, Butter, Chips, Cereal, Butter, Potatoes, Lemon]
- Butter, Butter: these are postfix paths


Buckets
- Use postfix paths for messaging
- Working with buckets
[Figure: transactions and items divided into buckets]

FP-growth revisited

    FP-growth(FP-tree tree) {
      for-each (item in tree) {              // done in parallel
        count = CountOccur(tree, item);      // done in parallel
        if (IsFrequent(count)) {
          OutputSet(item);
          sub = tree.gettree(tree, item);    // replaced with postfix
          FP-growth(sub);                    // done in parallel
        }
      }
    }

Communication
[Figure: worker nodes communicating through the data layer]

Revised communication
[Figure: worker nodes communicating through the message queue (MQ) and the data layer]

Running FP-growth
- Distribute buckets
- Count items (with postfix size = n)
- Collect counts (per postfix)
- Call recursively
- Standard FP-growth
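The steps above can be simulated in a single Python process. This is a sketch under assumptions, not the d60 implementation: a `deque` stands in for the message queue, the buckets are plain lists, every name is hypothetical, and the final hand-over to standard FP-growth is omitted. Each message is just a postfix, which keeps messages small.

```python
from collections import Counter, deque

def count_bucket(bucket, postfix):
    """Worker step: count candidate items in one bucket for the given postfix."""
    needed = set(postfix)
    counts = Counter()
    for t in bucket:
        items = set(t)
        if needed <= items:
            # Only items sorting before the postfix head, so each itemset is visited once.
            counts.update(i for i in items - needed if not postfix or i < postfix[0])
    return counts

def message_driven_mining(buckets, min_count):
    frequent = {}
    messages = deque([()])                 # start with the empty postfix
    while messages:
        postfix = messages.popleft()
        # In a real deployment each bucket would be counted on a different node.
        partial = [count_bucket(b, postfix) for b in buckets]
        total = sum(partial, Counter())    # collect counts per postfix
        for item, count in total.items():
            if count >= min_count:
                itemset = (item,) + postfix
                frequent[itemset] = count
                messages.append(itemset)   # the recursive call becomes a new message
    return frequent

# Hypothetical buckets using the item names from the slides.
buckets = [
    [["Butter", "Potatoes"], ["Beef", "Butter", "Potatoes"]],
    [["Beef", "Cereal", "Chips"], ["Lemons", "Beef"]],
]
result = message_driven_mining(buckets, min_count=2)
```

Since a message carries everything needed to redo its work, a message lost to a failed node could simply be re-queued.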


Collecting what we have learned
- Message-driven work, using a message queue
- Peer-to-peer for intermediate results
- Distribute data for scalability (buckets)
- Small messages (list of items)
- Together these allow us to distribute FP-growth

Advantages
- Configurable work sizes
- Good distribution of work
- Robust against computer failure
- Fast!

So what about performance?
[Chart: total node time (hh:mm:ss, axis from 00:00:00 to 04:30:00) for message-driven FP-growth and FP-growth, at x-axis values 1, 2, 4 and 8]

Thank you!