Big Data Use Cases. At Salesforce.com. Narayan Bharadwaj Director, Product Management Salesforce.com. @nadubharadwaj





Safe harbor
Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include but are not limited to risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012. These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Agenda
- Big Data use cases
- Technology
- Use case discussion
- Collaborative Filtering
- Q&A

Got Cloud Data?
- 130k customers
- Millions of users
- 800 million transactions/day
- Terabytes/day

Technology

Big Data Ecosystem

Data Science tools ecosystem Apache Pig Version=0.9.1

Contributions: @prashant1784 (Prashant Kommireddi), Lars Hofhansl, @thefutureian (Ian Varley)

Big Data Use Cases
- Product Metrics
- User behavior analysis
- Capacity planning
- Monitoring intelligence
- Collections
- Query Runtime Prediction
- Early Warning System
- Collaborative Filtering
- Search Relevancy
Internal App / Product feature

Product Metrics

Product Metrics: Problem Statement
- Track feature usage/adoption across 130k+ customers. E.g.: Accounts, Contacts, Visualforce, Apex, ...
- Track standard metrics across all features. E.g.: #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, ...
- Track features and metrics across all channels: API, UI, Mobile
- Primary audience: Executives, Product Managers
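The standard metrics named above can be computed in a single aggregation pass over request logs. The sketch below is only an illustration in Python, not the pipeline from the talk: the record layout `(feature, channel, org_id, user_id, response_ms)` and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical log records: (feature, channel, org_id, user_id, response_ms).
LOGS = [
    ("Accounts", "UI", "org1", "u1", 120),
    ("Accounts", "UI", "org2", "u3", 100),
    ("Accounts", "API", "org1", "u2", 80),
    ("Apex", "API", "org2", "u3", 300),
]


def feature_metrics(logs):
    """Aggregate the four standard metrics per (feature, channel)."""
    acc = defaultdict(
        lambda: {"requests": 0, "orgs": set(), "users": set(), "total_ms": 0}
    )
    for feature, channel, org, user, ms in logs:
        m = acc[(feature, channel)]
        m["requests"] += 1
        m["orgs"].add(org)
        m["users"].add(user)
        m["total_ms"] += ms
    # Collapse the sets into counts and the running total into an average.
    return {
        key: {
            "requests": m["requests"],
            "unique_orgs": len(m["orgs"]),
            "unique_users": len(m["users"]),
            "avg_response_ms": m["total_ms"] / m["requests"],
        }
        for key, m in acc.items()
    }


metrics = feature_metrics(LOGS)
```

Keying on `(feature, channel)` keeps API, UI, and Mobile traffic separate; summing over channels would give the per-feature totals.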

Product Metrics Pipeline (diagram). Component labels: User Input (Page Layout); Collaboration (Chatter); Reports, Dashboards; Feature Metrics (Custom Object); Trend Metrics (Custom Object); API; Workflow; Formula Fields; API; Client Machine; Java Program; Pig script generator; Hadoop; Workflow; Log Pull; Log Files

Visualization (Reports & Dashboards)

Collaborate, Iterate (Chatter)

User Behavior Analysis

Problem Statement
- How do we reduce the number of clicks on the user interface?
- What are the top user click path sequences?
- What are the user clusters/personas?
Approach:
- Markov transitions for click paths, D3.js visuals
- K-means (unsupervised) clustering for user groups
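As a rough illustration of the Markov-transition approach, the sketch below estimates first-order transition probabilities from click paths by counting consecutive page pairs. The page names and sessions are invented for the example and are not data from the talk.

```python
from collections import Counter, defaultdict

# Hypothetical click paths, one list of page names per user session.
SESSIONS = [
    ["Home", "Setup", "Users", "Profiles"],
    ["Home", "Setup", "Users"],
    ["Home", "Setup", "Profiles"],
]


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed click paths."""
    counts = defaultdict(Counter)
    for path in sessions:
        # Each consecutive pair of pages is one observed transition.
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, nxt in counts.items():
        total = sum(nxt.values())
        probs[src] = {dst: n / total for dst, n in nxt.items()}
    return probs


P = transition_probabilities(SESSIONS)
```

Sorting each row of `P` by probability gives the most likely next click from a page, which is the kind of information a click-path visual would display.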

Markov Transitions for "Setup" pages

K-means clustering of "Setup" pages
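For readers unfamiliar with K-means, here is a minimal pure-Python sketch of the standard Lloyd iteration. The two-dimensional feature vectors (for instance, visits per week and distinct pages visited per user) and the starting centroids are assumptions for illustration; the talk does not say which features were clustered.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""

    def nearest(p, cs):
        # Index of the centroid closest to point p (Euclidean distance).
        return min(range(len(cs)), key=lambda i: math.dist(p, cs[i]))

    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = coordinate-wise mean; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members))
            if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical per-user feature vectors forming two obvious groups.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(POINTS, centroids=[(1, 1), (8, 8)])
```

Each resulting label is a candidate persona; in practice the number of clusters and the initialization would need tuning.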

Collaborative Filtering. Jed Crosby

Collaborative Filtering: Problem Statement
Show similar files within an organization
- Content-based approach
- Community-based approach

Popular File

Related File

We found this relationship using item-to-item collaborative filtering. Amazon published this algorithm in 2003: "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003. At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.

Example: CF on 5 files
- Annual Report
- Vision Statement
- Dilbert Comic
- Darth Vader Cartoon
- Disk Usage Report

View History Table
User           Annual Report  Vision Statement  Dilbert Cartoon  Darth Vader Cartoon  Disk Usage Report
Miranda (CEO)  1              1                 1                0                    0
Bob (CFO)      1              1                 1                0                    0
Susan (Sales)  0              1                 1                1                    0
Chun (Sales)   0              0                 1                1                    0
Alice (IT)     0              0                 1                1                    1

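Relationships between the files (number of users who viewed both)
Annual Report / Vision Statement: 2
Annual Report / Dilbert Cartoon: 2
Annual Report / Darth Vader Cartoon: 0
Annual Report / Disk Usage Report: 0
Vision Statement / Dilbert Cartoon: 3
Vision Statement / Darth Vader Cartoon: 1
Vision Statement / Disk Usage Report: 0
Dilbert Cartoon / Darth Vader Cartoon: 3
Dilbert Cartoon / Disk Usage Report: 1
Darth Vader Cartoon / Disk Usage Report: 1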

Sorted relationships for each file
Annual Report: Dilbert (2), Vision Stmt. (2)
Vision Statement: Dilbert (3), Annual Rpt. (2), Darth Vader (1)
Dilbert Cartoon: Vision Stmt. (3), Darth Vader (3), Annual Rpt. (2), Disk Usage (1)
Darth Vader Cartoon: Dilbert (3), Vision Stmt. (1), Disk Usage (1)
Disk Usage Report: Dilbert (1), Darth Vader (1)
The popularity problem: notice that Dilbert appears first in every list. This is probably not what we want.
The solution: divide the relationship tallies by file popularities.

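Normalized relationships between the files
Annual Report / Vision Statement: .82
Annual Report / Dilbert Cartoon: .63
Annual Report / Darth Vader Cartoon: 0
Annual Report / Disk Usage Report: 0
Vision Statement / Dilbert Cartoon: .77
Vision Statement / Darth Vader Cartoon: .33
Vision Statement / Disk Usage Report: 0
Dilbert Cartoon / Darth Vader Cartoon: .77
Dilbert Cartoon / Disk Usage Report: .45
Darth Vader Cartoon / Disk Usage Report: .58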

Sorted relationships for each file, normalized by file popularities
Annual Report: Vision Stmt. (.82), Dilbert (.63)
Vision Statement: Annual Report (.82), Dilbert (.77), Darth Vader (.33)
Dilbert Cartoon: Darth Vader (.77), Vision Stmt. (.77), Annual Report (.63), Disk Usage (.45)
Darth Vader Cartoon: Dilbert (.77), Disk Usage (.58), Vision Stmt. (.33)
Disk Usage Report: Darth Vader (.58), Dilbert (.45)
High relationship tallies AND similar popularity values now drive closeness.

The item-to-item CF algorithm
1) Compute file popularities
2) Compute relationship tallies and divide by file popularities
3) Sort and store the results

MapReduce Overview: Map, Shuffle, Reduce (adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
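The three phases can be mimicked in memory to make the data flow concrete. The helper below is a teaching sketch in plain Python, not Hadoop code: `map_reduce`, `wc_mapper`, and `wc_reducer` are names made up for this example, and word count is used only because it is the customary demonstration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, shuffle, and reduce over an in-memory list of records."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: values with the same key are grouped together.
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        # Reduce: each key and its list of values yield output pairs.
        results.extend(reducer(key, groups[key]))
    return results


def wc_mapper(line):
    for word in line.split():
        yield word, 1


def wc_reducer(word, ones):
    yield word, sum(ones)


counts = dict(map_reduce(["a b a", "b c"], wc_mapper, wc_reducer))
```

A real cluster distributes the map and reduce calls across machines and performs the shuffle over the network; the grouping contract is the same.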

1. Compute File Popularities
Input: <user, file>
Inverse identity map: <file, List<user>>
Reduce: <file, (user count)>
Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.

Example: File popularity for Dilbert
Input: (Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)
Inverse identity map: <Dilbert, {Miranda, Bob, Susan, Chun, Alice}>
Reduce: (Dilbert, 5)
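Step 1 can be sketched in a few lines of Python run against the full view history from the example. This is an in-memory stand-in for the Hadoop job, and counting each user at most once per file (via a set) is an assumption of the sketch.

```python
from collections import defaultdict

# (user, file) pairs from the example view history table.
VIEWS = [
    ("Miranda", "Annual Report"), ("Miranda", "Vision Statement"), ("Miranda", "Dilbert"),
    ("Bob", "Annual Report"), ("Bob", "Vision Statement"), ("Bob", "Dilbert"),
    ("Susan", "Vision Statement"), ("Susan", "Dilbert"), ("Susan", "Darth Vader"),
    ("Chun", "Dilbert"), ("Chun", "Darth Vader"),
    ("Alice", "Dilbert"), ("Alice", "Darth Vader"), ("Alice", "Disk Usage"),
]


def file_popularities(views):
    """Inverse identity map (<user, file> to <file, user>), then count users."""
    users_by_file = defaultdict(set)
    for user, file in views:
        users_by_file[file].add(user)  # map + shuffle: group users by file
    # Reduce: the popularity of a file is its number of distinct viewers.
    return {file: len(users) for file, users in users_by_file.items()}


popularity = file_popularities(VIEWS)
```

The resulting dictionary plays the role of the (file, popularity) table that the slide places in the distributed cache.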

2a. Compute relationship tallies: find all relationships in the view history table
Input: <user, file>
Identity map: <user, List<file>>
Reduce: <(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, ..., <(file(n-1), file(n)), Integer(1)>
Relationships have their file IDs in alphabetical order to avoid double counting.

Example 2a: Miranda's (CEO) file relationship votes
Input: (Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)
Identity map: <Miranda, {Annual Report, Vision Statement, Dilbert}>
Reduce: <(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>
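The vote-generation step for one user can be sketched as follows. It is an in-memory Python illustration rather than the Hadoop reducer itself; sorting the file names before pairing is what keeps each relationship key in alphabetical order.

```python
from collections import defaultdict
from itertools import combinations


def relationship_votes(views):
    """Emit ((file_a, file_b), 1) for every pair of files a user viewed."""
    files_by_user = defaultdict(set)
    for user, file in views:
        files_by_user[user].add(file)  # identity map + shuffle by user
    votes = []
    for user in sorted(files_by_user):
        # Sorted input makes each pair alphabetical, so (A, B) and (B, A)
        # can never both appear and nothing is double counted.
        for pair in combinations(sorted(files_by_user[user]), 2):
            votes.append((pair, 1))
    return votes


MIRANDA = [
    ("Miranda", "Annual Report"),
    ("Miranda", "Vision Statement"),
    ("Miranda", "Dilbert"),
]
votes = relationship_votes(MIRANDA)
```

A user with n files contributes n(n-1)/2 votes, which is where the quadratic worst case of the algorithm comes from.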

2b. Tally the relationship votes: just a word count, where each relationship occurrence is a word
Input: <(file1, file2), Integer(1)>
Identity map: <(file1, file2), List<Integer(1)>>
Reduce (count and divide by popularities): <file1, (file2, similarity score)>, <file2, (file1, similarity score)>
Note that we emit each result twice, once for each file that belongs to a relationship.

Example 2b: the Dilbert/Darth Vader relationship
Input: <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>
Identity map: <(Dilbert, Vader), {1, 1, 1}>
Reduce (count and divide by popularities): <Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>
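The tally-and-normalize step can be sketched as below. Dividing the count by the square root of the product of the two popularities is inferred from the sqrt(3/5) result in the example (3 shared viewers, popularities 5 and 3); the plain dictionary stands in for the popularity table held in the distributed cache.

```python
import math
from collections import Counter


def similarity_scores(votes, popularity):
    """Count votes per relationship, normalize, and emit once per file."""
    tallies = Counter()
    for pair, one in votes:
        tallies[pair] += one  # the "word count" over relationship keys
    scored = []
    for (f1, f2), count in sorted(tallies.items()):
        score = count / math.sqrt(popularity[f1] * popularity[f2])
        # Emit the result under both files so each can list the other.
        scored.append((f1, (f2, score)))
        scored.append((f2, (f1, score)))
    return scored


votes = [(("Dilbert", "Vader"), 1)] * 3
popularity = {"Dilbert": 5, "Vader": 3}
scored = similarity_scores(votes, popularity)
```

Here 3 / sqrt(5 * 3) equals sqrt(3/5), roughly 0.77, which matches the normalized value shown for this pair.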

3. Sort and store results
Input: <file1, (file2, similarity score)>
Identity map: <file1, List<(file2, similarity score)>>
Reduce: <file1, {top n similar files}>
Store the results in your location of choice.

Example 3: Sorting the results for Dilbert
Input: <Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>
Identity map: <Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>
Reduce: <Dilbert, {Darth Vader, Vision Statement}> (top 2 files)
Store results
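The final step amounts to a group-by followed by a top-n selection, sketched here in Python. Breaking ties alphabetically by file name is an assumption of this sketch; the slides do not say how equal scores are ordered.

```python
from collections import defaultdict


def top_n_similar(scored, n):
    """Group (file, (neighbor, score)) pairs and keep the n best neighbors."""
    by_file = defaultdict(list)
    for f1, (f2, score) in scored:
        by_file[f1].append((f2, score))
    return {
        f1: [
            f2
            # Highest score first; equal scores fall back to name order.
            for f2, _ in sorted(neighbors, key=lambda x: (-x[1], x[0]))[:n]
        ]
        for f1, neighbors in by_file.items()
    }


scored = [
    ("Dilbert", ("Annual Report", 0.63)),
    ("Dilbert", ("Vision Statement", 0.77)),
    ("Dilbert", ("Disk Usage", 0.45)),
    ("Dilbert", ("Darth Vader", 0.77)),
]
top = top_n_similar(scored, 2)
```

With these inputs the two retained neighbors for Dilbert are Darth Vader and Vision Statement, as in the example.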

Appendix
Cosine formula and normalization trick to avoid the distributed cache:
$$\cos\theta_{AB} = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|}$$
Mahout has CF.
Asymptotic order of the algorithm is O(M*N^2) in the worst case, but is helped by sparsity.
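As a worked check of the cosine formula against the earlier example, treat each file as a 0/1 vector over the five users (Miranda, Bob, Susan, Chun, Alice). The dot product is then the number of shared viewers and each squared norm is the file's popularity. The reading of the "normalization trick" in the last line is an interpretation, not something the slide spells out.

```latex
\begin{align*}
A &= (1,1,1,1,1) && \text{(Dilbert, viewed by all five users)} \\
B &= (0,0,1,1,1) && \text{(Darth Vader, viewed by Susan, Chun, Alice)} \\
A \cdot B &= 3, \qquad \|A\| = \sqrt{5}, \qquad \|B\| = \sqrt{3} \\
\cos\theta_{AB} &= \frac{3}{\sqrt{5}\,\sqrt{3}} = \sqrt{\tfrac{3}{5}} \approx 0.77 \\
\frac{A}{\|A\|} \cdot \frac{B}{\|B\|}
  &= 3 \cdot \frac{1}{\sqrt{5}} \cdot \frac{1}{\sqrt{3}} \approx 0.77
  && \text{(scale each view first, then sum)}
\end{align*}
```

If every view is scaled by one over the square root of its file's popularity before pairing, summing the products per relationship gives the cosine directly, so the reducer would not need to look up popularities.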

Narayan Bharadwaj Director, Product Management @nadubharadwaj