Big Data Analytics Verizon Lab, Palo Alto



Similar documents
Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014

Big Data Analytics. Lucas Rego Drumond

Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

DATA ANALYSIS II. Matrix Algorithms

! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

IPTV Recommender Systems. Paolo Cremonesi

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo

Similarity Search in a Very Large Scale Using Hadoop and HBase

Report: Declarative Machine Learning on MapReduce (SystemML)

1. Classification problems

Matrix Multiplication

Probabilistic Latent Semantic Analysis (plsa)

Content-Based Recommendation

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

Streamdrill: Analyzing Big Data Streams in Realtime

Distributed R for Big Data

Distributed Computing and Big Data: Hadoop and MapReduce

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

13 MATH FACTS a = The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.

Collaborative Filtering Scalable Data Analysis Algorithms Claudia Lehmann, Andrina Mascher

Forschungskolleg Data Analytics Methods and Techniques

Machine Learning Final Project Spam Filtering

Collaborative Filtering. Radek Pelánek

Factorization Machines

Lecture 2 Matrix Operations

SOLVING LINEAR SYSTEMS

Recommendations in Mobile Environments. Professor Hui Xiong Rutgers Business School Rutgers University. Rutgers, the State University of New Jersey

Mining a Corpus of Job Ads

The Need for Training in Big Data: Experiences and Case Studies

Linear Algebra: Vectors

Latent Semantic Indexing with Selective Query Expansion Abstract Introduction

Lecture 3: Models of Spatial Information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Torgerson s Classical MDS derivation: 1: Determining Coordinates from Euclidean Distances

DocStore: Document Database for MySQL at Facebook. Peng Tian, Tian Xia 04/14/2015

Big-data Analytics: Challenges and Opportunities

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

CUDAMat: a CUDA-based matrix class for Python

6. Cholesky factorization

A Collaborative Filtering Recommendation Algorithm Based On User Clustering And Item Clustering

Hybrid Programming with MPI and OpenMP

Social Media Mining. Data Mining Essentials

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Fast Analytics on Big Data with H20

CASH FLOW MATCHING PROBLEM WITH CVaR CONSTRAINTS: A CASE STUDY WITH PORTFOLIO SAFEGUARD. Danjue Shang and Stan Uryasev

MLlib: Scalable Machine Learning on Spark

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Manifold Learning Examples PCA, LLE and ISOMAP

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Linear Programming. March 14, 2014

Matrix Algebra in R A Minimal Introduction

MATLAB and Big Data: Illustrative Example

Linear Algebra Review. Vectors

Deal with big data in R using bigmemory package

Matrix Calculations: Applications of Eigenvalues and Eigenvectors; Inner Products

Big Data and Scripting map/reduce in Hadoop

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Parameterizable benchmarking framework for designing a MapReduce performance model

Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line.

1 Review of Least Squares Solutions to Overdetermined Systems

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Chapter 17. Orthogonal Matrices and Symmetries of Space

Big Data looks Tiny from the Stratosphere

Important Notice. (c) Cloudera, Inc. All rights reserved.

Big Data Summarization Using Semantic. Feture for IoT on Cloud

Extreme Machine Learning with GPUs John Canny

Soft Clustering with Projections: PCA, ICA, and Laplacian

Parquet. Columnar storage for the people

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

CSE 6040 Computing for Data Analytics: Methods and Tools

Journée Thématique Big Data 13/03/2015

Quantifind s story: Building custom interactive data analytics infrastructure

CA Clarity PPM. Financial Management User Guide. v

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

How To Cluster

Blacklisting and Blocking

Mathematics Course 111: Algebra I Part IV: Vector Spaces

Spark and the Big Data Library

Data Mining for Web Personalization

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Linear Algebra Notes for Marsden and Tromba Vector Calculus

Maximum-Margin Matrix Factorization

Java Virtual Machine: the key for accurated memory prefetching

Presto/Blockus: Towards Scalable R Data Analysis

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Main Memory Map Reduce (M3R)

Transcription:

Spark Meetup Big Data Analytics Verizon Lab, Palo Alto July 28th, 2015 Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners.

Similarity Computation

Similarity Computation Flows Column based flow for tall-skinny matrices (60 M users, 100K items) Mapper: emit (item-i, item-j), score-ij Reducer: reduce over (item-i, item-j) to get similarity-ij Spark 1.2 RowMatrix.columnSimilarities! Row based flow https://issues.apache.org/jira/browse/spark-4823 Column similarity in tall-wide matrices 60M users,1m-10m items from advertising use-cases Kernel generation for tall-skinny matrices 60M users, 50-400 latent factors from advertising use-cases 10M devices, skinny features from IoT use-cases

Row Based Flow Preprocess Column similarity in tall-wide matrices : Transpose data matrix Kernel generation for tall-skinny matrices : Input data matrix Algorithm Distributed matrix multiply using blocked cartesian pattern Shuffle space control using topk and similarity threshold User specified kernel for vector dot product Supported kernels: Cosine, Euclidean, RBF, ScaledProduct Code optimization Norm caching for efficiency (kernel abstraction differ from scikit-learn) DGEMM for dense vectors : Spark 1.4 recommendforall BLAS.dot for sparse vectors : https://github.com/apache/spark/pull/6213

Kernel Examples CosineKernel: item->item similarity case class CosineKernel(rowNorms: Map[Long, Double], threshold: Double) extends Kernel { override def compute(vi: Vector, indexi: Long, vj: Vector, indexj: Long): Double = { val similarity = BLAS.dot(vi, vj) / rownorms(indexi) / rownorms(indexj) if (similarity <= threshold) return 0.0 similarity } } ScaledProductKernel: memory based recommendation! case class ScaledProductKernel(rowNorms: Map[Long, Double]) extends Kernel { override def compute(vi: Vector, indexi: Long, vj: Vector, indexj: Long): Double = { BLAS.dot(vi, vj) / rownorms(indexi) } }

Runtime Analysis Dataset Details col 1e-2 row 1e-2 ML-1M ML-10M ML-20M Netflix 900 ratings 1M 10M 20M 100M users 6040 69878 138493 480189 675 items 3706 10677 26744 17770 Production Examples Data matrix: 60 M x 2.5 M minsupport: 500 itemthreshold: 1000 Runtime: ~ 4 hrs Runtime (s) 450 225 0 ML-1M ML-10M ML-20M Netflix items

Shuffle Write Analysis 40000 col 1e-2 row 1e-2 Shuffle Write (MB) 30000 20000 10000 0 ML-1M ML-10M ML-20M Netflix movies

TopK Shuffle Write Analysis For Row Based Flow ML-1M ML-10M ML-20M Netflix 4000 Shuffle Write (MB) 3000 2000 1000 0 50 100 200 400 1000 row 1e-2 topk

!! Recommendation Engine

Memory based: knn based recommendation algorithm using similarity engine Model based: ALS based implicit feedback formulation Datasets MovieLens 1M Netflix Mapped ratings to binary features for comparison Evaluate recommendation performance using RMSE Recommendation Algorithms Precision @ k Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners. #

knn Based Formulation Predicted rating val similaritems = SimilarityMatrix.rowSimilarities(! itemfeatures,! numneighbors, threshold)! val kernel = new ScaledProductKernel(rowNorms)! val recommendation = SimilarityMatrix.multiply( similaritems, userfeatures,! kernel, k) Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners. #

ALS Formulation Implicit feedback datasets: Unobserved items are considered 0 (implicit feedback) Minimize! Needs Gram matrix aggregation for 0-ratings val als = new ALSQp().setRank(params.rank).setIterations(params. numiterations).setuserconstraint(constraints.smooth).setitemconstraint(constraints.smooth).setimplicitprefs(true).setlambda(params.lambda).setalpha(params.alpha)! val mfmodel = als.run(training) RankingUtils.recommendItemsForUsers(mfModel, k, skipitems) Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners. #

Comparing knn and ALS on RMSE RMSE on MovieLens 0.62 RMSE on Netflix 0.661 0.571 0.561 knn 30 neighbors ALS knn 30 neighbors ALS Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners. #

Comparing knn and ALS on Prec@k (Netflix) Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners. #

! Segmentation Engine

Segmentation Feature Extraction Input data contains location and time information along with other features Extract time-unit features for each location (zip code) Raw data Sparse Website Matrix Id Time (Hour) Zip Code websites abc 10 94301 website1 abc 15 94085 website2 94301 # of hours (1-24) 94085 # of hours (1-24) website1 website2 # of hours (1-24) # of hours (1-24) def 10 94301 website1........ Column Count Zip codes 31516 Websites 11646 Ratings 45M

ALS with Positive Constraints ZipCode x Website Sparse Matrix W T H k m n m n k Each column of H represent Website factors minimize Each row of W T represent ZipCode factors What do columns of W T and rows of H represent?

Segment Analysis I Local websites Global websites

Segment Analysis II Segments Most factors display geographic affinity.

Use ALSQp for Nonnegative Matrix Factorization val als = new ALSQp().setRank(params.rank).setIterations(params. numiterations).setuserconstraint(constraints.positive).setitemconstraint(constraints.positive).setimplicitprefs(true)!.setlambda(params.lambda)! val mfmodel = als.run(training) Other constraints:!.setitemconstraint(constraints.simplex) // 1 T w = s, w>=0 and s - constant https://github.com/apache/spark/pull/3221

Q and A