BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE
|
|
|
- Octavia Bryant
- 10 years ago
- Views:
Transcription
1 BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE Alex Lin Senior Architect Intelligent Mining [email protected]
2 Outline Predictive modeling methodology k-nearest Neighbor (knn) algorithm Singular value decomposition (SVD) method for dimensionality reduction Using a synthetic data set to test and improve your model Experiment and results 2
3 The Business Problem Design product recommender solution that will increase revenue. $$ 3
4 How Do We Increase Revenue? Increase Revenue Increase Conversion Increase Avg. Order Value Increase Unit Price Increase Units / Order 4
5 Example Is this recommendation effective? Increase Unit Price Increase Units / Order 5
6 What am I going to do? 6
7 Predictive Model Framework Data Features ML Algorithm Prediction Output What data? What feature? Which Algorithm? Cross-sell & Up-sell Recommendation 7
8 What Data to Use? Explicit data Ratings Comments Implicit data Order history / Return history Cart events Page views Click-thru Search log In today s talk we only use Order history and Cart events 8
9 Predictive Model Data Features ML Algorithm Prediction Output Order History Cart Events What feature? Which Algorithm? Cross-sell & Up-sell Recommendation 9
10 What Features to Use? We know that a given product tends to get purchased by customers with similar tastes or needs. Use user engagement data to describe a product. users n item user engagement vector 0
11 Data Representation / Features When we merge every item s user engagement vector, we got a m x n item-user matrix users n items m
12 Data Normalization Ensure the magnitudes of the entries in the dataset matrix are appropriate users n items m.52.8 Remove column average so frequent buyers don t dominate the model 2
13 Data Normalization Different engagement data points (Order / Cart / Page View) should have different weights Common normalization strategies: Remove column average Remove row average Remove global mean Z-score Fill-in the null values 3
14 Predictive Model Data Features ML Algorithm Prediction Output Order History Cart Events User engagement vector Which Algorithm? Cross-sell & Up-sell Recommendation Data Normalization 4
15 Which Algorithm? How do we find the items that have similar user engagement data? users n.25 2 items m.25 We can find the items that have similar user engagement vectors with knn algorithm 5
16 k-nearest Neighbor (knn) Find the k items that have the most similar user engagement vectors users n.5 items m.5 Nearest Neighbors of Item 4 = [2,3,] 6
17 Similarity Measure for knn users n items 2 4 Jaccard coefficient: Cosine similarity: sim(a,b) = cos(a,b) = Pearson Correlation: corr(a,b) = = a.5 sim(a,b) = i a b 2 b 2 =.5 (+) (++) + (+++) (+) (r ai r a )(r bi r b ) (r ai r a ) 2 (r i bi r b ) 2 i (*+ 0.5 *) ( ) * ( ) = m a i b i a i b i m a 2 i ( a i ) 2 m b 2 i ( b i ) 2 match _ cols* Dotprod(a,b) sum(a) * sum(b) match _cols* sum(a 2 ) (sum(a)) 2 match _ cols* sum(b 2 ) (sum(b)) 2 7
18 k-nearest Neighbor (knn) feature space 9 7 Item Similarity Measure (cosine similarity) knn k=5 Nearest Neighbors(8) = [9,6,3,,2] 8
19 Predictive Model Ver. : knn Data Features ML Algorithm Prediction Output Order History Cart Events User engagement vector k-nearest Neighbor (knn) Cross-sell & Up-sell Recommendation Data Normalization 9
20 Cosine Similarity Code fragment long i_cnt = 00000; // number of items 00K long u_cnt = ; // number of users 2M double data[i_cnt][u_cnt]; // 00K by 2M dataset matrix (in reality, it needs to be malloc allocation) double norm[i_cnt]; // assume data matrix is loaded // calculate vector norm for each user engagement vector for (i=0; i<i_cnt; i++) { norm[i] = 0; for (f=0; f<u_cnt; f++) { norm[i] += data[i][f] * data [i][f]; }. 00K rows x 00K rows x 2M features --> scalability problem norm[i] = sqrt(norm[i]); kd-tree, Locality sensitive hashing, } MapReduce/Hadoop, Multicore/Threading, Stream Processors // cosine similarity calculation 2. data[i] is high-dimensional and sparse, similarity measures for (i=0; i<i_cnt; i++) { // loop thru 00K are not reliable --> accuracy problem for (j=0; j<i_cnt; j++) { // loop thru 00K This leads us to The SVD dimensionality reduction! dot_product = 0; for (f=0; f<u_cnt; f++) { // loop thru entire user space 2M dot_product += data[i][f] * data[j][f]; } printf( %d %d %lf\n, i, j, dot_product/(norm[i] * norm[j])); } 20 // find the Top K nearest neighbors here.
21 Singular Value Decomposition (SVD) A = U S V T A m x n matrix U m x r matrix S r x r matrix V T r x n matrix items items rank = k k < r users A k = U k S k V k T users users Low rank approx. Item profile is Low rank approx. User profile is U k * S k S k *V k T items 2 Low rank approx. Item-User matrix is U k * S k * S k *V k T
22 Reduced SVD A k = U k S k V k T A k 00K x 2M matrix U k 00K x 3 matrix S k 3 x 3 matrix V k T 3 x 2M matrix items items 0 0 rank = 3 users Descending Singular Values users Low rank approx. Item profile is U k * S k 22
23 SVD Factor Interpretation Singular values plot (rank=52) S 3 x 3 matrix Descending Singular Values More Significant Latent Factors Noises + Others 23 Less Significant
24 SVD Dimensionality Reduction U k * <----- latent factors -----> S k # of users items 3 rank Need to find the most optimal low rank!! 0 24
25 Missing values Difference between 0 and unknown Missing values do NOT appear randomly. Value = (Preference Factors) + (Availability) (Purchased elsewhere) (Navigation inefficiency) etc. Approx. Value = (Preference Factors) +/- (Noise) Modeling missing values correctly will help us make good recommendations, especially when working with an extremely sparse data set 25
26 Singular Value Decomposition (SVD) Use SVD to reduce dimensionality, so neighborhood formation happens in reduced user space SVD helps model to find the low rank approx. dataset matrix, while retaining the critical latent factors and ignoring noise. Optimal low rank needs to be tuned SVD is computationally expensive SVD Libraries: Matlab [U, S, V] = svds(a,256); SVDPACKC SVDLIBC GHAPACK 26
27 Predictive Model Ver. 2: SVD+kNN Data Features ML Algorithm Prediction Output Order History Cart Events User engagement vector k-nearest Neighbors (knn) in reduced space Cross-sell & Up-sell Recommendation Data Normalization SVD 27
28 Synthetic Data Set Why do we use synthetic data set? So we can test our new model in a controlled environment 28
29 Synthetic Data Set 6 latent factors synthetic e-commerce data set Dimension:,000 (items) by 20,000 (users) 6 user preference factors 6 item property factors (non-negative) Txn Set: n = 55,360 sparsity = % Txn+Cart Set: n = 92,985 sparsity = 99.03% Download: user_id item_id type
30 Synthetic Data Set Item property factors K x 6 matrix a b c User preference factors 6 x 20K matrix x y z items Purchase Likelihood score K x 20K matrix X X 2 X 3 X 4 X 5 X 6 X 2 X 22 X 2 X 24 X 25 X 26 X 3 X 32 X 33 X 34 X 35 X 36 X 4 X 42 X 43 X 44 X 45 X 46 X 5 X 52 X 53 X 54 X 55 X 56 users X 32 = (a, b, c). (x, y, z) = a * x + b * y + c * z X 32 = Likelihood of Item 3 being purchased by User 2 30
31 Synthetic Data Set X X 2 X 3 X 4 Based on the distribution, pre-determine # of items purchased by an user (# of item=2) X 5 X 4 X 3 X 4 Sort by Purchase likelihood Score X 2 X 5 From the top, select and skip certain items to create data sparsity. X 3 X 2 X 5 X X User purchased Item 4 and Item 3
32 Experiment Setup Each model (Random / knn / SVD+kNN) will generate top 20 recommendations for each item. Compare model output to the actual top 20 provided by synthetic data set Evaluation Metrics : Precision %: Overlapping of the top 20 between model output and actual (higher the better) Precision = {Found _ Top20_items} {Actual _ Top20_items} {Found _Top20_items} Quality metric: Average of the actual ranking in the model output (lower the better)
33 Experimental Result knn vs. Random (Control) Precision % (higher is better) Quality (Lower is better) 33
34 Experimental Result Precision % of SVD+kNN Recall % (higher is better) Improvement SVD Rank 34
35 Experimental Result Quality of SVD+kNN Quality (Lower is better) Improvement 35 SVD Rank
36 Experimental Result The effect of using Cart data Precision % (higher is better) SVD Rank 36
37 Experimental Result The effect of using Cart data Quality (Lower is better) SVD Rank 37
38 Outline Predictive modeling methodology k-nearest Neighbor (knn) algorithm Singular value decomposition (SVD) method for dimensionality reduction Using a synthetic data set to test and improve your model Experiment and results 38
39 References J.S. Breese, D. Heckerman and C. Kadie, "Empirical Analysis of Predictive Algorithms for Collaborative Filtering," in Proceedings of the Fourteenth Conference on Uncertainity in Artificial Intelligence (UAI 998), 998. B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the Tenth International Conference on the World Wide Web (WWW 0), pp , 200. B. Sarwar, G. Karypis, J. Konstan, and J. Riedl "Application of Dimensionality Reduction in Recommender System A Case Study" In ACM WebKDD 2000 Web Mining for E-Commerce Workshop Apache Lucene Mahout Cofi: A Java-Based Collaborative Filtering Library 39
40 Thank you Any question or comment? 40
! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II
! E6893 Big Data Analytics Lecture 5:! Big Data Analytics Algorithms -- II Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science Mgr., Dept. of Network Science and
Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering
Recommender Systems for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering Badrul M Sarwar,GeorgeKarypis, Joseph Konstan, and John Riedl {sarwar, karypis, konstan, riedl}@csumnedu
Application of Dimensionality Reduction in Recommender System -- A Case Study
Application of Dimensionality Reduction in Recommender System -- A Case Study Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John T. Riedl GroupLens Research Group / Army HPC Research Center Department
RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING
= + RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015 KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data sparsity:
PREA: Personalized Recommendation Algorithms Toolkit
Journal of Machine Learning Research 13 (2012) 2699-2703 Submitted 7/11; Revised 4/12; Published 9/12 PREA: Personalized Recommendation Algorithms Toolkit Joonseok Lee Mingxuan Sun Guy Lebanon College
Automated Collaborative Filtering Applications for Online Recruitment Services
Automated Collaborative Filtering Applications for Online Recruitment Services Rachael Rafter, Keith Bradley, Barry Smyth Smart Media Institute, Department of Computer Science, University College Dublin,
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
Predicting User Preference for Movies using NetFlix database
Predicting User Preference for Movies using NetFlix database Dhiraj Goel and Dhruv Batra Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 {dgoel,dbatra}@ece.cmu.edu
IPTV Recommender Systems. Paolo Cremonesi
IPTV Recommender Systems Paolo Cremonesi Agenda 2 IPTV architecture Recommender algorithms Evaluation of different algorithms Multi-model systems Valentino Rossi 3 IPTV architecture 4 Live TV Set-top-box
Collaborative Filtering. Radek Pelánek
Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains
APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
International Journal of Innovative Research in Computer and Communication Engineering
Achieve Ranking Accuracy Using Cloudrank Framework for Cloud Services R.Yuvarani 1, M.Sivalakshmi 2 M.E, Department of CSE, Syed Ammal Engineering College, Ramanathapuram, India ABSTRACT: Building high
Recommendation Tool Using Collaborative Filtering
Recommendation Tool Using Collaborative Filtering Aditya Mandhare 1, Soniya Nemade 2, M.Kiruthika 3 Student, Computer Engineering Department, FCRIT, Vashi, India 1 Student, Computer Engineering Department,
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
Recommender Systems. User-Facing Decision Support Systems. Michael Hahsler
Recommender Systems User-Facing Decision Support Systems Michael Hahsler Intelligent Data Analysis Lab (IDA@SMU) CSE, Lyle School of Engineering Southern Methodist University EMIS 5/7357: Decision Support
Achieve Better Ranking Accuracy Using CloudRank Framework for Cloud Services
Achieve Better Ranking Accuracy Using CloudRank Framework for Cloud Services Ms. M. Subha #1, Mr. K. Saravanan *2 # Student, * Assistant Professor Department of Computer Science and Engineering Regional
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
The Application of Data-Mining to Recommender Systems
The Application of Data-Mining to Recommender Systems J. Ben Schafer, Ph.D. University of Northern Iowa INTRODUCTION In a world where the number of choices can be overwhelming, recommender systems help
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
Seventh IEEE International Conference on Data Mining Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights Robert M. Bell and Yehuda Koren AT&T Labs Research 180 Park
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
Determining optimum insurance product portfolio through predictive analytics BADM Final Project Report
2012 Determining optimum insurance product portfolio through predictive analytics BADM Final Project Report Dinesh Ganti(61310071), Gauri Singh(61310560), Ravi Shankar(61310210), Shouri Kamtala(61310215),
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
A Collaborative Filtering Recommendation Algorithm Based On User Clustering And Item Clustering
A Collaborative Filtering Recommendation Algorithm Based On User Clustering And Item Clustering GRADUATE PROJECT TECHNICAL REPORT Submitted to the Faculty of The School of Engineering & Computing Sciences
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek
Recommender Systems: Content-based, Knowledge-based, Hybrid Radek Pelánek 2015 Today lecture, basic principles: content-based knowledge-based hybrid, choice of approach,... critiquing, explanations,...
Comparison of Standard and Zipf-Based Document Retrieval Heuristics
Comparison of Standard and Zipf-Based Document Retrieval Heuristics Benjamin Hoffmann Universität Stuttgart, Institut für Formale Methoden der Informatik Universitätsstr. 38, D-70569 Stuttgart, Germany
A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis
A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis Yusuf Yaslan and Zehra Cataltepe Istanbul Technical University, Computer Engineering Department, Maslak 34469 Istanbul, Turkey
How To Cluster Of Complex Systems
Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving
Accurate is not always good: How Accuracy Metrics have hurt Recommender Systems
Accurate is not always good: How Accuracy Metrics have hurt Recommender Systems Sean M. McNee [email protected] John Riedl [email protected] Joseph A. Konstan [email protected] Copyright is held by the
Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]
Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
User Data Analytics and Recommender System for Discovery Engine
User Data Analytics and Recommender System for Discovery Engine Yu Wang Master of Science Thesis Stockholm, Sweden 2013 TRITA- ICT- EX- 2013: 88 User Data Analytics and Recommender System for Discovery
The Need for Training in Big Data: Experiences and Case Studies
The Need for Training in Big Data: Experiences and Case Studies Guy Lebanon Amazon Background and Disclaimer All opinions are mine; other perspectives are legitimate. Based on my experience as a professor
Recommending News Articles using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1
Paper 1886-2014 Recommending News s using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1 1 GE Capital Retail Finance, 2 Warwick Business School ABSTRACT Predicting news articles
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
Predictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel Yahoo! Research New York, NY 10018 [email protected] John Langford Yahoo! Research New York, NY 10018 [email protected] Alex Strehl Yahoo! Research New York,
Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics
INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree
Lecture #2. Algorithms for Big Data
Additional Topics: Big Data Lecture #2 Algorithms for Big Data Joseph Bonneau [email protected] April 30, 2012 Today's topic: algorithms Do we need new algorithms? Quantity is a quality of its own Joseph
Big Data Analytics Verizon Lab, Palo Alto
Spark Meetup Big Data Analytics Verizon Lab, Palo Alto July 28th, 2015 Copyright 2015 Verizon. All Rights Reserved. Information contained herein is provided AS IS and subject to change without notice.
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
Statistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015
Clustering Big Data Efficient Data Mining Technologies J Singh and Teresa Brooks June 4, 2015 Hello Bulgaria (http://hello.bg/) A website with thousands of pages... Some pages identical to other pages
W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015
W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction
Probabilistic Latent Semantic Analysis (plsa)
Probabilistic Latent Semantic Analysis (plsa) SS 2008 Bayesian Networks Multimedia Computing, Universität Augsburg [email protected] www.multimedia-computing.{de,org} References
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
RECOMMENDATION SYSTEM
RECOMMENDATION SYSTEM October 8, 2013 Team Members: 1) Duygu Kabakcı, 1746064, [email protected] 2) Işınsu Katırcıoğlu, 1819432, [email protected] 3) Sıla Kaya, 1746122, [email protected]
International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles available online http://www.ijoer.
REVIEW ARTICLE ISSN: 2321-7758 UPS EFFICIENT SEARCH ENGINE BASED ON WEB-SNIPPET HIERARCHICAL CLUSTERING MS.MANISHA DESHMUKH, PROF. UMESH KULKARNI Department of Computer Engineering, ARMIET, Department
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
WEEK #3, Lecture 1: Sparse Systems, MATLAB Graphics
WEEK #3, Lecture 1: Sparse Systems, MATLAB Graphics Visualization of Matrices Good visuals anchor any presentation. MATLAB has a wide variety of ways to display data and calculation results that can be
Contact Recommendations from Aggegrated On-Line Activity
Contact Recommendations from Aggegrated On-Line Activity Abigail Gertner, Justin Richer, and Thomas Bartee The MITRE Corporation 202 Burlington Road, Bedford, MA 01730 {gertner,jricher,tbartee}@mitre.org
Search Engines. Stephen Shaw <[email protected]> 18th of February, 2014. Netsoc
Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,
PRODUCTS AND SERVICES RECOMMENDATION SYSTEMS IN E-COMMERCE. RECOMMENDATION METHODS, ALGORITHMS, AND MEASURES OF THEIR EFFECTIVENESS
INFORMATYKA EKONOMICZNA BUSINESS INFORMATICS 1(31) 2014 ISSN 1507-3858 Mirosława Lasek, Dominik Kosieradzki University of Warsaw PRODUCTS AND SERVICES RECOMMENDATION SYSTEMS IN E-COMMERCE. RECOMMENDATION
Nonorthogonal Decomposition of Binary Matrices for Bounded-Error Data Compression and Analysis
Nonorthogonal Decomposition of Binary Matrices for Bounded-Error Data Compression and Analysis MEHMET KOYUTÜRK and ANANTH GRAMA Department of Computer Sciences, Purdue University and NAREN RAMAKRISHNAN
Data Mining: A Preprocessing Engine
Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,
The Wisdom of the Few
The Wisdom of the Few A Collaborative Filtering Approach Based on Expert Opinions from the Web Xavier Amatriain Telefonica Research Via Augusta, 177 Barcelona 08021, Spain [email protected] Haewoon Kwak KAIST
Lecture 5: Singular Value Decomposition SVD (1)
EEM3L1: Numerical and Analytical Techniques Lecture 5: Singular Value Decomposition SVD (1) EE3L1, slide 1, Version 4: 25-Sep-02 Motivation for SVD (1) SVD = Singular Value Decomposition Consider the system
CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace
Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.
Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Clustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
recommenderlab: A Framework for Developing and Testing Recommendation Algorithms
recommenderlab: A Framework for Developing and Testing Recommendation Algorithms Michael Hahsler Southern Methodist University Abstract The problem of creating recommendations given a large data base from
2. K-Nearest Neighbors Classifier. 1. Introduction. Paper
Paper A k-nearest Neighbors Method for Classifying User Sessions in E-Commerce Scenario Grażyna Suchacka 1, Magdalena Skolimowska-Kulig 1, and Aneta Potempa 2 1 Institute of Mathematics and Informatics,
Big Data and Scripting Systems build on top of Hadoop
Big Data and Scripting Systems build on top of Hadoop 1, 2, Pig/Latin high-level map reduce programming platform interactive execution of map reduce jobs Pig is the name of the system Pig Latin is the
Mining an Online Auctions Data Warehouse
Proceedings of MASPLAS'02 The Mid-Atlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance
Model Selection. Introduction. Model Selection
Model Selection Introduction This user guide provides information about the Partek Model Selection tool. Topics covered include using a Down syndrome data set to demonstrate the usage of the Partek Model
Towards better accuracy for Spam predictions
Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 [email protected] Abstract Spam identification is crucial
E-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee [email protected] Seunghee Ham [email protected] Qiyi Jiang [email protected] I. INTRODUCTION Due to the increasing popularity of e-commerce
Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics
Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University
Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz
Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz Luigi Di Caro 1, Vanessa Frias-Martinez 2, and Enrique Frias-Martinez 2 1 Department of Computer Science, Universita di Torino,
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Going Big in Data Dimensionality:
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
Cluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol
Data Algorithms Mahmoud Parsian Beijing Boston Farnham Sebastopol Tokyo O'REILLY Table of Contents Foreword xix Preface xxi 1. Secondary Sort: Introduction 1 Solutions to the Secondary Sort Problem 3 Implementation
Content-Boosted Collaborative Filtering for Improved Recommendations
Proceedings of the Eighteenth National Conference on Artificial Intelligence(AAAI-2002), pp. 187-192, Edmonton, Canada, July 2002 Content-Boosted Collaborative Filtering for Improved Recommendations Prem
Text Analytics (Text Mining)
CSE 6242 / CX 4242 Apr 3, 2014 Text Analytics (Text Mining) LSI (uses SVD), Visualization Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey
The Image Deblurring Problem
page 1 Chapter 1 The Image Deblurring Problem You cannot depend on your eyes when your imagination is out of focus. Mark Twain When we use a camera, we want the recorded image to be a faithful representation
Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico
Data Science Center Eindhoven Big Data: Challenges and Opportunities for Mathematicians Alessandro Di Bucchianico Dutch Mathematical Congress April 15, 2015 Contents 1. Big Data terminology 2. Various
