Facebook Friend Suggestion
Eytan Daniyalzade and Tim Lipus
1. Introduction

Facebook is a social networking website with an open platform that enables developers to extract and utilize user information and relationships. Our goal was to create a "friendship model" to be used for two purposes: 1) recommend a new friend to a user given the user's existing friends; 2) given a user's friends, determine which of those friends are the user's best friends. To make these predictions, we gathered user profile information such as friends, groups, interests, music, and activities. We trained our models on these features using techniques from both supervised and unsupervised learning. For the first goal, we implemented logistic regression and Naïve Bayes learning algorithms to predict which users are most likely to be friends. For the second task, we used PCA and K-means clustering to compare a given user's friends to each other.

It is highly likely that the friendship structure of one person is very different from that of another. Furthermore, there is large variation among the profile information of different people (e.g., we will not necessarily have the same features for each person, and the relative importance of features may differ from user to user). Therefore, we decided to train our hypothesis separately for every person whose friends we are trying to understand. We use the term "central user" to denote the user for whom we are training the algorithm. Instead of defining features for individual users, we defined the features for each sample as a function of the central user and the sample being compared against. For example, a feature is not the number of friends a user has, but rather the number of friends the user has in common with the central user.

2. Data Acquisition

We used the Facebook API to collect user data. Data collection proceeded as follows: choose central users, find their friends (positive samples), choose a set of non-friends (negative samples), and gather profile information for every positive and negative sample with respect to the central user. Limitations of the API and privacy settings created some obstacles. First, some profiles allow only limited access, which in some cases makes it impossible to see any data other than the name of the user. For this reason, we limited ourselves to profiles within the Stanford network, since most non-Stanford profiles allow us very little access. Second, it is not possible to query directly for all of a given user's friends; one can only query whether two particular users are friends. Therefore, we needed to start with a large set of user IDs, and it was not possible to find all the friends of a user unless our starting set contained all of them. The third problem was the lack of an easy way to get a list of users matching arbitrary criteria (such as Stanford students). We had to collect data less directly: we started with a Stanford student group, found all the Stanford students in that group, then found the other groups to which those users belong, and repeated until we had around 4 samples. It is possible that this method introduced some bias into our data set that could have been avoided had it been possible to select 4 random Stanford users.
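The group-traversal procedure above is essentially a snowball sample over the group-membership graph. The sketch below illustrates the idea; get_group_members, get_user_groups, and is_stanford_user are hypothetical stand-ins for the underlying Facebook API queries, not real API calls.

```python
from collections import deque

def collect_users(seed_group_id, target_count,
                  get_group_members, get_user_groups, is_stanford_user):
    """Snowball-sample user IDs by alternating between groups and their members.

    The three callables are hypothetical wrappers around the Facebook API
    queries described in Section 2; they are placeholders, not real endpoints.
    """
    users, seen_groups = set(), set()
    group_queue = deque([seed_group_id])
    while group_queue and len(users) < target_count:
        group_id = group_queue.popleft()
        if group_id in seen_groups:
            continue
        seen_groups.add(group_id)
        for user_id in get_group_members(group_id):
            if user_id in users or not is_stanford_user(user_id):
                continue
            users.add(user_id)
            # Expand the frontier with the new user's groups.
            group_queue.extend(get_user_groups(user_id))
    return users
```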
Because a given user's friends make up a small percentage of our network, we chose to include in our training and test sets all of the central user's friends (that we found in our set of users) and a limited number of non-friends. We made this choice to include as much information about friends as possible, but we also had to keep in mind the implications of having a training set with a different distribution than the overall data, which we discuss in more detail below.

3. Model Evaluation

One of the major challenges of this project was determining how to evaluate our models. One reason this is difficult is that although the models we used for friend suggestion are typically used for classification, our application is not quite classification. Rather than outputting y=1 (friend) or y=0 (non-friend) for each candidate user, we instead want to select the most likely candidates out of a given pool. Therefore, measuring test error is not as simple as finding the percentage of misclassifications on a test set. One property that an error metric should have is a higher emphasis on precision than recall; we do not need to return all likely candidates, but we want the ones we return to be good. A second problem with the friend suggestion task is that what we are testing is not quite the same as the goal we are trying to accomplish. Our goal is to create an application that suggests new friends who are not already the central user's friends, but we are testing the ability of the model to predict whether a given user is already friends with the central user. Accurately testing the first goal would require a large set of hand-labeled data; we decided it would be better to use the large amount of already existing data as a proxy. For the second task of determining a central user's closest friends, we do not have any labeled training data at all, so it is difficult to make a precise evaluation of our model's performance.

4. Friend Recommendation using Logistic Regression

We trained a linear logistic classifier to classify unlabeled samples as friends or non-friends. Initially, we used a training set of 5 non-friends and approximately 3 friends for each user. To find the optimal classifier we used Newton's method, owing to its rapid convergence and the non-singularity of our feature set. The features used were the ones mentioned above, i.e., the number of common friends, groups, activities, etc. We used K-fold cross-validation (with K=) to estimate our test error. Accuracy on our first run was satisfactory given the level of accuracy acceptable for our application; classification error was on average between % and %. The test error rate increased as we increased the number of features included, despite the fact that each feature was individually a decent friendship indicator. Figure 1 shows how test error varied with an increasing number of training samples; the different plots show error rates for different numbers of features.

Figure 1: Logistic regression with 2, 4, and 6 features (data reduced to 2 dimensions).

As seen in the plots, training error was significantly less than test error with a higher number of features. Further, test error declined with an increasing number of training samples. We interpreted these results as indicators of high variance and acquired more data, increasing our training set to non-friends and around 6 friends. Expanding our training set significantly improved our results, lowering error rates to 12%, 11% and % for 2, 4 and 6 features respectively.
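The per-central-user training step described at the start of this section can be sketched as follows; this is a minimal illustration of logistic regression fit by Newton's method, assuming the feature matrix X (common friends, groups, activities, etc.) and 0/1 labels y have already been built for one central user.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iters=10):
    """Fit logistic regression for one central user via Newton's method.

    X: (m, n) feature matrix (counts of common friends, groups, activities, ...).
    y: (m,) array of 0/1 labels, where 1 means "friend of the central user".
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        gradient = X.T @ (h - y)                   # gradient of the negative log-likelihood
        W = h * (1.0 - h)
        hessian = X.T @ (X * W[:, None])           # X^T diag(h(1-h)) X
        theta -= np.linalg.solve(hessian, gradient)
    return theta
```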
4.1 Weighted Logistic Regression

The main drawback of basic logistic regression is that it penalizes false negatives the same as false positives. However, since our application would only report the samples that it classifies as positive, we want the error on positive guesses to be as low as possible. This can be achieved by weighting negative samples more than positive samples during training. This approach puts more emphasis on maximizing the likelihood of the negative samples and shifts the separating line closer to the positive samples, making our hypothesis less likely to label a sample positive. As seen in Figure 2, as the weight assigned to negative samples increased, both the percentage error on positively guessed samples (i.e., mislabeling of negative data) and the number of positive guesses decreased.

Figure 2: Error on negative samples, number of positive guesses, and total error as the weight on negative samples increases, for n=2 (friends), n=4 (..&groups), and n=6 (..&events) features.
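One way to realize this weighting is to scale each sample's contribution to the log-likelihood, and hence to the gradient and Hessian used by Newton's method. The sketch below assumes a single weight w > 1 applied to every negative sample; the value neg_weight=3.0 is purely illustrative.

```python
import numpy as np

def fit_weighted_logistic_newton(X, y, neg_weight=3.0, n_iters=10):
    """Logistic regression in which negative samples receive weight neg_weight > 1.

    Equivalent to multiplying each negative example's log-likelihood term by
    neg_weight, which pushes the decision boundary toward the positive class
    and makes positive predictions more conservative.
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept column
    w = np.where(y == 1, 1.0, neg_weight)          # per-sample weights
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        gradient = X.T @ (w * (h - y))             # weighted gradient
        S = w * h * (1.0 - h)
        hessian = X.T @ (X * S[:, None])           # weighted Hessian
        theta -= np.linalg.solve(hessian, gradient)
    return theta
```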
4.2 Conclusions on Logistic Regression

A major challenge is figuring out the optimal weight on the negative samples. We deferred tackling this issue until we have gathered sufficient user feedback to decide whether the number of recommendations or their accuracy is the bigger priority.

5. Naïve Bayes

We also trained a Naïve Bayes (NB) model based on the features that each person might have in common with the central user. We trained both a Bernoulli and a multinomial model. In the Bernoulli model, for a given feature of a user (e.g., groups), we let P(x_i = 1 | y) be the probability that the user's i-th group is also shared by the central user. However, we considered the possibility that some groups might be more indicative than others. Therefore, we also created a lexicon of the groups to which the central user belongs and trained a multinomial model in which we let P(x_i = j | y) be the probability that the user's i-th group is the same as the central user's j-th group. If j = 0, it is the probability that the user's i-th group is not shared by the central user. (Note that the multinomial model distinguishes between each group to which the central user belongs, but groups to which he does not belong are treated symmetrically.)

5.1 Modeling the Prior

One difficulty in applying NB is modeling the prior P(y=1). We would normally set this parameter to the fraction of the training examples that are positive, but as discussed above, our training set is not representative of the actual distribution. Furthermore, even if we base our prior on all of the data (not just the data in the training set), it may not match the prior for the question we are actually trying to answer: the probability that a given user is friends with the central user may differ from the probability that a given non-friend is someone with whom the central user might like to be friends. However, the nature of our goal allows us to sidestep this problem. As mentioned above, instead of performing classification, we want to select the most likely candidates. Therefore, we rank the users by the likelihood ratio P(x | y=1) / P(x | y=0), which induces the same ordering as ranking by P(y=1 | x) under any fixed prior. This way, we avoid modeling the prior.
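A minimal sketch of this prior-free ranking is given below. It assumes plain Bernoulli features with Laplace-smoothed per-class parameters, so the feature layout is simplified relative to the mixed Bernoulli/multinomial model described above.

```python
import numpy as np

def rank_by_likelihood_ratio(X_train, y_train, X_cand):
    """Rank candidates by P(x | y=1) / P(x | y=0) under a Bernoulli NB model.

    X_train: (m, n) binary matrix; entry (i, j) is 1 if training user i shares
             feature j (e.g., a particular group) with the central user.
    y_train: (m,) array of 0/1 labels (1 = friend of the central user).
    X_cand:  (k, n) binary matrix of candidate users to rank.
    Returns candidate indices sorted from most to least likely friend.
    """
    def bernoulli_params(X):
        # Laplace smoothing avoids zero probabilities.
        return (X.sum(axis=0) + 1.0) / (X.shape[0] + 2.0)

    phi_pos = bernoulli_params(X_train[y_train == 1])
    phi_neg = bernoulli_params(X_train[y_train == 0])

    # Log-likelihood ratio; the prior P(y=1) cancels out of the ranking.
    log_ratio = (X_cand @ (np.log(phi_pos) - np.log(phi_neg))
                 + (1 - X_cand) @ (np.log(1 - phi_pos) - np.log(1 - phi_neg)))
    return np.argsort(-log_ratio)
```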
5.2 Measuring Accuracy

Unfortunately, as mentioned above, this application of Naïve Bayes makes measuring the test error difficult. A simple method is to let F be the number of friends in the test set, rank the test set by the likelihood ratio, and let the error be the percentage of users among the top F scores that are actually non-friends. This metric (which we call metric A) has the advantage of focusing on false positives. However, in practice we would probably only report the top few candidates, so we care most about the ones at the top. Therefore, we also used metric B, which assigns a higher weight to the test samples that the model ranks higher; specifically, w_i = F + 1 - i (with the weights renormalized to sum to 1). We also let metric C be the same as A, but with F/2 in place of F. The following plots show the test results for our model, where each column shows the test samples for a single central user; friends are shown in blue, non-friends in red. We found the best model (Figure 3-b) to be the one that treats common friends with a multinomial model and all other features with a Bernoulli model. For comparison, the model with all features treated as Bernoulli is shown in Figure 3-a. The error according to each of the three metrics is given as well.
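The three metrics can be computed directly from the ranked test set. The sketch below assumes ranked_is_friend is a boolean array over the test users, sorted from highest to lowest likelihood ratio, and that the test set contains at least one friend.

```python
import numpy as np

def ranking_error_metrics(ranked_is_friend):
    """Compute metrics A, B, and C for a ranked test set.

    ranked_is_friend: boolean array ordered from highest to lowest score;
    True where the test user really is a friend of the central user.
    """
    ranked_is_friend = np.asarray(ranked_is_friend, dtype=bool)
    F = int(ranked_is_friend.sum())              # number of friends in the test set

    # Metric A: fraction of non-friends among the top F ranked users.
    metric_a = np.mean(~ranked_is_friend[:F])

    # Metric B: like A, but slot i (1-indexed) gets weight w_i = F + 1 - i,
    # renormalized to sum to 1, so mistakes near the top cost more.
    weights = np.arange(F, 0, -1, dtype=float)   # F, F-1, ..., 1
    weights /= weights.sum()
    metric_b = np.sum(weights * ~ranked_is_friend[:F])

    # Metric C: same as A but over the top F/2 users only.
    half = max(F // 2, 1)
    metric_c = np.mean(~ranked_is_friend[:half])

    return metric_a, metric_b, metric_c
```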
Figure 3-a (all features Bernoulli): A = 24.5%, B = 12.2%, C = 7.33%.
Figure 3-b (common friends multinomial, other features Bernoulli): A = .7%, B = 11.2%, C = 7.65%.

6. Unsupervised and Semi-Supervised Learning for Best Friend Suggestion

The goal here is to cluster a user's friends based on their similarity and to report the members of the cluster closest to the user as the best friends.

6.1 Reduction to a Single Dimension using PCA

An obvious challenge in unsupervised learning is defining a metric that determines which cluster to report. We tackled this issue by assuming that the features are positively correlated with friendship level. To test this assumption, we created a feature set consisting of our own friends and reduced the data to a single dimension by applying PCA. This single dimension, the principal eigenvector, indicates the direction of highest variation; hence, we postulated that the projections of the better friends would lie further from the origin, indicating a higher number of common features. With ourselves as the central users, this metric correctly identified the people we would consider good friends. Although we could not validate the results rigorously, this experiment indicated that the projection onto the principal component can indicate level of friendship.

6.2 K-Means Clustering

We clustered our data into four clusters (the number of clusters was chosen arbitrarily) using the 8-dimensional feature set and reported a cluster based on our metric of closeness to the central user. Figure 4-a shows the results. It is noteworthy that we applied clustering in the higher-dimensional space, rather than reducing the data to a lower-dimensional space and then reporting points based on their distance from the origin. The two methods cluster points in similar but not identical ways: the former clusters points based on their similarity with each other, whereas the latter clusters them based on their similarity with the central user. Figure 4-b shows the results of the latter method. As further discussed below, it is hard to know which method yields better results without actually getting feedback from users.

6.3 Constrained K-Means Clustering

The clustering methods covered so far do not accommodate user feedback, so we implemented a constrained K-means clustering algorithm that incorporates user feedback as labels on the data to be adhered to while clustering. User feedback takes the form "good friend" or "not good friend". Our algorithm ensures that sample points with the same label are assigned to the same cluster, and that no cluster contains points with different labels. In terms of implementation, constrained K-means differs from regular K-means in the assignment of the labeled samples to a centroid. While regular K-means assigns each sample to a centroid by minimizing that specific sample's Euclidean distance from the centroid, constrained K-means finds the centroid yielding the smallest sum of distances over all samples with a given label and assigns all of them to that centroid. To test this algorithm, we labeled the top outputs of unconstrained K-means and ran constrained K-means on the new semi-labeled data set. The results are shown in Figure 4-c.
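A minimal sketch of the modified assignment step is given below. It assumes labels[i] is None for unlabeled points and a group identifier ("good friend" / "not good friend") otherwise; unlabeled points are assigned as in regular K-means, while each labeled group is assigned jointly to the centroid that minimizes its summed distance. For simplicity the sketch does not additionally prevent two different labeled groups from selecting the same centroid.

```python
import numpy as np

def constrained_assignment(X, centroids, labels):
    """Assignment step of constrained K-means.

    X:         (m, d) data points.
    centroids: (k, d) current centroids.
    labels:    length-m list; None for unlabeled points, otherwise a group id
               from user feedback (e.g., "good friend" / "not good friend").
    Returns an (m,) array of cluster indices; all points sharing a label are
    forced onto the single centroid minimizing their summed distance.
    """
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (m, k)
    assign = np.argmin(dists, axis=1)              # regular K-means assignment

    for group in set(l for l in labels if l is not None):
        members = [i for i, l in enumerate(labels) if l == group]
        # Pick the centroid with the smallest total distance over the group.
        best_centroid = np.argmin(dists[members].sum(axis=0))
        assign[members] = best_centroid
    return assign
```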
6.4 Conclusions on Unsupervised and Semi-Supervised Learning

Testing was a major challenge for our best-friend suggestion algorithms. Given the nature of the problem, we could only test the algorithms on our own friends, and we did not have a coherent metric for gauging the accuracy of the results. Nevertheless, the results were encouraging: 75% of the people reported by the unconstrained K-means algorithm were people that we would classify as good friends. A significant drawback of the semi-supervised learning algorithm we used is that it only adjusts the clusters, not the definition of a good friend, based on user feedback. An algorithm that adjusts the weights of the different features, such as Distance Metric Learning [1], could be more appropriate for the question at hand.

Figure 4-a: K-means in the higher-dimensional space (reduced to 2 dimensions for display). Figure 4-b: K-means in the lower-dimensional space. Figure 4-c: Constrained K-means with labeled samples (reduced to 2 dimensions for display).

7. Conclusion

In all of our algorithms, we found the best feature to be the percentage of friends that users had in common, followed by the percentage of common groups and events. One reason for this result is the sparseness of the data for many of the features. For example, many people have few events listed, so it is common to see users who have no events in common with the central user. For other features, such as activities and music, the problem is even worse because the entries in these categories are user-generated (i.e., prone to spelling errors or to writing the same thing in different ways), which makes it even less likely to see commonality among users. The sparseness of these features makes them harder to use than features with denser data, such as the number of common friends.

However, we believe the errors we found for our friend suggestion models show promise. Depending on the metric, test errors are roughly in the -% range. We believe that friend suggestion could be useful even if only a much smaller fraction (say, one out of five) of our results were relevant. We also realize that our error measures are by no means perfect. As discussed above, since we were testing the ability to predict current friends rather than to suggest new ones, we would ultimately need to test with actual users, and this is even more the case with our best-friend predictor. Therefore, the next step in our work would be to incorporate this model into a Facebook application and gather user feedback.

8. References

[1] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell. Distance Metric Learning, with Application to Clustering with Side-Information. University of California, Berkeley.