Sentiment analysis using emoticons
Royden Lewis, Kayhan Moharreri, Steven Ware
Department of Computer Science, Ohio State University

Problem definition

Our aim was to apply machine learning algorithms to determine the emotion of an author from the contents of his or her tweets. Our assumption was that we can judge whether an author is happy or sad based on his or her choice of words.

Preprocessing and feature extraction

To train our classifiers we used the Twitter dataset [1], which is large and easy to obtain. Tweets (posts on twitter.com by Twitter users) are short and concise, usually no more than one or two sentences long. Sentences were assumed to have certain emotions associated with them (Happy, Sad, Angry, Neutral, etc.). Ideally, humans would have labeled each sentence as conveying a particular emotion, but given the size of the dataset (an estimated 300 million tweets) this would have been highly impractical. Hence, we decided to exploit emoticons to label our training tweets as Happy or Sad. The assumption was that if a person used a happy emoticon, then that person was probably happy at the time of posting the tweet, and likewise for a sad emoticon.

A typical tweet in our dataset would look something like the one shown in Figure 1.

[Figure 1: an example tweet in the dataset format]
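The emoticon-based labeling described above can be sketched roughly as follows. This is only an illustration: the emoticon lists, the whitespace tokenization, and the choice to discard tweets that contain both kinds of emoticon are assumptions, since the paper does not list the emoticons it used.

```python
# Hypothetical emoticon lists; the paper does not enumerate the ones it used.
HAPPY_EMOTICONS = {":)", ":-)", ":D", "=)"}
SAD_EMOTICONS = {":(", ":-(", "=("}


def label_tweet(text):
    """Return 'happy', 'sad', or None.

    None means the tweet has no listed emoticon, or has both kinds
    (dropping mixed tweets is an assumption made for this sketch).
    """
    tokens = text.split()  # naive whitespace tokenization (assumption)
    has_happy = any(tok in HAPPY_EMOTICONS for tok in tokens)
    has_sad = any(tok in SAD_EMOTICONS for tok in tokens)
    if has_happy == has_sad:  # neither kind, or both kinds
        return None
    return "happy" if has_happy else "sad"
```

A tweet for which `label_tweet` returns `None` would simply be left out of the training set in this sketch.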
Please note that the tweet in Figure 1 is fictitious, but its format is the same as that of the dataset. Information in the tweet that was not required for training our classifiers, such as user names, tweet dates and URLs, was removed. Stop words such as "a", "are" and "be" were also removed, as were very infrequent words, since these were unlikely to contribute much to the training.

Only tweets with happy or sad emoticons were retained. We considered only these tweets because:
1) happy and sad emoticons are rarely used together in the same tweet, and
2) other emoticons are rarely used, so they may not contribute much to the training.

Non-standard words such as LOL or ROTFL were not removed because they sometimes correlate highly with the emoticon being used and usually signify some emotion.

Unbalanced training data was another problem that we came across. The ratio of happy tweets to sad ones was 9 to 1. We believe this was biasing our classifier's predictions towards the Happy class, so we added more Sad tweets to the training data set. Nearly 440,000 such tweets were shortlisted.

Tweets were converted into a bag-of-words format, so the ordering of the words is ignored for classification. We also maintained a dictionary of all the words that appeared at least once.

Description of Machine Learning Algorithms used

Naïve Bayes Classifier

We modeled our bag of words as unigrams (a single-word dictionary), i.e. we assumed that the occurrence of each word given the class is independent of any other word in the sentence for the same class.
Mathematically, for a tweet with words $w_1, \dots, w_n$ and a class $C_j$:

$$P(w_1, \dots, w_n \mid C_j) = \prod_{i=1}^{n} P(w_i \mid C_j)$$

Out-of-Dictionary Words (ODWs) are another problem with the Naïve Bayes classifier. Words in a testing sample that were not seen in the training phase would have a probability of zero, which is undesirable because it is multiplied by the other probabilities, resulting in a zero probability for the entire tweet.

While implementing our Naïve Bayes classifier we used some of the concepts from a paper by David Ahn & Balder ten Cate [2]. The paper mentions a technique called Laplace's law of smoothing, which we used with a slight variation. For dictionary words we used the formula below:

[equation not preserved]

For the ODWs we used the following formula:

[equation not preserved]

Here we describe how we arrived at this modified method of smoothing. We build a Virtual Tweet, a long tweet that contains all the words in the dictionary plus one word that represents any unseen word. In this setup, probabilities are calculated according to the equations above.

Another interesting problem with the Naïve Bayes classifier that we came across via this paper was the possibility of underflow due to repeated multiplication of small probabilities. To solve this problem we added the logs of the probabilities instead of multiplying the probabilities. Assuming a sample testing tweet $T = (w_1, \dots, w_n)$, where $w_i$ is a word in that tweet and $C_j$ is a class, then

$$\log P(T \mid C_j) = \sum_{i=1}^{n} \log P(w_i \mid C_j)$$
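One way to read the Virtual Tweet idea is sketched below, and it should be taken as an assumption rather than the authors' exact formulas. Each class receives one extra occurrence of every dictionary word plus one placeholder token for unseen words, so a word's smoothed probability is (count + 1) / (class word total + dictionary size + 1), and an ODW gets 1 / (class word total + dictionary size + 1). The class name `NaiveBayesSketch`, its method names, and the use of a class prior are likewise illustrative choices.

```python
import math
from collections import Counter


class NaiveBayesSketch:
    """Unigram Naive Bayes with 'virtual tweet' smoothing (assumed form)."""

    def __init__(self):
        self.word_counts = {}        # class label -> Counter of word counts
        self.tweet_counts = Counter()  # class label -> number of tweets
        self.dictionary = set()      # every word seen during training

    def train(self, tweets):
        """tweets: iterable of (list_of_words, label) pairs."""
        for words, label in tweets:
            self.word_counts.setdefault(label, Counter()).update(words)
            self.tweet_counts[label] += 1
            self.dictionary.update(words)

    def log_word_prob(self, word, label):
        counts = self.word_counts[label]
        # Virtual tweet: +1 for each dictionary word, +1 for the
        # placeholder token that stands for any unseen word.
        denom = sum(counts.values()) + len(self.dictionary) + 1
        # counts[word] is 0 for an unseen word, giving 1 / denom.
        return math.log((counts[word] + 1) / denom)

    def predict(self, words):
        total = sum(self.tweet_counts.values())
        scores = {}
        for label in self.word_counts:
            # Log prior is an assumption; the source does not mention it.
            score = math.log(self.tweet_counts[label] / total)
            # Sum of logs instead of a product of probabilities.
            score += sum(self.log_word_prob(w, label) for w in words)
            scores[label] = score
        return max(scores, key=scores.get)
```

Because every word, seen or unseen, receives a non-zero probability, the log of each term is always defined, which is what allows the sum-of-logs scoring to be used safely.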
K-Nearest Neighbor classifier

Two flavors of the K-Nearest Neighbor classifier were used.

Centroid-based Nearest Neighbor

Since we already have two clusters of tweets labeled Happy and Sad, we calculate the centroid of each cluster and check whether a new tweet that needs to be classified is more similar to the centroid of the Happy cluster or to that of the Sad cluster. K in this case is effectively 1. The centroid for cluster i can be calculated using the following formula (DW: dictionary word, N: dictionary size):

[equation not preserved]

[Figure 2: two clusters with their centroids and a point to be classified]

Figure 2 describes this approach. It has two clusters whose elements are either red squares or blue rhombuses. The X and the green triangle are the centroids of the respective clusters, and the black dot is the element that needs to be classified. For each class we calculate the Cosine or Jaccard similarity [3:74] between the centroid of that class and the testing tweet. The class whose centroid has the higher similarity is declared the predicted class for the testing tweet. Below is the formula for calculating the similarity using the Cosine measure, where $\mathbf{c}_i$ is the centroid of class i and $\mathbf{t}$ is the vector of the testing tweet:

$$\cos(\mathbf{c}_i, \mathbf{t}) = \frac{\mathbf{c}_i \cdot \mathbf{t}}{\lVert \mathbf{c}_i \rVert \, \lVert \mathbf{t} \rVert}$$
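The centroid-based approach with the Cosine measure could be sketched as follows. The centroid here is taken to be the mean word-count vector of a class, which is an assumption because the paper's own centroid formula is not available; the function names and the sparse dict representation are also illustrative.

```python
import math
from collections import Counter


def centroid(tweets):
    """Mean word-count vector of a list of tokenized tweets (assumed form)."""
    total = Counter()
    for words in tweets:
        total.update(words)
    n = len(tweets)
    return {word: count / n for word, count in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(word, 0.0) for word, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def classify_by_centroid(words, centroids):
    """Return the label whose centroid is most similar to the tweet."""
    vec = Counter(words)
    return max(centroids, key=lambda label: cosine(centroids[label], vec))
```

Since the centroids are computed once, classifying a tweet costs one similarity computation per class rather than one per training tweet.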
The similarity using the Jaccard measure is calculated as follows, with $\mathbf{c}_i$ the centroid of class i and $\mathbf{t}$ the vector of the testing tweet:

$$J(\mathbf{c}_i, \mathbf{t}) = \frac{\mathbf{c}_i \cdot \mathbf{t}}{\lVert \mathbf{c}_i \rVert^2 + \lVert \mathbf{t} \rVert^2 - \mathbf{c}_i \cdot \mathbf{t}}$$

K-Nearest Neighbors

Using the traditional K-Nearest Neighbor classifier, when a testing tweet came in to be classified as Happy or Sad, we would find the K most similar tweets in the training dataset. If the majority of the K most similar tweets were Happy tweets, then the new tweet would be classified as a Happy tweet. Otherwise, it would be classified as a Sad tweet. K was always chosen to be an odd number, so that the vote could never be tied and a tweet would be classified as either Happy or Sad. We used the same Cosine or Jaccard similarity measures as in the centroid-based nearest neighbor classifier.

Results and Method of training and testing

In all test cases, a testing tweet was said to have been classified accurately if the label (happy or sad) predicted by the classifier was the same as the label (the emoticon) that existed for that testing tweet.

For testing the K-nearest neighbor classifier, we chose a much smaller data set of 10,000 tweets, because the K-nearest neighbor algorithm is very slow: the larger the training data set, the slower the algorithm. We then did a 10-fold cross validation on the data set. Figure 3 shows a plot of the accuracy vs. the value of K for the Cosine and Jaccard similarity measures. The data set used in this case included randomly chosen tweets that had happy or sad emoticons.
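The traditional K-NN vote described above might look like the sketch below. It uses the extended (vector) form of the Jaccard measure on word-count vectors, which is an assumption about which Jaccard variant was meant; the function names and the tiny training set in the usage note are hypothetical.

```python
from collections import Counter


def extended_jaccard(u, v):
    """Extended Jaccard similarity of two sparse count vectors."""
    dot = sum(x * v.get(word, 0) for word, x in u.items())
    denom = (sum(x * x for x in u.values())
             + sum(x * x for x in v.values())
             - dot)
    return dot / denom if denom else 0.0


def knn_predict(words, training, k=3):
    """Majority vote among the k training tweets most similar to `words`.

    training: list of (list_of_words, label) pairs; k should be odd so
    that a two-class vote cannot tie.
    """
    vec = Counter(words)
    # Every training tweet is compared with the test tweet: O(T) work.
    ranked = sorted(
        training,
        key=lambda item: extended_jaccard(vec, Counter(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

For example, with a training list such as `[(["love", "great"], "happy"), (["hate", "awful"], "sad"), ...]`, calling `knn_predict(["great", "love", "day"], training, k=3)` returns the label held by most of the three nearest tweets.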
[Figure 3: accuracy vs. the value of K for the Cosine and Jaccard similarity measures]

In another case, we varied the size of the training data set. The training set had an (almost) equal number of happy and sad tweets. The same training set was used for the Naïve Bayes classifier as well as both flavors of the nearest neighbor classifier. The testing set comprised 1000 randomly chosen tweets with happy and sad emoticons. The same testing data set was used for all three classifiers. Figure 4 shows a plot of how the accuracy varies with the size of the training data set for all three classifiers.

[Figure 4: accuracy vs. size of the training data set for all three classifiers]

Lastly, we also tested the Naïve Bayes classifier with no smoothing, with smoothing, and with smoothing plus log probabilities. Figure 5 shows a plot of the accuracy vs. size of the training dataset for all three methods.

[Figure 5: accuracy vs. size of the training data set for the three Naïve Bayes variants]

Discussion

1) Our accuracy would not improve much beyond a certain point. On further analysis we discovered that people used emoticons in different ways than we expected. This may imply that emoticons are perhaps not the best labels for sentiment analysis.
2) Smoothing improved the accuracy of the Naïve Bayes classifier. Without it, words in a testing sample that had not been seen in the training phase would have a probability of zero, which when multiplied by other probabilities would result in a zero probability for the tweet as a whole, possibly leading to misclassification.
3) Log probabilities for the Naïve Bayes classifier gave us substantially better results. We assume that this is due to the avoidance of underflow caused by multiplying very small probabilities.
4) We didn't handle negation. It's possible we would have gotten better results if we had handled it. There were 5625 occurrences of negations in 93,000 tweets.
5) We didn't take sentence structure into account. We're not sure whether this would increase the accuracy of classification by much, since people on Twitter often do not follow the sentence structures that we would normally learn in school.
6) We had initially planned to use the perceptron, but since our training dataset was so large, we were unsure whether it would ever converge and, if it did, how long it would take. We do not know if the feature space is linearly separable.
7) In the case of traditional K-NN, since each testing tweet needs to be compared with all the training tweets, the time complexity for each testing tweet is O(T), where T is the size of the training dataset, which is quite large. In the case of the centroid-based nearest neighbor, since the centroids are calculated only once, the time complexity is much lower. However, there is a tradeoff in terms of accuracy.
8) As the value of K is increased in the traditional K-NN classifier, the accuracy seems to increase. When K is small, it's possible that noisy training tweets may cause misclassification.
9) For large training sets, we discovered that the Jaccard similarity measure performs slightly better than the Cosine similarity measure. For smaller training sets, though, they seem to be on par with each other.
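The underflow effect mentioned in point 3 can be illustrated with a few lines of code. The probability values below are hypothetical and chosen only to trigger underflow in double-precision arithmetic; they are not taken from the experiments.

```python
import math

# Hypothetical per-word probabilities: 100 words, each with P = 1e-5.
word_probs = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is far below the
# smallest positive double, so the running product collapses to 0.0.
product = 1.0
for p in word_probs:
    product *= p

# Adding logs instead keeps the score as an ordinary negative number.
log_score = sum(math.log(p) for p in word_probs)
```

Once the product reaches zero for every class, the classes can no longer be compared, whereas the log scores remain distinct and can still be ranked.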
Acknowledgements

We would like to thank Dave Fuhry for sharing the Twitter data set with us. We would also like to thank Prof. Eric Fosler-Lussier for his guidance.

References

[1] Inc. (US), Tweets from 2008 and
[2] David Ahn & Balder ten Cate. Simple language models and spam filtering with Naive Bayes,
[3] Tan, Steinbach & Kumar, Introduction to Data Mining, 4th ed., Pearson Education, Inc.,
