Keyword Extraction and Semantic Tag Prediction
|
|
|
- Silvia Curtis
- 9 years ago
- Views:
Transcription
1 Keyword Extraction and Semantic Tag Prediction James Hong Michael Fang Stanford University Stanford University Stanford, CA Stanford, CA [email protected] [email protected] Abstract Content on the web is often organized through user generated tags for intuitive search and retrieval. Such tags convey meta-information about the subject matter of the texts they represent. For this project, we applied machine learning (Bayesian co-occurrence, k-nn, SVM, NNS) to predict tags of StackExchange posts obtained from Kaggle: Facebook Recruiting - Keyword Extraction III. Using our non-parametric, fuzzy Nearest Neighbor Search algorithm, we achieved a F1-score of on a testing set with unseen data and on the Kaggle test set (containing many duplicate data points). Furthermore, when predicting a single tag per post, our algorithm attained approximately 71.1% accuracy (on unseen data), surpassing the 0.65 accuracy attained by Stanley & Byrne (2013). Our keyword-tag co-occurrence model and fuzzy NNS proved to be fast and practical for large-scale subject and tag prediction problems with tens-of-thousands of tags and training documents. 1. Introduction Tagging is one of the simplest ways to organize web content. Allowing users to specify subjects, metadata tags are vital for efficient search and indexing. Many contemporary social media sites support some system of content tagging (Stanley & Byrne, 2013). An automated tagging system to model user generation of tags would therefore be useful for purposes of tag suggestion, document and topic clustering, and tag validation. 2. Data Analysis and Preprocessing From the Kaggle competition, we obtained a training set with 6,034,195 StackExchange posts, consisting of the title, body, and tags, as well as a testing set consisting of 2,013,337 title and bodies of posts. These posts cover a broad range of topics from 110 StackExchange sites (e.g. StackOverflow, MathOverflow, AskUbuntu). Each post has up to 5 tags, and the number of tags is roughly modeled by a normal distribution with a mean of 2.89 and standard deviation of In accordance with Zipf s law, tag frequency drops off dramatically; the top 100 tags constitute 41.1% and the top 1,000 tags constitute 71.5% of all tags. Through data analysis we observed a high degree of redundancy in the raw datasets. For instance, the raw training set contained only 4,125,233 unique points. In addition, 1,048,575 points in the raw testing set are duplicated from the training set. In conducting our experiment, we elected to train and test our algorithms exclusively on non-duplicate data, since this would more closely resemble real-world conditions and studies by other researchers (Stanley & Byrne, 2013). The data was preprocessed as follows (with an emphasis on speed) 1 : 1. Create a custom lexicon using an English dictionary with stop words removed and terms extracted from the tag set Apply greedy multi-word matching to capture multi-word terms in the lexicon. 3. Filter out terms not in the lexicon (removing noise such as misspellings or arbitrary variable names in embedded code segments). 1 We also tried stemming, but it proved ineffective for anything other than space reduction later in the co-occurrence model as most keywords could not be stemmed. POS tagging proved to be too slow on large datasets. Our preprocessing procedure also had the effect of reducing the training dataset from over 7 GB to under 1 GB. 2 An alternative method of generating a domain specific lexicon for StackOverflow posts would be to use computer science textbook indexes as a corpus.
2 3. Evaluation Criteria We measured performance in two ways: the Mean F1-score, and accuracy predicting a single tag (both on non-duplicated test data, using hold-out cross validation for 10,000 labeled posts separated from the nonduplicated training set). F1-score is formulated below, where p is precision and r is recall, tp is true positive frequency, fp is false positive frequency, and fn is false negative frequency. 4. Model Selection F 1 = 2pr, p = tp, r = p+r tp+fp tp tp+fn We applied a variety of models to tag prediction. From our initial experiments with naïve Bayes and SVM classifiers for individual tags (e.g. to predict whether c# is a tag), we concluded that such an approach is intractable on large tag sets containing hundreds or thousands of tags. Instead we focused on algorithms that considered thousands of tags using heuristics (co-occurrence) to reduce search space. To measure learning rate, we trained each model on successively larger datasets. 4.1 (Greedy) Tag in Text Baseline In our baseline algorithm, we compiled a list of common tags. For each test example, we simply predicted whichever tags from the list were present in the text of the post. This returned an F1-score of Nearest Neighbor Search for Top 500 Tags 3 Compared to the greedy tag in text approach, Nearest Neighbor Search (NNS) on the top 500 tags provided a more realistic baseline for a simple tagging system. The algorithm is as follows: 1. For each tag y in the top 500, concatenate all training examples with tag y. Generate a bag-of-words (BOW) vector with frequency counts, and apply the TF-IDF transform with document frequency computed using the entire training set. (These vectors correspond to centroids for each tag.) 2. For each test example, generate a BOW vector and apply the TF-IDF transform. Compute the distance between the test example vector and every tag centroid vector using cosine similarity. 3. Rank the tags by similarity score to obtain predictions. There are many flaws with this approach. First, the top 500 tags account for only 61.8% of all the tags in all posts in the training set, imposing a strict upper bound on recall. Also, more efficient space partitioning algorithms for NNS break down in high dimensional feature spaces, making NNS computationally expensive (linear search). Consequently, we were unable to efficiently process any training set larger than 100,000 samples using NNS for 500 tags. However, as a baseline, we attained a F1-score of approximately 0.23 and an accuracy of 39.7% when predicting a single tag. 4.3 Tag-Keyword Co-occurrence Model Using our training examples, we constructed a co-occurrence matrix to model the joint probability of word and tag presence in a post. From this we can approximate upper bounds on the conditional probabilities for each tag, given a set of words in a query. This allows us to quickly identify keywords that are strongly correlated to individual tags For each tag y, estimate max P( y word) using the co-occurrence matrix. word query 2. Return the top k tags. 3 In the project milestone write-up, this NNS approach was referred to as Nearest Tag Distributions. 4 To reduce space complexity, non-keyword entries in the matrix can be removed. This serves as an extra filter against stop words.
3 While simple, this algorithm captures approximately 70% of the true tags within the first tags predicted. Thus, it serves as a potential intermediate filter for the k-nn and SVM is tag? binary classifiers SVM and k-nn Extensions to Co-occurrence Model The co-occurrence model suffered from high bias when predicting less salient tags (those lacking specific keywords). We attempted to remedy this bias by using additional text features based upon the word-tag results from the co-occurrence matrix, and feeding them to SVM and k-nn classifiers. This retains the advantage of working in lower dimensions (instead of working in the space of the lexicon).then, we applied forward search to determine the most relevant features on each algorithm, which were: the probability bounds given in the above, tag presence in text, and the overall tag probability. First, we used the k-nearest Neighbors algorithm with large k (50) to obtain probability estimates based on the proportion of positive neighbors. Next, we experimented on class-weighted and regularized SVM classifiers with linear, polynomial, rbf, and sigmoid kernels, using the functional margin as a measure of confidence for each candidate tag. For the SVM, the best performance was observed with an rbf kernel and positive:negative weights of 7:1. Both SVM and k-nn improved performance on small training sets, where the co-occurrence model was prone to over-fit. However, they were less effective on larger training sets. 4.4 (High Dimensional) Fuzzy Nearest Neighbor Search The main issue that plagued our 500-tag NNS algorithm was the high dimensionality of the feature space. To address the curse of high dimensionality, we implemented a fuzzy/approximate NNS algorithm for efficient lookup and computation of semantic similarity across all of the 42,048 possible tags observed in the training set. Using the keyword-tag co-occurrence model presented above, we narrowed the search space of possible tags to candidate tags, and applied linear NNS in this local search space. Our algorithm: 1. For a test example, use the keyword-tag co-occurrence model to predict candidate tags. 2. For each candidate tag, extract its corresponding row in the keyword-tag co-occurrence matrix as a tag centroid. To increase the salience of the keywords (top non-zero entries) corresponding to each tag, we zeroed out all but the largest 500 components and renormalized the centroids. 3. Compute cosine distance (similarity measure) between the test example (a normalized BOW vector) and each candidate tag-centroid, and re-rank the tags by similarity. 4. Output the tags corresponding to the k-nearest tag centroids. In subsequent experiments, we adjusted our fuzzy NNS similarity function to penalized candidate tags with low semantic similarity, and reward tags with high similarity. Whereas the keyword-tag co-occurrence model generally performed well on common and specific tags such as PHP, CSS, and C#, which correspond strongly to specific keywords, fuzzy NNS improved performance on more general tags with high semantic similarity (e.g. algorithms, parsing, image). Furthermore, fuzzy NNS improved precision when predicting similar, hierarchical tags (e.g. facebook, facebook-api, and facebook-graph-api). Predicting three tags per query, we achieved recall and precision on 10,000 unseen test examples when considering all 42,048 possible tags Results Table 1. Accuracy Predicting a Single Tag Model \ # of training examples k 100k 3000k Keyword-tag co-occurrence co-occurrence & k-nn co-occurrence & SVM Fuzzy NNS When only considering the top 200 tags, fuzzy NNS achieves precision of 0.450, recall of 0.655, and F1 of
4 F1-score Accuracy Performance on Predicting a Single Tag # of training examples Word-Tag Co-occurrence Co-occurrence & TF-IDF Co-occurrence & k-nn Co-occurrence & SVM Co-occurrence & HD Fuzzy NNS Co-occurrence & HD fuzzy NNS & multiword features Figure 1. On the single tag prediction benchmark, our high-dimensional fuzzy-nns algorithm performed 7.9% better than the keyword-tag co-occurrence model alone when trained on 3,000,000 training examples. On smaller training sets, the k-nn classifier (on intermediate features) with keyword-tag co-occurrence improved test accuracy as the cooccurrence model (even with smoothing) is highly prone to over-fitting on small training sets (<100,000). Table 2. F1-Score Performance (Predicting 3 tags per query) Model \ # of training examples k 100k 3000k Keyword-tag co-occurrence co-occurrence & k-nn co-occurrence & SVM Fuzzy NNS Tagging Performance on F1-measure # of training examples Tag in text? Nearest neighbors (500 tag baseline) Word-Tag Co-occurrence Co-occurrence & TF-IDF Co-occurrence & k-nn Co-occurrence & SVM Co-occurrence & HD fuzzy NNS Co-occurrence & HD fuzzy NNS & multiword features Figure 2. Similar to the single tag metric, the fuzzy-nns algorithm attained significantly better F1 performance ( ) over our standalone co-occurrence algorithm and ( ) top-500 tag NNS algorithm. Results in Kaggle Competition: Using title string hashing to identify duplicates in the Kaggle test set and then fuzzy NNS to tag the remaining non-duplicates, we achieved a F1-score of on the Kaggle competition metric, and ranked 14 of 280 teams at the time of submission and 22/301 as of 12/12/ Discussion The main challenges of the StackExchange tag prediction problem include: (1) the large number of possible tags, (2) the large number of word/multi-word features, (3) the size of the training/testing sets, (4) the difficulty of measuring semantic similarity between domain-specific terms, and (5) high variability in user tagging behavior. 6 For each test query, we returned exactly 3 tags, regardless of how many tags are in the solution. This is one area that can be optimized for better performance on the F1-metric used by Kaggle.
5 Our co-occurrence model addresses (1) and (2), allowing us to estimate the probability of a tag. Furthermore, the co-occurrence matrix generated from our 3,000,000 point training set can be easily stored to fit into main memory, allowing for high performance keyword extraction. To reduce the dimensionality of our features and estimate semantic similarity between documents, we originally implemented a LSI model. However, LSI does not scale well to large training sets with over 1 million data points. Instead, tag centroids generated from the keyword-tag co-occurrence were more efficient to compute 7 and allowed us to process the entire Kaggle testing set in reasonable time and overcome the issues of high dimensionality (3) with considerable improvement in precision and recall over the co-occurrence model without fuzzy NNS (4). Finally, (5) users varied on the number and the specificity of tags. For this reason, accuracy predicting only 1 tag and F1-score on all tags are both critical metrics of a tagger s performance, the former testing accuracy on overall subject prediction, and the latter on the viability of the algorithm as a tag recommendation system. Using a regression model to predict the number of tags would probably improve performance on F1 and Kaggle rankings. Finally, compared with the StackExchange prediction study by Stanley and Byrne (2013), which used an ACT-R inspired co-occurrence to achieve 65% accuracy on the single tag metric, fuzzy NNS achieved better accuracy (71%) with similar training set sizes. The difference is most likely the product of re-ranking candidate tags by tag-centroid similarity. Ultimately, our results demonstrate that a fuzzy NNS algorithm based on tag-keyword co-occurrence and cosine distance can be an effective method for large scale tag and subject classification given enough training data. This relates to similar problems such as hashtag prediction, tag recommendation, and search and indexing. An additional application for tag prediction would be to filter spam/irrelevant tags and posts (Stanley & Byrne, 2013). 7. Future Directions Here are five possible areas of improvement. In order to further improve precision and recall of the tagging fuzzy NNS algorithm, (i) a more sophisticated similarity function can be used to estimate semantic similarity. For instance, Wikipedia pages or Google search results could be used to generate expanded text representations for less common keywords (Sahami & Heilman, 2006). Additionally, syntactic parsing strategies could improve data preprocessing. (ii) Unsupervised clustering can be applied training examples with common tags, so as to model multiple tag-sub-centroids for more precise similarity comparison. (iii) Applying locality sensitive hashing (MinHash) could allow for tractable NNS against not only tag-centroids but the training examples themselves (Das, Datar, Garg, & Rajaram, 2007). (iv) For practical subject identification systems, modeling tag specificity through tag-tag hierarchies can better balance the trade-off between accuracy and specificity in tag prediction (Deng, Krause, Berg, & Li, 2012). Finally, (v) using distributed computing would allow more sophisticated preprocessing steps, feature extraction, and semantic similarity measures. 8. Bibliography [1] Das, A., Datar, M., Garg, A., & Rajaram, S. (2007). Google News Personalization: Scalable Online. WWW 2007 (pp ). Banff: ACM. [2] Deng, J., Krause, J., Berg, A. C., & Li, F.-F. (2012). Hedging Your Bets: Optimizing Accuracy-Specificity Trade-offs in Large Scale Visual Recognition. CVPR, (pp ). Providence. [3] Kaggle. (2013). Facebook Recruiting III - Keyword Extraction. Retrieved from Kaggle: [4] Kuo, D. (2011). On Word Prediction Methods. Berkeley: UCB/EECS. [5] Lott, B. (2012). Survey of Keyword Extraction Techniques. [6] Manning, C. D., & Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press. [7] Sahami, M., & Heilman, T. D. (2006). A Web-based Kernel Function for Measuring the Similarity. WWW 2006 (pp ). New York: ACM. [8] Stanley, C., & Byrne, M. D. (2013). Predicting Tags for StackOverflow Posts. ICCM (pp ). Ottawa: ICCM. 7 Using fuzzy NNS, we were able to predict tags for the entire Kaggle test set (~2 million queries) on a 2012 MacBook Air with a 1.8 GHz DC ULV processor and 8GB of RAM in the span of 5 hours.
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum
Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?
Big Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
Forecasting stock markets with Twitter
Forecasting stock markets with Twitter Argimiro Arratia [email protected] Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,
Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews
www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie
Search Engines. Stephen Shaw <[email protected]> 18th of February, 2014. Netsoc
Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,
Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006
Object Recognition Large Image Databases and Small Codes for Object Recognition Pixels Description of scene contents Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Self Organizing Maps for Visualization of Categories
Self Organizing Maps for Visualization of Categories Julian Szymański 1 and Włodzisław Duch 2,3 1 Department of Computer Systems Architecture, Gdańsk University of Technology, Poland, [email protected]
Building a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering
Applying Machine Learning to Stock Market Trading Bryce Taylor
Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Mining a Corpus of Job Ads
Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Collaborative Filtering. Radek Pelánek
Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains
Music Mood Classification
Music Mood Classification CS 229 Project Report Jose Padial Ashish Goel Introduction The aim of the project was to develop a music mood classifier. There are many categories of mood into which songs may
Detecting E-mail Spam Using Spam Word Associations
Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 [email protected] 2 [email protected]
Data Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Author Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
The Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
Data Mining Analytics for Business Intelligence and Decision Support
Data Mining Analytics for Business Intelligence and Decision Support Chid Apte, T.J. Watson Research Center, IBM Research Division Knowledge Discovery and Data Mining (KDD) techniques are used for analyzing
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
A Survey on Pre-processing and Post-processing Techniques in Data Mining
, pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,
Similarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
1 o Semestre 2007/2008
Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction
Predicting Flight Delays
Predicting Flight Delays Dieterich Lawson [email protected] William Castillo [email protected] Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS
ABSTRACT KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS In many real applications, RDF (Resource Description Framework) has been widely used as a W3C standard to describe data in the Semantic Web. In practice,
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework
An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University
Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r
Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
Classifying Manipulation Primitives from Visual Data
Classifying Manipulation Primitives from Visual Data Sandy Huang and Dylan Hadfield-Menell Abstract One approach to learning from demonstrations in robotics is to make use of a classifier to predict if
American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access
Analysis of Tweets for Prediction of Indian Stock Markets
Analysis of Tweets for Prediction of Indian Stock Markets Phillip Tichaona Sumbureru Department of Computer Science and Engineering, JNTU College of Engineering Hyderabad, Kukatpally, Hyderabad-500 085,
Automatic Text Processing: Cross-Lingual. Text Categorization
Automatic Text Processing: Cross-Lingual Text Categorization Dipartimento di Ingegneria dell Informazione Università degli Studi di Siena Dottorato di Ricerca in Ingegneria dell Informazone XVII ciclo
Fast Matching of Binary Features
Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been
Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015
Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)
Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of
Sentiment analysis using emoticons
Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL
SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India
SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING
I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Big Data Text Mining and Visualization. Anton Heijs
Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark
WE DEFINE spam as an e-mail message that is unwanted basically
1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir
Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. [email protected]
Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang [email protected] http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
A Statistical Text Mining Method for Patent Analysis
A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, [email protected] Abstract Most text data from diverse document databases are unsuitable for analytical
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang
Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform
Tracking and Recognition in Sports Videos
Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey [email protected] b Department of Computer
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
Feature Subset Selection in E-mail Spam Detection
Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature
Machine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
Identifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching
Assisting bug Triage in Large Open Source Projects Using Approximate String Matching Amir H. Moin and Günter Neumann Language Technology (LT) Lab. German Research Center for Artificial Intelligence (DFKI)
Support Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France [email protected] Massimiliano
Personalized Hierarchical Clustering
Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.uni-magdeburg.de
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization
A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization Ángela Blanco Universidad Pontificia de Salamanca [email protected] Spain Manuel Martín-Merino Universidad
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II
Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.
White Paper Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset. Using LSI for Implementing Document Management Systems By Mike Harrison, Director,
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
