Author Profiling: Predicting Age and Gender from Blogs
Notebook for PAN at CLEF 2013

K Santosh, Romil Bansal, Mihir Shekhar, and Vasudeva Varma
International Institute of Information Technology, Hyderabad
{santosh.kosgi, romil.bansal, mihir.shekhar}@research.iiit.ac.in, [email protected]

Abstract

Author profiling is the task of determining the age, gender, native language or personality type of an author by studying the sociolect aspect of their writing, that is, how language is shared by people. In this paper, we propose a machine learning approach to determine an unknown author's age and gender. The approach uses three types of features: content based, style based and topic based. We achieved an accuracy of 64.08% and 64.30% for age, and 56.53% and 64.73% for gender, in English and Spanish respectively.

Keywords: Author Profiling, Topic Modelling, Text Categorization, Natural Language Processing

1 Introduction

The problem of identifying a user's profile from text is of lasting importance, as it helps in fields such as forensics and marketing. For example, in marketing, a manager might want to find the gender and age group of people who like or dislike their products from public reviews. The increasing accessibility of public blogs offers an unprecedented opportunity to harvest information from texts authored by hundreds of thousands of different authors. In this paper, we try to exploit these public blogs to find relations between an author's profile and the language style they use. The main idea behind this task is to analyse how everyday language reflects basic social and personality traits. The profiling dimensions we considered are age and gender.

2 Approach

2.1 The Corpus

We used the blog corpus provided by PAN 2013 [1]. The corpus consists of blogs written in English and Spanish. Each blog is written by either a male or a female author and belongs to one of three age groups (10s: 13-17, 20s: and 30s: 33-47). The corpus is described in more detail in Table 1.

                    EN                          ES
          10s     20s      30s       10s     20s      30s
Male      8,600   42,900   66,800    1,250   21,300   15,400
Female    8,600   42,900   66,800    1,250   21,300   15,400

Table 1. Blog distribution for the English and Spanish datasets

2.2 Features and Experiments

Different people tend to write differently. These differences arise from variations in topics of interest and in writing style, such as word choice and grammar. For example, females tend to write more about wedding styles, and males tend to write more about technology and politics. Further, females use more adverbs and adjectives in their writing than males [8]. We considered these differences in the writing styles and content of male and female bloggers of different ages. Overall we considered three types of features that are useful for distinguishing between categories: content based features, style based features and topic based features. These features are described in detail below.

Content Based Features

Male and female authors tend to write about different topics, so they use different words. Content based features are therefore important for distinguishing between male and female bloggers [9]. For example, a blog about cricket is more likely to be written by a male author than a female one. Such a blog may contain words like cricket, no ball, wide, world cup, icc world cup etc. The occurrence of words like world cup or cricket thus increases the chance that the blog was written by a male blogger, while the occurrence of words or phrases like my husband, pink or boyfriend increases the chance that it was written by a female blogger. Words that are used more frequently by one class than by the other can be used as features. We calculated the frequencies of different N-grams in the documents written by each gender. Then, for every N-gram, we calculated the ratio of its frequencies in the blogs written by male and female bloggers.
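The frequency-ratio computation described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: whitespace tokenisation, lower-casing, add-one smoothing (to avoid division by zero for N-grams seen in only one class) and the function names are all choices made here, and the parameter `k` is left to the caller because its values are not given in this copy of the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(docs, n):
    """Count n-grams over lower-cased, whitespace-tokenised documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.lower().split(), n))
    return counts


def skewed_ngrams(male_docs, female_docs, n=1, k=10, smoothing=1.0):
    """Rank n-grams by the ratio of their relative frequencies in the two classes.

    Returns (top_male, top_female, ratios). The smoothing term is an
    assumption of this sketch; the paper does not say how zero counts
    were handled.
    """
    male = ngram_counts(male_docs, n)
    female = ngram_counts(female_docs, n)
    vocab = set(male) | set(female)
    male_total = sum(male.values()) + smoothing * len(vocab)
    female_total = sum(female.values()) + smoothing * len(vocab)

    ratios = {}
    for gram in vocab:
        p_male = (male[gram] + smoothing) / male_total
        p_female = (female[gram] + smoothing) / female_total
        ratios[gram] = p_male / p_female

    # Sort by ratio (ties broken by the n-gram itself for determinism).
    ranked = sorted(vocab, key=lambda g: (ratios[g], g))
    top_male = ranked[-k:][::-1]   # highest male/female ratio first
    top_female = ranked[:k]        # lowest male/female ratio first
    return top_male, top_female, ratios
```

A ratio well above 1 marks an N-gram as characteristic of male bloggers in the sample, and a ratio well below 1 marks it as characteristic of female bloggers.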
We took the top k N-grams (we used k as and for English and Spanish gender analysis respectively) that differentiate males from females and females from males as features. Similarly, teenagers tend to write more about their friends and mood swings, whereas people in their 20s write more about college life and people in their 30s write more about marriage, jobs and politics. Content based features are therefore also important for distinguishing between bloggers of different age groups. Again, the words with the most skewed ratios are used as features. We used k as for both English and Spanish age analysis.

Style Based Features

Style based features include N-grams of POS tags in documents, punctuation symbols and the number of href links [2,9]. For each of these features we calculated the frequency with which it appears in the corpus. We used the normalised counts to create a numerical vector. This was the only language dependent feature.
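A minimal sketch of how such a style vector could be assembled is given below. Several details are assumptions made for illustration: POS tags are passed in by the caller (any tagger could supply them), punctuation and href counts are normalised by the number of whitespace tokens, and POS N-gram counts are normalised by the number of POS N-grams. The paper only states that normalised counts were used.

```python
import re
import string
from collections import Counter

# Matches the start of an HTML href attribute, e.g. href="...".
HREF_RE = re.compile(r"href\s*=", re.IGNORECASE)


def style_features(text, pos_tags, n=2):
    """Build a dict of normalised style features for one blog post.

    `pos_tags` is the tag sequence for the post, produced by whatever
    tagger is available (hypothetical input in this sketch).
    """
    length = max(len(text.split()), 1)  # avoid division by zero
    features = {}

    # Punctuation symbols: one feature per symbol, normalised by token count.
    punct = Counter(ch for ch in text if ch in string.punctuation)
    for symbol, count in punct.items():
        features["punct:" + symbol] = count / length

    # Number of href links, normalised the same way.
    features["href"] = len(HREF_RE.findall(text)) / length

    # N-grams of POS tags, normalised by the total number of tag n-grams.
    grams = Counter(
        tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)
    )
    total = max(sum(grams.values()), 1)
    for gram, count in grams.items():
        features["pos:" + "_".join(gram)] = count / total

    return features
```

The resulting dictionaries can be mapped to fixed-length numerical vectors by assigning each feature name a column index over the training set.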

Topic Based Features

The N-gram based approach models the top words used by males and females. But the same words are often used in different contexts. For example, males usually use words like daily life to describe their work, whereas females use daily life to describe their love or spiritual life. Males use dresses in the context of pants and coats, whereas females use dresses with words like bridal wear and gowns. Topic based features build on the fact that different categories of people have different topics of interest. We tried to model these differences to predict the age and gender of the author. We ran the LDA algorithm to find topics in the blogs and created a machine learning model based on the probability distribution of each blog over the topics and the class the blog belongs to. To extract the topic based features, we divided the training data in a 60%/40% ratio. We used the 60% portion to extract relevant topics from the blogs, and the 40% portion to train the MaxEnt model that predicts the class from the topic distribution of a blog. The topics were extracted as follows.

Overall Topics

We used the complete 60% portion of the data to generate topics from the blogs. The intuition was that different categories of people tend to write on completely different topics, so modelling authors by their topics would reveal the class to which an author belongs. Using this approach we achieved 52.3% accuracy (using 200 topics) for gender classification. We analysed the topics of the blogs that this method misclassified. We found that although a few topics distinguish completely between males and females, most topics are written about by both. For example, the topic corresponding to dresses and shopping was expected to be written mostly by females, but males were also blogging about it.
This makes it hard for the algorithm to find a topic distribution vector that distinguishes between males and females completely. A similar effect was observed for the different age groups.

Individual Topics

Even if males and females write on the same topic, the words or context they use to describe it differ. This can be seen in the example above: under the topic of dresses and shopping, males talk about pants and coats, whereas females talk about bridal wear and dresses. The Overall Topics method placed both in the same topic, which made the topic noisy. To improve topic creation, we trained the topics separately for each class and predicted the distribution over all of them. This helps us model the context in which a topic is discussed. Using this approach we obtained an overall accuracy of 54% for gender classification. We created 200 topics for each gender and 100 topics for each age group while building the model.

Hybrid Topics

The above method gave better results, but some of the overall topics are good enough to distinguish between classes. We therefore created the feature vector as a probability distribution over both individual and overall topics. We took 200 topics from each gender and 100 topics from each age group, along with 200 overall topics. Using this approach, we obtained an overall accuracy of 54.8% for gender classification. We experimented with different numbers of topics and found 200 topics to perform best.
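The layout of the hybrid feature vector can be sketched as below. This is an illustration only: the topic models are represented by hypothetical callables that map a blog to a probability distribution over topics (in the paper these would come from LDA models trained on the 60% portion), and the random shuffle before the 60%/40% split is an assumption, since the paper does not say how the split was drawn.

```python
import random


def split_60_40(blogs, seed=0):
    """Shuffle and split: 60% for topic extraction, 40% for the classifier."""
    shuffled = list(blogs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def hybrid_topic_vector(blog, overall_model, class_models):
    """Concatenate topic distributions from the overall and per-class models.

    `overall_model` and each value of `class_models` are hypothetical
    callables returning a list of topic probabilities for `blog`.
    """
    vector = list(overall_model(blog))
    for name in sorted(class_models):  # fixed order keeps columns aligned
        vector.extend(class_models[name](blog))
    return vector


def uniform_model(num_topics):
    """Stand-in for a trained topic model: a uniform distribution."""
    return lambda blog: [1.0 / num_topics] * num_topics


# For gender: 200 overall topics plus 200 topics per gender.
gender_models = {"male": uniform_model(200), "female": uniform_model(200)}
example = hybrid_topic_vector("some blog text", uniform_model(200), gender_models)
```

With these settings each blog is represented by a 600-dimensional vector for gender classification. Using only `class_models` corresponds to the Individual Topics variant, and using only `overall_model` corresponds to Overall Topics.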

2.3 Learning Methods

We used a decision tree over the outputs of separate classifiers to predict the class. We divided our corpus into three parts. Using the first part, we trained separate models on the content based, style based and topic based features. We applied these models to the second part and used their outputs to train the final decision tree classifier. The third part was used as the test set. Table 2 shows the machine learning methods used to build the classifiers.

Feature Name             Feature Description                       ML Algorithm Used   ML Library Used
Content Based Features   N-grams                                   SVM                 SVM light [5]
Style Based Features     N-grams of POS tags                       SVM                 SVM light
Topic Based Features     LDA topic model                           MaxEnt              Mallet [6]
Merged Features          Scores of classes from different models   Decision Tree       Mallet

Table 2. Features used while training the models.

3 Conclusion and Future Work

A good system for author profiling is required in domains ranging from the analysis of sensitive text for national security to commercially important data drawn from comments and product reviews. In our approach, we tried to model the author's profile using the writing style and content of the blog. We have shown that the best results were achieved when context information is used along with the content and style of the blog. Future work could incorporate sentiment analysis to discover further differences in text written by authors of different classes. With further development, we can expect much better accuracy in identifying an author's profile.

References

1. PAN author profiling task (2013)
2. Argamon, S., Koppel, M., Pennebaker, J.W., Schler, J.: Automatically profiling the author of an anonymous text. Commun. ACM 52(2) (Feb 2009)
3. Argamon, S., Shimoni, A.R.: Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17 (2003)
4. Bamman, D., Eisenstein, J., Schnoebelen, T.: Gender in twitter: Styles, stances, and social networks. CoRR abs/ (2012)
5. Joachims, T.: Advances in kernel methods, chap. Making large-scale support vector machine learning practical. MIT Press, Cambridge, MA, USA (1999)
6. McCallum, A.K.: Mallet: A machine learning for language toolkit (2002), mccallum/mallet
7. Peersman, C., Daelemans, W., Van Vaerenbergh, L.: Predicting age and gender in online social networks. In: Proceedings of the 3rd international workshop on Search and mining user-generated contents. SMUC '11, ACM, New York, NY, USA (2011)
8. Pennebaker, J.: The Secret Life of Pronouns: What Our Words Say About Us. Bloomsbury USA (2013)
9. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.: Effects of Age and Gender on Blogging. In: Proc. of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs (Mar 2006)
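
The three-part training scheme of Section 2.3 can be sketched as follows. This is a toy illustration of the data flow only: the base learner and the one-level decision stump below are simple stand-ins written for this sketch, not the SVM light, MaxEnt or Mallet decision tree models listed in Table 2, and each "view" is reduced to a single numeric column for brevity.

```python
def train_threshold_scorer(rows, labels, column):
    """Stand-in base learner for one feature view.

    Scores a row by the signed distance of one column from the midpoint
    between the two class means. Assumes both classes are present.
    """
    pos = [r[column] for r, y in zip(rows, labels) if y == 1]
    neg = [r[column] for r, y in zip(rows, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    midpoint = (mean_pos + mean_neg) / 2
    sign = 1.0 if mean_pos >= mean_neg else -1.0
    return lambda row: sign * (row[column] - midpoint)


def train_stump(rows, labels):
    """Stand-in meta learner: the best single-column threshold (depth-1 tree)."""
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({r[col] for r in rows}):
            preds = [1 if r[col] >= threshold else 0 for r in rows]
            correct = sum(p == y for p, y in zip(preds, labels))
            if best is None or correct > best[0]:
                best = (correct, col, threshold)
    _, col, threshold = best
    return lambda row: 1 if row[col] >= threshold else 0


def stack(part1, part2, part3, n_views):
    """Train base models on part 1, the meta model on part 2, test on part 3."""
    (x1, y1), (x2, y2), (x3, y3) = part1, part2, part3
    base = [train_threshold_scorer(x1, y1, c) for c in range(n_views)]

    # Scores of the base models on part 2 become the meta model's features.
    meta_rows = [[model(row) for model in base] for row in x2]
    meta = train_stump(meta_rows, y2)

    preds = [meta([model(row) for model in base]) for row in x3]
    accuracy = sum(p == y for p, y in zip(preds, y3)) / len(y3)
    return preds, accuracy
```

Training the final classifier on a held-out second part, rather than on the data the base models were fitted to, keeps the base-model scores it sees closer to the scores it will receive at test time.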
