Twitter Data Mining for Events Classification and Analysis
|
|
|
- Howard Gallagher
- 10 years ago
- Views:
Transcription
1 Twitter Data Mining for Events Classification and Analysis Nausheen Azam, Jahiruddin, Muhammad Abulaish, SMIEEE, and Nur Al-Hasan Haldar School of IT, Centre for Development of Advanced Computing (CDAC), Noida, India Department of Computer Science, Jamia Millia Islamia (A Central University), Delhi, India Centre of Excellence in Information Assurance, King Saud University, Riyadh, KSA Abstract The increasing popularity of the micro-blogging sites like Twitter, which facilitates users to exchange short messages (aka tweets) is an impetus for data analytics tasks for varied purposes, ranging from business intelligence to nation security. Twitter is being used by a large number of users for events update and sentiment expression. Since tweets are generally unstructured in nature and do not follow grammatical structures, parsing techniques generally do not work well due to incorrect parts-of-speech assignment to individual words. In this paper, we have proposed an n-gram based statistical approach to identify significant terms and using them for vector-space modelling of the tweets. Thereafter, a social graph generation method is proposed, considering tweets as nodes and the degree of similarity between a pair of tweets as a weighted edge between them. The social graph is decomposed into various clusters using Markov Clustering technique, wherein each cluster corresponds to a particular event. The experiment is carried out using a corpus of 3100 tweets related to Israel-Gaza conflicts, Delhi assembly election, and union budget The experimental results are encouraging, showing the efficacy of the proposed social graph generation and event classification methods. Index Terms Social network analysis; Twitter data mining, Social graph generation; Markov clustering; Event classification. I. INTRODUCTION The recent advancements in Web technologies have attracted a large number of internet users to use online social networks like Facebook and Twitter for varied purposes, including events update and data sharing. As a result, social network applications are emerging as a powerful online tool for users to express and share their views with other users around the globe. Twitter is one such social media application with a large and rapidly growing user base. It has become the most popular micro-blogging social networking website in which users share their views in the form of very short message limited to 140 characters called tweets. Besides events update and data sharing, Twitter is also being used for many other purposes, including product marketing, political campaign, and market research. In addition, Twitter is also being used by the users to express their opinions and views about prominent issues of day-to-day life that may be social, political, or entertainment. Analyzing tweets to spot emerging issues and trends and to assess public opinion concerning topics and events is of considerable interest to various stakeholders, including government, companies, and security agencies. Corresponding author. [email protected], Tel However, performing such analysis is technically challenging due to unstructured nature of tweets and that the opinions of the users are typically expressed as informal communications and are buried under the pile of vast and largely irrelevant data generated by the millions of users and other online content producers. One of the ground challenges in analyzing Twitter data is their classification on the basis of the events under discussion, which is generally conceptualized using a set of significant terms embedded within the tweets. For example, Israel-Gaza conflict event can be conceptualized using the key terms israel, gaza, palestinian, hammas, peace, etc., whereas election, vote, party, etc. can be used to conceptualize the event Delhi assembly election. In this paper, we present a statistical approach to analyze twitter data using the concepts of social network generation and graph-based clustering. Tweets are tokenized using n- gram technique and analyzed using Latent Dirichlet Allocation (LDA) method to identify significant key terms, which are later on used for social network generation. Finally, Markov Clustering method is applied on the generated social network to identify dense regions (aka communities or cliques), each one representing the set of tweets related to a particular event. The rest of the paper is organized as follows. Section II presents a brief review of the twitter data analysis techniques. Sections III presents the proposed data mining technique for tweets classification and analysis. Section IV presents the experimental setup and results. Finally, section V concludes the paper with future directions of work. II. RELATED WORK Twitter has recently evolved as a popular micro-blogging website and consequently a number of methods are proposed by different researchers to analyze twitter data for varied purposes. Chung and Mustafara [1] examined the predictive power of social media (especially, the Twitter) using sentiment analysis methods and identified conflicting results in the domain of US political elections held in Cheong and Lee [2] studied the detection of interesting patterns using Self- Organizing Map (SOM) related to 2009 Iran election issue and the iphone OS 3.0 software launch. Akcora and Ferhatosmanoglu [3] identified the breakpoints in public opinion to extract major news about the events effectively and developed an application where users can view the important news stories and find the related articles on the Web. Bollen et al. [4]
2 proposed a method to make precise and useful predictions for stock market. In [5], Thelwall assessed whether popular events are typically associated with increase in sentiment strength. Tracking influence is another important task related to twitter data analysis. Influential users play an important role in the society, and influence tracking may be useful for a number of applications ranging from election to marketing. In [6], the authors analyzed the role of structural features like indegrees, retweets, and mentions for influence tracking and investigated the dynamic nature of user influence across topic and time. Sentiment analysis and opinion mining is another important field in twitter data mining. Pak and Paroubek [7] proposed a method based on the linguistic analysis of tweets for sentiment analysis on twitter data. Their system is able to determine the positive, negative, and neutral sentiments of a given tweet. In [8], the authors proposed an algorithm to classify tweets as positive or negative. They studied a number of classifiers based on n-gram and Parts-Of-Speech (POS) tag features and reported that multinominal naive Bayes unigram using mutual information outperforms the other approaches. Another area of research related to twitter data mining is event identification. In [9], the authors proposed a general framework for event identification in social media documents. They used similarity metric learning approaches to produce high quality clustering results. They reported that similarity metric learning techniques yield better performance than traditional approaches that considers text-based similarity. Sakaki et al. [10] proposed a method to identify real-time events. They proposed an algorithm to identify target events in real-time and considered tweets-related features like keywords, number of words and their context for detecting the target events. The main focus of their study is to identify earthquake event. In [11], the authors developed a system to classify tweets into real-world event tweets and non-event tweets. To this end, they have considered temporal, social, topical, and Twitter-centric features for classification of tweets into event and non-event class. In contrast to classifying tweets into only two classes, we consider tweets analysis as a clustering problem in which tweets are classified into various classes based on how many events are described by them. III. PROPOSED METHOD In this section, we present the functional details of our proposed tweets mining approach, which aims to classify tweets based their relatedness with various events. Figure 1 presents the work-flow of the proposed method and highlights the functioning details of the various working modules. Tweets crawling aims to retrieve tweets from the server and store them on local machine for analysis. Tweets pre-processing and tokenization process aims to extract tweets contents, filter out unwanted constituents like embedded emoticons and URLs, and tokenize them into 1-grams for further processing. Feature extraction and social network generation identifies significant key terms from the tweets using Latent Dirichlet Allocation (LDA) method and use them to model the tweets as a social network. Finally, Markov clustering is applied on the generated social network to crystallize it into various clusters, each one representing a particular event. Further details about these functions are presented in the following sub-sections. Clustered Tweets Graph Clustering Tweets Crawling Social Network Tweets Feature Extraction and Social Network Generation Tweets Pre-Processing and Tokenization Fig. 1: Work-flow of the proposed twitter data mining technique A. Tweets Crawling We have developed a crawler using Twitter Application Program Interface (API) to retrieve tweets from the server and store them on local machine for further processing. In addition to tweets, the crawler also retrieves various users and tweets related structural features and stores them in a structured format. The structural features can be clubbed with the contents to design a better tweets analysis system, which is one of our future directions of work. B. Tweets Pre-processing and Tokenization Tweets pre-processing aims to filter out unwanted constituents like special characters, emoticons, URLs etc. associated with each tweets. Thereafter, n-gram technique with the value of n as 1 is applied to tokenize tweets into bag-of-words. C. Feature Extraction and Social Network Generation In this phase, significant key terms are identified as tweets features to represent them into vector-space model, which is the first step for the generation of social graph. For significant key terms identification process, each token of a tweet is considered as a candidate term provided it is neither a stopword nor containing any special characters. Once the list of candidate terms for each tweet is identified, a term-tweet matrix A of order m n is generated, where m is the number of candidate terms and n is the number of tweets. The row of the matrix A represents a term vector and that a column represents a tweet vector. The (i, j) th element of matrix A, a ij, is determined as the weight of the term t i in j th tweet. The weight of a term t i in j th tweet, ω(t i,j ), is calculated using equations 1 and 2 where, tf(t i,j ) is the number of times t i occurs in j th tweet. D is the total number of tweets, and {d j : t i d j } is the number of tweets containing t i. The matrix A is normalized in such a way that the length of the tweet vectors becomes 1. ω(t i,j )=tf (t i,j ) idf (t i ) (1)
3 D idf (t i )=log {d j : t i d j } + 1 (2) Since the term-tweet matrix is generally sparse matrix and the dimension of the term as well as tweet vectors are large, Singular Value Decomposition (SVD) is applied to map the feature set into a low-dimensional space. This increases the efficiency of the proposed method both in terms of memory and computing time requirements. For a given m n matrix with m n, the SVD decomposes it into an m n orthogonal matrix U, ann n diagonal matrix S, and an n n orthogonal matrix V such that A = USV T. In this decomposition, U represents the term matrix and V represents the tweet matrix. Each row of matrix V represents a tweet vector whose dimension is reduced from m to n in the new feature space. To assign numeric scores to the candidate terms, we use Latent Dirichlet Allocation (LDA), which is a generative probabilistic model in which documents are represented as random mixtures over latent topics characterized by a distribution over words [12]. For LDA execution, we create a data file using the clusters of the tweets. In this file, the first line contains an integer value k representing the number of clusters (number of documents for LDA). Following this, there are k paragraphs, one for each cluster, containing the list of terms obtained from the tweets belonging to the corresponding cluster. We have used JGibbLDA 1 to execute LDA to generate Θ and Φ matrices. We have set the Dirichlet hyper parameters α and β as 0.1 and 0.5, respectively at the time of LDA execution. The Φ matrix contains the term-topic distributions, i.e., p(term t topic t ). Each row in this matrix is a topic and each column is a candidate term in the document. The Θ matrix contains the topic-cluster distributions, i.e., p(topic t cluster c ). Each row in this matrix is a cluster (document) and each column is a topic. We use Φ and Θ matrices to assign a numeric score to each term using equations 3 and 4 in which s[l] is the size (number of terms) of the l th cluster, n is the number of topics (we have taken n = 100), and k is the number of clusters that is treated as number of documents in this case. After calculating the score of each term, we arrange them in decreasing order of their scores and consider top-n terms as significant key terms. Further details related to the identification of key terms (aka key phrases) can be found in one of our previous works [13]. score(t i )=max n j =1 {Φ j,i ω j } (3) ω j = k Θ l,j s[l] (4) l=1 The top n key terms are used to represent each tweet as an n-dimensional feature vector. The i th element of the feature vector of a tweet is set to 1 if the i th key term is present in the tweet, otherwise it is set to 0. Thereafter, the social network is generated as a weighted graph in which tweets 1 represent nodes and similarity value between a pair of tweets is considered as a weighted edge between them, provided it is greater than 0. Cosine similarity is used to calculate similarity between a pair of tweets. Cosine similarity is one of the most popular similarity measures to calculate the similarity between two n dimensional vectors, measuring the cosine of the angle between them. The cosine similarity of two n-dimensional vectors a and b can be calculated using equation 5. Cosine(a, b) = n i=1 a i b i n i=1 (a i) 2 n i=1 (b i) 2 (5) D. Graph Clustering Once the social network is generated for the complete set of tweets, Markov CLustering (MCL) algorithm is applied to crytallize in into various clusters, each one representing a particular event. The MCL transforms the given graph into directed graph with several weakly connected components, each one resulting into a separate cluster. The MCL is an iterative method that interleaves matrix expansion and inflation steps [14]. Matrix expansion corresponds to taking successive powers of the transition matrix, while matrix inflation makes the higher probability transition and reduces the lower probability transition. It should be noted that MCL does not require the number of clusters k parameter for clustering, rather it requires an inflation parameter r > 1. A small value of r results in small number of clusters of larger size, whereas a high value of r generates large number of clusters of smaller size. IV. EXPERIMENTAL SETUP AND RESULTS In this section, we present our experimental setup and evaluation results. For experiment, we have crawled 3100 tweets related to three different events Israel-Gaza conflict, Delhi assembly election, and union budget 2015 using Twitter s API. Out of these 3100 tweets, 1500 are related to Israel- Gaza conflict, 900 are related to Delhi assembly election, and remaining 700 are related to union budget Further statistics about these data sets is given in Table I. We have applied the key term extraction process discussed in the previous section and top-100 key terms, as shown in Table II, are considered for vector-space modelling of the tweets. These key terms are used to generate the feature vectors of each tweet. The feature vector of a tweet is a 100-dimensional binary vector in which the values are either 1 or 0, depending on the presence or absence of the respective key term in the tweet. Finally, social network is clustered using Markov CLustering (MCL) algorithm, for which the value of inflation parameter r is determined empirically. MCL is applied on the social network with different values of r, ranging from 1.2 to 5.2, and finally 1.5 is considered as the optimal one, as shown in Figure 3, for further experimentation. The clustered tweets at r =1.5 is shown in Figure 2, in which each cluster corresponds to a particular event. It can be seen in this figure that besides three bigger clusters there are some isolated nodes
4 TABLE I: Statistics of the Twitter data sets Data set category Tweets-related statistics User-related statistics Number of Avg. no. of Avg. no. of Avg. no. of Avg. no. of Avg. no. of Avg. no. of tweets hashtags URLs mentions followers friends tweets Israel-Gaza Conflict Delhi Assembly Election Union Budget Total count TABLE II: Significant key terms and their LDA scores Key term Score Key term Score Key term Score Key term Score palestine reasons results budget gaza hammas corporate prayforgaza israel free responsibility attack delhi stop uphold injured aap people syria rights budget introspect chief narendramodi kejriwal conflict victory finance union civilians sarkar abppensioen israeli occupation pray divest bjp india jaitley dead hamas war soldiers banks bedi human terrorists financing palestinian don support industry gazaunderattack polls military meets kiran arvindkejriwal minister freedom unionbudget jews mahmoud innocent arvind peace media party killed freepalestine highlights protest loss kill killing latest dilli rockets abbas govt fatwa hospital voted secretary modi world elections death blame illegal aapsweep netanyahu children kiski tax superbudget election president arun rail that do not belong to any of them. On analysis we found that the tweets corresponding to these nodes are generally not directly related to the events under consideration and they can be considered as outliers. For evaluation of the proposed method, we have considered F α=0.5 (or F P at α =0.5) and F B cubed (or F B ), where F P is defined as the harmonic mean of purity and inverse purity, and F B is defined as the harmonic mean of B-cubed precision and B-cubed recall, which are defined in the following paragraphs. Purity and Inverse Purity: If C i is a cluster (containing the i th tweet) generated by the system, L j is the actual cluster (containing the j th tweet), and n is the total number of tweets, then purity and inverse purity are defined using equations 6 and 7, respectively. purity = i C i n maxp recision(c i,l j ) (6) inversep urity = L i i n maxp recision(l i,c j ) (7) B-Cubed Precision and B-Cubed Recall: If C i is a cluster (containing the i th tweet) generated by the system, L i is the actual cluster (containing the i th tweet) then precision and recall of the i th tweet are defined using equations 8 and 9, respectively. Thereafter, the mean of all individual precisions and recalls is taken as the values of B-cubed precision and B-cubed recall, respectively. precision(i) = C i L i C i recall(i) = C i L i (9) L i For evaluation purpose, we have labeled each Israel-Gaza conflict related tweet by G, each Delhi assembly election related tweet by DE, and each tweet related to union budget 2015 by UB. We have also developed a Java application to calculate the values of the above-mentioned evaluation metrics in an automatic manner. Evaluation results for different values of the inflation parameter, r, are shown in Table III. Figure 3 is a visual representation of the results shown in Table III. It can be observed from this figure that the values of both F P and F B parameters is highest for r =1.5. V. CONCLUSION AND FUTURE WORK In this paper, we have presented a Twitter data mining technique for events classification and analysis. LDA is used to identify significant key terms for tweets representation using (8)
5 TABLE III: Evaluation results of the proposed method for different inflation parameter (r) values r No. of connected No. of No. of Purity Inverse Purity F P Average B- Average B- F B com- ponents nodes isolated nodes cubed precision cubed recall crystallize the social network into various clusters, each one representing a particular event. Since Twitter API provides various structural features, development of a hybrid approach to analyze twitter data using structural and content-related features could be a promising area of future research. ACKNOWLEDGMENT This work was funded by the National Plan for Science, Technology and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, Kingdom of Saudi Arabia under the Award Number 12-INF Value of F P / F B Fig. 2: Clustered tweets using MCL for r = FP FB Inflation Parameter (r) Fig. 3: Visualization of F P and F B measures for different inflation parameter (r) values the vector-space model. We have also proposed a social network generation method, which models tweets as a weighted graph in which the weight of an edge represents the topical similarity of the tweets. Finally, Markov clustering is used to REFERENCES [1] J. Chung and E. Mustafaraj, Can collective sentiment expressed on twitter predict political elections? in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011, pp [2] M. Cheong and V. Lee, A study on detecting patterns in twitter intratopic user and message clustering, in Proceedings of the th International Conference on Pattern Recognition, 2010, pp [3] C. G. Akcora, M. A. Bayir, M. Demirbas, and H. Ferhatosmanoglu, Identifying breakpoints in public opinion, in Proceedings of the First Workshop on Social Media Analytics, 2010, pp [4] J. Bollen, H. Mao, and X.-J. Zeng, Twitter mood predicts the stock market, Journal of Computational Science, vol. 2, no. 1, pp. 1 8, [5] M. Thelwall, K. Buckley, and G. Paltoglou, Sentiment in twitter events, Journal of the American Society for Information Science and Technology, vol. 62, no. 2, pp , [6] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi, Measuring user influence in twitter: The million follower fallacy, in Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010, pp [7] A. Pak and P. Paroubek, Twitter as a corpus for sentiment analysis and opinion mining, in Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 10), 2010, pp [8] A. Go, L. Huang, and R. Bhayani, Twitter sentiment analysis, Stanford University, Stanford, California, USA, CS224N - Final Project Report, [9] H. Becker, M. Naaman, and L. Gravano, Learning similarity metrics for event identification in social media, in Proceedings of the third ACM international conference on Web search and data mining, 2010, pp [10] T. Sakaki, M. Okazaki, and Y. Matsuo, Earthquake shakes twitter users: Real-time event detection by social sensors, in Proceedings of the 19th international conference on World wide web, 2010, pp [11] H. Becker, M. Naaman, and L. Gravano, Beyond trending topics: Real-world event identification on twitter, in Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, 2011, pp [12] D. M. Blei, A. Y. Ng, and M. I. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research, vol. 3, no. 4-5, pp , [13] M. Abulaish, Jahiruddin, and L. Dey, Deep text mining for automatic keyphrase extraction from text documents, Journal of Intelligent Systems, vol. 20, no. 4, pp , [14] S. van Dongen, Graph clustering by flow simulation, Ph.D. Thesis, University of Utrecht, Utrecht, Netherlands, 2000.
Tweets Miner for Stock Market Analysis
Tweets Miner for Stock Market Analysis Bohdan Pavlyshenko Electronics department, Ivan Franko Lviv National University,Ukraine, Drahomanov Str. 50, Lviv, 79005, Ukraine, e-mail: [email protected]
Sentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
Using Twitter as a source of information for stock market prediction
Using Twitter as a source of information for stock market prediction Ramon Xuriguera ([email protected]) Joint work with Marta Arias and Argimiro Arratia ERCIM 2011, 17-19 Dec. 2011, University of
Sentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies
Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts
CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques
CSE 598 Project Report: Comparison of Sentiment Aggregation Techniques Chris MacLellan [email protected] May 3, 2012 Abstract Different methods for aggregating twitter sentiment data are proposed and three
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015
Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015
Semantic Sentiment Analysis of Twitter
Semantic Sentiment Analysis of Twitter Hassan Saif, Yulan He & Harith Alani Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom The 11 th International Semantic Web Conference
Twitter Analytics: Architecture, Tools and Analysis
Twitter Analytics: Architecture, Tools and Analysis Rohan D.W Perera CERDEC Ft Monmouth, NJ 07703-5113 S. Anand, K. P. Subbalakshmi and R. Chandramouli Department of ECE, Stevens Institute Of Technology
Analysis of Tweets for Prediction of Indian Stock Markets
Analysis of Tweets for Prediction of Indian Stock Markets Phillip Tichaona Sumbureru Department of Computer Science and Engineering, JNTU College of Engineering Hyderabad, Kukatpally, Hyderabad-500 085,
W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015
W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet
CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Sentiment Analysis Tool using Machine Learning Algorithms
Sentiment Analysis Tool using Machine Learning Algorithms I.Hemalatha 1, Dr. G. P Saradhi Varma 2, Dr. A.Govardhan 3 1 Research Scholar JNT University Kakinada, Kakinada, A.P., INDIA 2 Professor & Head,
Multilanguage sentiment-analysis of Twitter data on the example of Swiss politicians
Multilanguage sentiment-analysis of Twitter data on the example of Swiss politicians Lucas Brönnimann University of Applied Science Northwestern Switzerland, CH-5210 Windisch, Switzerland Email: [email protected]
CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis
CS 229, Autumn 2011 Modeling the Stock Market Using Twitter Sentiment Analysis Team members: Daniel Debbini, Philippe Estin, Maxime Goutagny Supervisor: Mihai Surdeanu (with John Bauer) 1 Introduction
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
Exploring Big Data in Social Networks
Exploring Big Data in Social Networks [email protected] ([email protected]) INWEB National Science and Technology Institute for Web Federal University of Minas Gerais - UFMG May 2013 Some thoughts about
1 o Semestre 2007/2008
Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING
dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING ABSTRACT In most CRM (Customer Relationship Management) systems, information on
Monitoring and Analyzing Customer Feedback Through Social Media Platforms for Identifying and Remedying Customer Problems
Monitoring and Analyzing Customer Feedback Through Social Media Platforms for Identifying and Remedying Customer Problems Sumit Bhatia, Jingxuan Li, Wei Peng, and Tong Sun Xerox Research Centre, Webster,
IT services for analyses of various data samples
IT services for analyses of various data samples Ján Paralič, František Babič, Martin Sarnovský, Peter Butka, Cecília Havrilová, Miroslava Muchová, Michal Puheim, Martin Mikula, Gabriel Tutoky Technical
Spatio-Temporal Patterns of Passengers Interests at London Tube Stations
Spatio-Temporal Patterns of Passengers Interests at London Tube Stations Juntao Lai *1, Tao Cheng 1, Guy Lansley 2 1 SpaceTimeLab for Big Data Analytics, Department of Civil, Environmental &Geomatic Engineering,
Microblog Sentiment Analysis with Emoticon Space Model
Microblog Sentiment Analysis with Emoticon Space Model Fei Jiang, Yiqun Liu, Huanbo Luan, Min Zhang, and Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory
Topic models for Sentiment analysis: A Literature Survey
Topic models for Sentiment analysis: A Literature Survey Nikhilkumar Jadhav 123050033 June 26, 2014 In this report, we present the work done so far in the field of sentiment analysis using topic models.
Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques.
Twitter Sentiment Analysis of Movie Reviews using Machine Learning Techniques. Akshay Amolik, Niketan Jivane, Mahavir Bhandari, Dr.M.Venkatesan School of Computer Science and Engineering, VIT University,
Twitter Stock Bot. John Matthew Fong The University of Texas at Austin [email protected]
Twitter Stock Bot John Matthew Fong The University of Texas at Austin [email protected] Hassaan Markhiani The University of Texas at Austin [email protected] Abstract The stock market is influenced
Latent Semantic Indexing with Selective Query Expansion Abstract Introduction
Latent Semantic Indexing with Selective Query Expansion Andy Garron April Kontostathis Department of Mathematics and Computer Science Ursinus College Collegeville PA 19426 Abstract This article describes
Sentiment Analysis and Time Series with Twitter Introduction
Sentiment Analysis and Time Series with Twitter Mike Thelwall, School of Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, UK. E-mail: [email protected]. Tel: +44 1902
Twitter sentiment vs. Stock price!
Twitter sentiment vs. Stock price! Background! On April 24 th 2013, the Twitter account belonging to Associated Press was hacked. Fake posts about the Whitehouse being bombed and the President being injured
An Introduction to Data Mining
An Introduction to Intel Beijing [email protected] January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
Term extraction for user profiling: evaluation by the user
Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,
Domain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
Sentiment analysis: towards a tool for analysing real-time students feedback
Sentiment analysis: towards a tool for analysing real-time students feedback Nabeela Altrabsheh Email: [email protected] Mihaela Cocea Email: [email protected] Sanaz Fallahkhair Email:
Data Mining in Web Search Engine Optimization and User Assisted Rank Results
Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management
A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks
A Systemic Artificial Intelligence (AI) Approach to Difficult Text Analytics Tasks Text Analytics World, Boston, 2013 Lars Hard, CTO Agenda Difficult text analytics tasks Feature extraction Bio-inspired
Sentiment Analysis on Big Data
SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social
Text Opinion Mining to Analyze News for Stock Market Prediction
Int. J. Advance. Soft Comput. Appl., Vol. 6, No. 1, March 2014 ISSN 2074-8523; Copyright SCRG Publication, 2014 Text Opinion Mining to Analyze News for Stock Market Prediction Yoosin Kim 1, Seung Ryul
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Sentiment Analysis and Topic Classification: Case study over Spanish tweets
Sentiment Analysis and Topic Classification: Case study over Spanish tweets Fernando Batista, Ricardo Ribeiro Laboratório de Sistemas de Língua Falada, INESC- ID Lisboa R. Alves Redol, 9, 1000-029 Lisboa,
A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS
A GENERAL TAXONOMY FOR VISUALIZATION OF PREDICTIVE SOCIAL MEDIA ANALYTICS Stacey Franklin Jones, D.Sc. ProTech Global Solutions Annapolis, MD Abstract The use of Social Media as a resource to characterize
The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
Initial Report. Predicting association football match outcomes using social media and existing knowledge.
Initial Report Predicting association football match outcomes using social media and existing knowledge. Student Number: C1148334 Author: Kiran Smith Supervisor: Dr. Steven Schockaert Module Title: One
Using Text and Data Mining Techniques to extract Stock Market Sentiment from Live News Streams
2012 International Conference on Computer Technology and Science (ICCTS 2012) IPCSIT vol. XX (2012) (2012) IACSIT Press, Singapore Using Text and Data Mining Techniques to extract Stock Market Sentiment
End-to-End Sentiment Analysis of Twitter Data
End-to-End Sentiment Analysis of Twitter Data Apoor v Agarwal 1 Jasneet Singh Sabharwal 2 (1) Columbia University, NY, U.S.A. (2) Guru Gobind Singh Indraprastha University, New Delhi, India [email protected],
Keywords social media, internet, data, sentiment analysis, opinion mining, business
Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Real time Extraction
A Comparative Study on Sentiment Classification and Ranking on Product Reviews
A Comparative Study on Sentiment Classification and Ranking on Product Reviews C.EMELDA Research Scholar, PG and Research Department of Computer Science, Nehru Memorial College, Putthanampatti, Bharathidasan
Statistical Feature Selection Techniques for Arabic Text Categorization
Statistical Feature Selection Techniques for Arabic Text Categorization Rehab M. Duwairi Department of Computer Information Systems Jordan University of Science and Technology Irbid 22110 Jordan Tel. +962-2-7201000
Search Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
Active Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
Forecasting stock markets with Twitter
Forecasting stock markets with Twitter Argimiro Arratia [email protected] Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,
Emoticon Smoothed Language Models for Twitter Sentiment Analysis
Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Emoticon Smoothed Language Models for Twitter Sentiment Analysis Kun-Lin Liu, Wu-Jun Li, Minyi Guo Shanghai Key Laboratory of
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
Using Social Media for Continuous Monitoring and Mining of Consumer Behaviour
Using Social Media for Continuous Monitoring and Mining of Consumer Behaviour Michail Salampasis 1, Giorgos Paltoglou 2, Anastasia Giahanou 1 1 Department of Informatics, Alexander Technological Educational
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
Semi-Supervised Learning for Blog Classification
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,
Mammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
Can Twitter provide enough information for predicting the stock market?
Can Twitter provide enough information for predicting the stock market? Maria Dolores Priego Porcuna Introduction Nowadays a huge percentage of financial companies are investing a lot of money on Social
Open Access Research on Application of Neural Network in Computer Network Security Evaluation. Shujuan Jin *
Send Orders for Reprints to [email protected] 766 The Open Electrical & Electronic Engineering Journal, 2014, 8, 766-771 Open Access Research on Application of Neural Network in Computer Network
Combining Social Data and Semantic Content Analysis for L Aquila Social Urban Network
I-CiTies 2015 2015 CINI Annual Workshop on ICT for Smart Cities and Communities Palermo (Italy) - October 29-30, 2015 Combining Social Data and Semantic Content Analysis for L Aquila Social Urban Network
Search Engines. Stephen Shaw <[email protected]> 18th of February, 2014. Netsoc
Search Engines Stephen Shaw Netsoc 18th of February, 2014 Me M.Sc. Artificial Intelligence, University of Edinburgh Would recommend B.A. (Mod.) Computer Science, Linguistics, French,
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
Research Article 2015. International Journal of Emerging Research in Management &Technology ISSN: 2278-9359 (Volume-4, Issue-4) Abstract-
International Journal of Emerging Research in Management &Technology Research Article April 2015 Enterprising Social Network Using Google Analytics- A Review Nethravathi B S, H Venugopal, M Siddappa Dept.
A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis
A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis Yusuf Yaslan and Zehra Cataltepe Istanbul Technical University, Computer Engineering Department, Maslak 34469 Istanbul, Turkey
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
Using social media for continuous monitoring and mining of consumer behaviour
Int. J. Electronic Business, Vol. 11, No. 1, 2014 85 Using social media for continuous monitoring and mining of consumer behaviour Michail Salampasis* Department of Informatics, Alexander Technological
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
Mining a Corpus of Job Ads
Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department
A Study of Web Log Analysis Using Clustering Techniques
A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept
Role of Social Networking in Marketing using Data Mining
Role of Social Networking in Marketing using Data Mining Mrs. Saroj Junghare Astt. Professor, Department of Computer Science and Application St. Aloysius College, Jabalpur, Madhya Pradesh, India Abstract:
Folksonomies versus Automatic Keyword Extraction: An Empirical Study
Folksonomies versus Automatic Keyword Extraction: An Empirical Study Hend S. Al-Khalifa and Hugh C. Davis Learning Technology Research Group, ECS, University of Southampton, Southampton, SO17 1BJ, UK {hsak04r/hcd}@ecs.soton.ac.uk
Text Analytics. A business guide
Text Analytics A business guide February 2014 Contents 3 The Business Value of Text Analytics 4 What is Text Analytics? 6 Text Analytics Methods 8 Unstructured Meets Structured Data 9 Business Application
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Predicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
CENG 734 Advanced Topics in Bioinformatics
CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the
Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs
Interactive Recovery of Requirements Traceability Links Using User Feedback and Configuration Management Logs Ryosuke Tsuchiya 1, Hironori Washizaki 1, Yoshiaki Fukazawa 1, Keishi Oshima 2, and Ryota Mibe
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1
A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1 Yannis Stavrakas Vassilis Plachouras IMIS / RC ATHENA Athens, Greece {yannis, vplachouras}@imis.athena-innovation.gr Abstract.
Effective Mentor Suggestion System for Collaborative Learning
Effective Mentor Suggestion System for Collaborative Learning Advait Raut 1 U pasana G 2 Ramakrishna Bairi 3 Ganesh Ramakrishnan 2 (1) IBM, Bangalore, India, 560045 (2) IITB, Mumbai, India, 400076 (3)
Data Mining & Data Stream Mining Open Source Tools
Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.
A Statistical Text Mining Method for Patent Analysis
A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, [email protected] Abstract Most text data from diverse document databases are unsuitable for analytical
Robust Sentiment Detection on Twitter from Biased and Noisy Data
Robust Sentiment Detection on Twitter from Biased and Noisy Data Luciano Barbosa AT&T Labs - Research [email protected] Junlan Feng AT&T Labs - Research [email protected] Abstract In this
Recommending News Articles using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1
Paper 1886-2014 Recommending News s using Cosine Similarity Function Rajendra LVN 1, Qing Wang 2 and John Dilip Raj 1 1 GE Capital Retail Finance, 2 Warwick Business School ABSTRACT Predicting news articles
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets
Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: [email protected]
ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
