Data Mining Project Report Xiao Liu, Wenxiang Zheng October 2, 2014 1 Abstract This paper reports the stage of our team s term project through out the first five weeks of the semester. As literature review, this paper first introduces our motivations on the research topic about data mining in movie reviews. In this project, we will particularly focus on the reserach question: How to predict ratings from movie reviews by sentiment analysis. Review mining is a subtopic of Sentiment Analysis, which refers to the use of natural language processing (NLP). In the introduction section, we will also present the related work and methods that have been studied in the past, and we will also discuss the potential challenges to this research topic. This paper also provides some basic definitions and concepts that are related to such research topic. In addition, the paper also provides a description of the data set that will be used in our research project. In the related work section, the paper discusses two different major methods in sentiment analysis research: sentiment classifcations and aspectbased sentiment analysis. Finally, this paper concludes with a discussion about some of the problems that we need to consider when we take further actions in this research project. 2 Introduction The online marketplace has been booming in recent years as technologies and information systems become matured and accessible. The evolution of the online shopping technologies and systems have already impacted the whole business market and have shifted people s shopping habit, where all kind of products are available on the internet marketplace, such as Amazon, ebay, and BestBuy. More convincingly, those traditional retailer giants, such as Walmart and Target, have also developed their online shopping websites in order to compete for a share in this fastly changing market. In addition, as the derived products from E-commerce, the online reviwing systems, such as Rotten Tomatoes, IMDB, and Yelp, have also become important factors that affect people s preferences when doing online shopping. For example, people who want to buy a BBQ grill on Amazon will make comparisons on reviews of similar products before making a decision. However, with the ascending numbers of reviews available online for products, it is very difficult for people to find useful and meaningful reviews that can help making better decisions, since there are also many fake reviews exposed on review systems. In this case, it is necessary to provide a well-developed solution to make the review systems smarter so that the systems 1
can detect and filter the unuseful reviews. The review system is more like a shopping assistant to help customers make shopping decisions. In this paper, we are particularly focusing on data mining in movie reviews. Improving such a movie review system will benifit customers, movie producers, and movie sellers. For example, producers of the movie Hunter Game can improve the movie story for the next series based on the online reviews, which will help them analyze what people want to watch the most for the next movie. In addition, they can also analyze the data on the review system to come up with other ideas about what other movies they can produce. On the other side, however, the online movie sellers can effectively predict what kind of movies that they can stock up more to satisfy the market demand, based on the past movie reviews and the reviews for new movie trailers. Review mining is a subtopic of Sentiment Analysis, which refers to the use of natural language processing. By analyzing the movie reviews, especially the polarity of customers attitudes, we may predict the average rating or even the performance of a certain coming movie. Meanwhile, a summary of a certain movie is provided as a reference for other customers. Review mining can be applied to many datasets, in this paper we are interested in mining in the Amazon movie reviews. Potential challenge in this problem is that we need to find out the right research method and algorithm to categorize different factors involved in the data set. We also need to design the appropriate sentiment analysis in the research, since there will be a lot of text involved in the review. In addition, finding out the right measurements and standards to define the possitive and negative reviews, or useful and unuseful reviews, is another big challenge in this problem as we need to make predictions and recommendations based on movie reviews. 3 Definitions 3.1 Basic Definitions Sentiment Analysis: are computational studies of opinions, sentiments, subjectivity, evaluations, attitudes, appraisal, affects, views, and emotions. Sentiment Classification: are deployed both on the document level and the sentence level. As indicated by the term classification, it is basically a text classification problem that classifies a whole opinion document/sentence based on the overall sentiment of the opinion holder. Aspect-based Sentiment Analysis: are refered to determining the opinions and sentiments expressed on different features or aspects of entities. 3.2 Data Set Description We grab the data set from Stanford SNAP group and it is a data set of Amazon movie reviews crawled from the web. For each instance, it contains the information as shown in Table 1. The data span a period of more than 10 years, including all 8 million reviews up to October 2012. Reviews include product and user information, ratings, and a plaintext review. We also have reviews 2
from all other Amazon categories. There are in total 7911684 instances in this data set. The reviews are collected from 253059 users for 253059 movies and among these reviewers, 16341 of them have reviews more than 50 pieces. The reviews quality are pretty high as the median number of words per review is 101 [ML13]. Table 1: Dataset Name Example product/productid B00006HAXW review/userid A1RSDE90N6RSZF review/profilename Joseph M. Kotow review/helpfulness 9/9 review/score 5.0 review/time 1042502400 review/summary Pittsburgh - Home of the OLDIES review/text I have all of the doo... 4 Research Questions After observing the dataset together with some basic literature review, in this project, we are more interested in making predictions and analysing the common features based on this dataset. We have the following well-defined candidate question: How to predict ratings from reviews by sentimental analysis? In order to make the question more specific and the research method more feasible, we still need further discussions as well as stepping stone tests. 5 Related Works Sentiment Analysis and Opinion Mining are computational studies of opinions, sentiments, subjectivity, evaluations, attitudes, appraisal, affects, views, emotions, etc., expressed in text [PL08]. Since the sentiment analysis has been a hot topic over years, there are a number of publicatuions in this research area. Meanwhile, a number of opinion mining applications in the market, such as the opinion observe [LHC05] that conducts analysis on the cellphone reviews; Aspect-based opinion summary for both the Bing search engine and Google product search [BG+08]; tools like OpinionEQ 1 that integrates a few sentiment analysis functions; and live track of movies that predicts user ratings from the Twitter posts [TNK10]. In this report of literature review, we report the related works in two perspectives: (1) sentiment classification, and (2) aspect-based sentiment analysis. Before stepping into the first branch, we formalize the definitions of some related terms. Opinions are those words expressed one s feeling to an object. In one piece of opinion, there are opinion targets, features, sentimental positive 1 http://www.opinioneq.com. 3
or negative, opinion holder as well as the time [DMS00]. For example, in the sentence Alex bought a Cannon camera two weeks ago, and he loves it because the pictures are beautiful and high quality. (Target: Cannon camera; Features: picture quality; Sentimental pos or nag: positive love ; Opinion holder: Alex; Time: two weeks ago) Usually, we use the quinttuples to describe an opinion to make the unstructured data into structured data [Liu07]. 5.1 Sentiment Classification Sentiment classifications are deployed both on the document level and the sentence level. As indicated by the term classification, it is basically a text classification problem that classifies a whole opinion document/sentence, based on the overall sentiment of the opinion holder [Tur02; PLV02]. Obviously, for a classification problem, both the unsupervised and supervised learning are adopted. Unsupervised: Unsupervised methods derive a sentiment metric for text without training corpus, and it has been widely used, since the early time when this topic was first introduced. It is a fanscinating problem for researchers to study; however, the sentiment classification is hard to deploy in the real research and experiment as there are many potential challenges in this method. Turney [Tur02] predicates the sentiment orientation of a review by the average semantic orientation of the phrases in the review that contain adjectives or adverbs, which is denoted as the semantic oriented method. They use three steps in this unsuervised classification: POS tags, Sentiment orientation(so) estimation of the extracted phrases, and Average SO computing. Kim and Hovy [KH04] build three models to assign a sentiment category to a given sentence by combining the individual sentiments of sentiment- bearing words. Hiroshi [HTH04] use the technique of deep language analysis for machine translation to extract sentiment units in text documents. Devitt and Ahmad [DA07] explore a computable metric of positive or negative polarity in financial news text. Supervised: Supervised methods consider the sentiment analysis task as a classification task and use labeled corpus to train the classifier. In majority, three classification techniques are tried: Naive Bayes, Maximum entropy, and Support vector machine. A few features cater to the researchers are term frequency, POS tag, opinion words and phrases, negations, syntatic dependency, etc. Since the work of Pang et al [PLV02], various classification models and linguistic features have been proposed to improve the classification performance. Mullen and Collier [MC04]; Wilson et al. [WWH05]. Most recently, McDonald et al. [TM08] investigate a structured model for jointly classifying the sentiment of text at varying levels of granularity. Blitzer et al. [BDP07] investigate domain adaptation for sentiment classifiers, focusing on online reviews for different types of products. Andreevskaia and Bergler [AB07] present a new system consisting of the ensemble of a corpusbased classifier and a lexicon-based classifier with precision-based vote weighting. 4
This Sentiment Classification method has been well-studied and implemented in many research projects; however, the potential challenges to deploy such method are also obvious. After reviewing related literatures in this topic, the Sentiment Classification method has the following limitations: The Sentiment Classification work for only one object in the document or sentence This method cannot extract different opinions This method cannot correctly extract indirect/unobvious opinions this method does not work for comparison reviews For example, in the sentence We bought the car last month and the windshield wiper has fallen off. There are two targets mentioned in this sentence, car and windshield wiper, and the opinion identification is unobvious. The Sentiment Classification method cannot detect whehter an opinion towards the car or the windshield wiper in this sentence. 5.2 Aspect-based Sentiment Analysis Sentiment classification method at both the document and sentence levels is quite useful; however, it does not find out what people like or dislike. In this case, another branch of Sentiment Analysis called Aspect-based Sentiment Analysis emerges. This method extracts entities and aspects (target, feature, opinion, and time) from documents. To extract the entities, some methods are considered by researchers: Distributional similarity [JO11](which compares the surrounding text of candidates using cosine or PMI), PU learning [Liu+02](which learns from positive and unlabeled examples), and bayesian sets [HG05]. To extract the aspects, Liu first introduces a frequency-based method in 2004 [HL04] because he considers the reviews from different people are irrelevant. When aspects/features are discussed, the words used converge. Later on, various improved methods are applied based on the first one. Zhuang et al [ZJZ06] improve the recall due to loss of infrequent aspects by using opinion words to extract the aspects; Popescu and Etzioni [PNE05] improve the precision by removing the frequent noun phrases that may not be aspects using part-of its relationship; Qiu [Qiu+11] applies the double propagation (DP) approach, which uses dependency of opinions and aspects to extract both aspects and opinion words. 6 Discussion According to the data set that we obtain, the meta-data of the data set is wellstructured and intuitive so that we know the name of review commenter, movie name, time of the review, and the review itsefl from the database. After considering the advantages and disadvantages of the two sentiment analysis methods, we believe that the Aspect-based Sentiment Analysis method is the most suitable method is deploy in our research topic. This method can help us extracting different opinions in the review as reviewers usually have different opinions for 5
different parts of a movie. In addition, this method can help us identifying comparison opinions and unobvious/indirect opinions in the review. Although the Aspect-based Sentiment Analysis method seems to match the database that we use in the following research, there are several potential problems that we need to consider: Identify different comparative and implicit opinions Identify reviewer s emotions Measurement of the level of opinions that matches related ratings 7 Contribution Xiao contributed on research topic and data set selections, and she contributed to most of the related work part of this report. Wenxiang contributed to most of the writeup of this paper, and he also contributed on profreading and editing the paper. Citations [DMS00] [Liu+02] [PLV02] [Tur02] [HTH04] [HL04] [KH04] Robert Dale, Hermann Moisl, and Harold Somers. Handbook of natural language processing. CRC Press, 2000. Bing Liu et al. Partially supervised classification of text documents. In: ICML. Vol. 2. Citeseer. 2002, pp. 387 394. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing-volume 10. Association for Computational Linguistics. 2002, pp. 79 86. Peter D Turney. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics. 2002, pp. 417 424. Kanayama Hiroshi, Nasukawa Tetsuya, and Watanabe Hideo. Deeper sentiment analysis using machine translation technology. In: Proceedings of the 20th international conference on Computational Linguistics. Association for Computational Linguistics. 2004, p. 494. Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2004, pp. 168 177. Soo-Min Kim and Eduard Hovy. Determining the sentiment of opinions. In: Proceedings of the 20th international conference on Computational Linguistics. Association for Computational Linguistics. 2004, p. 1367. 6
[MC04] [HG05] [LHC05] [PNE05] [WWH05] [ZJZ06] [AB07] [BDP07] [DA07] Tony Mullen and Nigel Collier. Sentiment Analysis using Support Vector Machines with Diverse Information Sources. In: EMNLP. Vol. 4. 2004, pp. 412 418. Katherine A Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. In: Proceedings of the 22nd international conference on Machine learning. ACM. 2005, pp. 297 304. Bing Liu, Minqing Hu, and Junsheng Cheng. Opinion observer: analyzing and comparing opinions on the web. In: Proceedings of the 14th international conference on World Wide Web. ACM. 2005, pp. 342 351. Ana-Maria Popescu, Bao Nguyen, and Oren Etzioni. OPINE: Extracting product features and opinions from reviews. In: Proceedings of HLT/EMNLP on interactive demonstrations. Association for Computational Linguistics. 2005, pp. 32 33. Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In: Proceedings of the conference on human language technology and empirical methods in natural language processing. Association for Computational Linguistics. 2005, pp. 347 354. Li Zhuang, Feng Jing, and Xiao-Yan Zhu. Movie review mining and summarization. In: Proceedings of the 15th ACM international conference on Information and knowledge management. ACM. 2006, pp. 43 50. Alina Andreevskaia and Sabine Bergler. CLaC and CLaC-NB: Knowledge-based and corpus-based approaches to sentiment tagging. In: Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics. 2007, pp. 117 120. John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In: ACL. Vol. 7. Citeseer. 2007, pp. 440 447. Ann Devitt and Khurshid Ahmad. Sentiment polarity identification in financial news: A cohesion-based approach. In: ACL. Citeseer. 2007. [Liu07] Bing Liu. Web data mining. Springer, 2007. [BG+08] [PL08] [TM08] Sasha Blair-Goldensohn et al. Building a sentiment summarizer for local service reviews. In: WWW Workshop on NLP in the Information Explosion Era. 2008, p. 14. Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. In: Foundations and trends in information retrieval 2.1-2 (2008), pp. 1 135. Ivan Titov and Ryan T McDonald. A Joint Model of Text and Aspect Ratings for Sentiment Summarization. In: ACL. Vol. 8. Citeseer. 2008, pp. 308 316. 7
[TNK10] [JO11] [Qiu+11] [ML13] Tun Thura Thet, Jin-Cheon Na, and Christopher SG Khoo. Aspectbased sentiment analysis of movie reviews on discussion boards. In: Journal of Information Science (2010), p. 0165551510388123. Yohan Jo and Alice H Oh. Aspect and sentiment unification model for online review analysis. In: Proceedings of the fourth ACM international conference on Web search and data mining. ACM. 2011, pp. 815 824. Guang Qiu et al. Opinion word expansion and target extraction through double propagation. In: Computational linguistics 37.1 (2011), pp. 9 27. Julian John McAuley and Jure Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In: Proceedings of the 22nd international conference on World Wide Web. International World Wide Web Conferences Steering Committee. 2013, pp. 897 908. Articles only [PL08] [TNK10] [Qiu+11] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. In: Foundations and trends in information retrieval 2.1-2 (2008), pp. 1 135. Tun Thura Thet, Jin-Cheon Na, and Christopher SG Khoo. Aspectbased sentiment analysis of movie reviews on discussion boards. In: Journal of Information Science (2010), p. 0165551510388123. Guang Qiu et al. Opinion word expansion and target extraction through double propagation. In: Computational linguistics 37.1 (2011), pp. 9 27. Books only [DMS00] Robert Dale, Hermann Moisl, and Harold Somers. Handbook of natural language processing. CRC Press, 2000. [Liu07] Bing Liu. Web data mining. Springer, 2007. 8