A Hybrid Text Regression Model for Predicting Online Review Helpfulness


Thomas L. Ngo-Ye
School of Business, Dalton State College
tngoye@daltonstate.edu

Atish P. Sinha
Lubar School of Business, University of Wisconsin-Milwaukee
sinha@uwm.edu

Research-in-Progress

BI Congress 3: Driving Innovation through Big Data Analytics, Orlando, FL, December 2012
Research Track: DW/BI/Analytics from an application perspective

Abstract

Business intelligence and analytics are playing an increasingly prominent role in many organizations. User-generated content and online social media open up new opportunities for businesses that can exploit and innovate with this new source of Web 2.0 data. In this paper, we concentrate on one important application: predicting the helpfulness of online customer reviews. We frame it as a regression problem and apply text mining techniques. We propose a hybrid feature selection approach, which combines a filter with a wrapper, for a BI text regression problem. Based on online review data collected from Amazon.com, we demonstrate empirically that the hybrid approach produces the best prediction results among all the models examined. This study is the first to develop and validate a hybrid feature selection methodology for text regression in a BI context.

Keywords: online reviews, text regression, feature selection, filter, wrapper

Introduction

Organizations are increasingly aware of the need to leverage enterprise data through business intelligence (BI) and analytics. Traditional BI applications process structured data consolidated in data warehouses to generate reports, perform OLAP, and conduct predictive analytics (Watson and Wixom 2007). In the age of Web 2.0, online social media present new challenges and opportunities for organizations to make use of this new type of voluminous, high-velocity data. Knowledge derived from BI 2.0 applications can drive business innovation and benefit businesses with better decision-making, enhanced performance, improved company strategy, and sustained competitive advantage (Turban et al. 2010).

In this research, we focus on one important BI 2.0 application: investigating the helpfulness of online customer reviews. To predict online review helpfulness, we propose and test an innovative and effective framework, a hybrid feature selection approach for text regression.

Online Customer Reviews and Review Helpfulness

In recent years, online social media have become an integral part of contemporary life. Online customer reviews, blogs, Facebook updates and comments, and tweets dominate user-generated content on the Web. Such user-generated content is a gold mine for product manufacturers, service providers, platform integrators, and third parties. Businesses are interested in harvesting market intelligence to better understand their customers' perspectives, so that they can better position their products, improve service offerings, and enhance brand image and customer loyalty.

Among all the different types of user-generated content, online customer reviews have the most straightforward and focused impact on businesses. Potential consumers may base their purchase decisions partly on the informal online reviews posted by their peers. Online customer reviews serve an important function by addressing consumers' information-seeking needs. Both consumers and businesses are keen to learn useful information from this new resource.

For popular products on major review websites, there are often hundreds or even thousands of customer reviews. The sheer number of reviews poses a serious information overload problem. If the reviews are left unmanaged, users will either be overwhelmed or simply explore only the first few available ones. To facilitate users' information seeking, one commonly used criterion for presenting online reviews is to list them by posting date, with the most recent one first. However, the most recent reviews are not necessarily the most useful ones. Another frequently used criterion is based on the number of readers' votes on the question "Was this review helpful?" (Yes or No). Based on the number of readers' votes, the customer reviews are ranked on the helpfulness dimension. However, many reviews do not attract much attention because they are relatively new and have not had enough time to accumulate readers' votes (Kim et al. 2006). Therefore, it is desirable to develop a mechanism to automatically predict the helpfulness of relatively new reviews.

Review helpfulness offers crucial information about readers' aggregated judgment of a review's overall value. Businesses should be sensitive to review helpfulness. Ignoring helpful reviews and failing to identify them may put businesses at a disadvantage by missing a valuable learning opportunity. Previous studies have acknowledged the important practical business implications of studying review helpfulness (Liu 2010; Pang and Lee 2008).

In recent years, online customer review helpfulness has attracted some attention from researchers in different disciplines (Duan et al. 2010; Kim et al. 2006; Mudambi and Schuff 2010). The construct of review helpfulness is also sometimes framed as review usefulness (Ghose and Ipeirotis 2007; Zhang 2008), review quality (Liu et al. 2007; Pang and Lee 2008), or review utility (Liu 2010; Zhang and Varadarajan 2006). Online customer reviews are not equally helpful or useful. Unhelpful reviews can be screened out and excluded from review summaries (Liu et al. 2007).
Another potential use of review helpfulness is to treat it as a weight in calculating the overall valuation of a product, which is the weighted average of the polarity of each individual review (Zhang 2008; Zhang and Varadarajan 2006).

Consistent with the literature, we use the number of helpful votes from readers as the target variable to build a learning model of review helpfulness. The text mining models we construct in this study are text regression models. While review helpfulness has been framed as a text regression problem before, to the best of our knowledge our study is the first to apply a hybrid feature selection approach, with both a filter and a wrapper, to generate an optimal subset of features. We also show that our proposed hybrid model outperforms the filter only model reported in the literature.

Text Regression and Feature Selection

The Bag-of-Words (BOW) model has been widely used in the text mining field (Sebastiani 2002). In the BOW representation, a text is reduced to a set of unique words and their corresponding counts or weights in the document. A word's position in the text, its part of speech, and other high-level grammatical information are not included in the model. Although the BOW representation seems rudimentary, it is surprisingly resilient and useful in real-world applications. Moreover, the BOW model can be enhanced in various ways, such as removing stop-words and stemming. The BOW model has its theoretical roots in the information retrieval literature, and it can capture the basic content of a text to a certain extent (Turney and Pantel 2010). Compared with other, more sophisticated representations, the BOW model remains a viable choice for text mining. For the online review helpfulness regression problem, the features represent the weights of the words appearing in the reviews. The BOW model can be used to characterize the main topic, concept, and sentiment of review content.
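The BOW representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the tiny stop-word list, and the two sample reviews are assumptions made for illustration, and stemming is left out to keep the example free of external libraries. It builds one row per review, either with raw word counts or with 0/1 presence indicators.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard English list).
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "to", "was"}

def tokenize(text):
    """Lower-case the text, keep alphabetic runs only, and drop stop-words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def bag_of_words(reviews, binary=False):
    """Return (vocabulary, matrix): one row per review, one column per word."""
    counts = [Counter(tokenize(r)) for r in reviews]
    vocabulary = sorted(set().union(*counts))
    matrix = []
    for c in counts:
        row = [c[w] for w in vocabulary]             # raw word counts
        if binary:
            row = [1 if v > 0 else 0 for v in row]   # presence or absence only
        matrix.append(row)
    return vocabulary, matrix

# Hypothetical sample reviews.
reviews = ["Great book, great plot.", "The plot was thin."]
vocab, term_occ = bag_of_words(reviews)
_, bina_occ = bag_of_words(reviews, binary=True)
print(vocab)     # ['book', 'great', 'plot', 'thin']
print(term_occ)  # [[1, 2, 1, 0], [0, 0, 1, 1]]
print(bina_occ)  # [[1, 1, 1, 0], [0, 0, 1, 1]]
```

Word order and grammar are discarded, as the text notes: the two occurrences of "great" survive only as a count of 2, or as a 1 in the binary variant.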
The BOW model has been demonstrated to be useful for estimating the helpfulness of online reviews (Kim et al. 2006; Ngo-Ye and Sinha 2012). A common challenge in text mining is high dimensionality, because a document collection tends to have thousands of unique words. Such high dimensionality leads not only to high computation costs but also to overfitting, which is why dimension reduction can be very useful (Sebastiani 2002). Some important ideas on dimension reduction, such as ranking and projection-based methods, are summarized by Abbasi and Chen (2008).

Filter vs. Wrapper Approach

One useful way to classify feature selection techniques is by whether they employ the filter or the wrapper approach (Hall and Holmes 2003). In filter-based feature selection methods, some type of relevance measure, such as information gain, is applied to evaluate the importance of features. The relevance measures are independent of the learning algorithm. In wrapper-based methods, by contrast, the learning algorithm itself is used to evaluate the importance of the features. The rationale of the wrapper approach is that considering how the algorithm and the training set interact will help achieve the best possible performance on a particular training set with a particular algorithm (Kohavi and John 1997).

While filter-based methods have low computational complexity, wrapper-based methods have very high computational complexity. For large datasets with high dimensionality, the wrapper approach is too computationally expensive to be feasible (Hall and Holmes 2003). On the other hand, wrapper-based methods tend to perform better because the same learning algorithm is used for feature evaluation. Moreover, wrapper-based methods can automatically determine the best subset of features, while filter-based methods need heuristics to determine how many features to select (Chou et al. 2010).

Regressional ReliefF (RReliefF)

Ngo-Ye and Sinha (2012) have elaborated on why many traditional feature selection techniques, including those investigated in opinion classification (Abbasi et al. 2008), are not applicable to text regression. Among the feature selection techniques that are suitable for text regression, RReliefF has recently been shown to be a very competitive method for predicting online review helpfulness (Ngo-Ye and Sinha 2012). RReliefF is a relatively new extension of the original Relief, a feature ranking method. The principle behind the Relief family of algorithms is that the ideal feature is one that has different values for instances belonging to different classes and the same value for instances belonging to the same class (Robnik-Sikonja and Kononenko 2003).
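To make the Relief principle concrete for a numeric target, the following Python sketch computes RReliefF-style feature weights. It is a simplified illustration rather than the published algorithm or the authors' code: every instance is probed once instead of being sampled, the k nearest neighbours receive equal influence, and the function name and toy data are assumptions. A feature scores highly when differences in its value between neighbours go together with differences in the target.

```python
def rrelieff(X, y, k=3):
    """Simplified RReliefF weights (equal neighbour influence, no sampling)."""
    n, n_feat = len(X), len(X[0])
    kk = min(k, n - 1)
    y_range = (max(y) - min(y)) or 1.0
    f_range = [(max(r[a] for r in X) - min(r[a] for r in X)) or 1.0
               for a in range(n_feat)]

    def diff(a, i, j):
        # Range-normalised difference of feature a between instances i and j.
        return abs(X[i][a] - X[j][a]) / f_range[a]

    n_dc = 0.0                  # evidence of "different target value"
    n_da = [0.0] * n_feat       # evidence of "different feature value"
    n_dc_da = [0.0] * n_feat    # evidence of both at once
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sum(diff(a, i, j) for a in range(n_feat)))
        for j in others[:kk]:
            d_target = abs(y[i] - y[j]) / y_range
            n_dc += d_target / kk
            for a in range(n_feat):
                d_a = diff(a, i, j)
                n_da[a] += d_a / kk
                n_dc_da[a] += d_target * d_a / kk

    weights = []
    for a in range(n_feat):
        rest = n - n_dc
        when_target_differs = n_dc_da[a] / n_dc if n_dc > 0 else 0.0
        when_target_similar = (n_da[a] - n_dc_da[a]) / rest if rest > 0 else 0.0
        weights.append(when_target_differs - when_target_similar)
    return weights

# Toy data (an assumption): feature 0 tracks the target, feature 1 is constant.
y = [float(v) for v in range(10)]
X = [[v, 0.0] for v in y]
weights = rrelieff(X, y, k=3)
ranking = sorted(range(len(weights)), key=lambda a: weights[a], reverse=True)
print(ranking)  # [0, 1]
```

Sorting the weights in descending order gives the kind of feature ranking a filter needs; how many top-ranked features to keep is a separate choice.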
As an instance-based method, Relief is strongly contextual. Therefore, it can handle conditional dependencies between attributes. As a non-myopic attribute quality estimator, it can exploit local information to obtain a global view. Thus Relief performs robustly in domains with strong conditional dependencies between attributes. In the specific domain of online reviews, the features, that is, the review words, may have strong interactions and dependencies.

Proposed Hybrid Approach

Both the wrapper and filter approaches are well known in the feature selection community. However, the hybrid approach of combining a filter and a wrapper for classification problems (Chou et al. 2010) is relatively new. In the hybrid approach, the full feature set is first passed through a filter to be ranked. Then a proper subset of top-ranked features is selected and passed to the wrapper. The wrapper selects the best subset of features to build the final model. The hybrid methodology possesses the advantages of both the filter and wrapper approaches. The computation cost is moderate and the performance is better than that of the filter approach. Moreover, the optimal subset of features is determined automatically (Chou et al. 2010).

Although the hybrid approach has been applied to text classification, it has not been reported for text regression in the literature. In this study, we explore the efficacy of the hybrid approach when applied to text regression for predicting online review helpfulness. Our goal is to examine whether the hybrid model improves the prediction accuracy of review helpfulness over the filter only model. We expect the hybrid model to perform better than the filter only model, because the learning algorithm is wrapped into the feature evaluation and selection process. Therefore, the features chosen are better optimized for the specific algorithm to be used in the final learning stage. We next present the proposed conceptual framework in Figure 1.
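The two-stage idea (filter first, wrapper second) can be sketched in a few lines of Python. This is a hedged illustration under stated substitutions, not the configuration used in the study: absolute Pearson correlation stands in for RReliefF as the filter score, a small k-nearest-neighbour regressor stands in for SVR as the wrapped learner, and all function names and the synthetic data are hypothetical.

```python
import random

def pearson_abs(xs, ys):
    """Absolute Pearson correlation, used here as a learner-independent score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def filter_top(X, y, keep):
    """Filter stage: rank all features by score and keep the best `keep`."""
    scores = [pearson_abs([row[a] for row in X], y) for a in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)[:keep]

def knn_predict(train_X, train_y, x, feats, k=5):
    """Stand-in learner: average target of the k nearest training instances."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sum((train_X[i][a] - x[a]) ** 2 for a in feats))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_error(X, y, feats, folds=5):
    """Wrapper criterion: cross-validated mean absolute error of the learner."""
    total = 0.0
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        tX, ty = [X[i] for i in train], [y[i] for i in train]
        for i in range(f, len(X), folds):
            total += abs(knn_predict(tX, ty, X[i], feats) - y[i])
    return total / len(X)

def wrapper_forward(X, y, candidates):
    """Wrapper stage: greedy forward search, stopping when the error stops improving."""
    chosen, best = [], float("inf")
    while True:
        trials = [(cv_error(X, y, chosen + [a]), a) for a in candidates if a not in chosen]
        if not trials or min(trials)[0] >= best:
            return chosen
        best, pick = min(trials)
        chosen.append(pick)

# Synthetic data (an assumption): only features 0 and 3 drive the target.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(12)] for _ in range(120)]
y = [3 * row[0] - 2 * row[3] + rng.gauss(0, 0.1) for row in X]

shortlist = filter_top(X, y, keep=4)         # cheap ranking over all 12 features
selected = wrapper_forward(X, y, shortlist)  # costly search over the shortlist only
print(shortlist, selected)
```

The design point is the division of labour: the filter touches every feature once, while the expensive cross-validated search runs only over the shortlist, and the wrapper decides by itself how many of the shortlisted features to keep.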
In Figure 1, the models are introduced in sequence. First, we have the baseline ZeroR model. It always uses the average value of the target variable in the training dataset as the predicted value for the target variable in the test dataset (Witten et al. 2011). The ZeroR model captures prior knowledge of the target variable, represented by its average value in the training dataset. To expand on the ZeroR model, we next consider BOW-based models. Compared with the rudimentary ZeroR model, elements of the review content are now captured and modeled. In the first conceptual type of BOW-based model, we retain all the review words and create the BOW Full Model. We next apply the RReliefF dimension reduction method to produce the filter only model. We keep the 300 top-ranked review words as features, consistent with the study by Ngo-Ye and Sinha (2012).

Finally, we apply the wrapper method with the text regression algorithm, support vector regression (SVR), to the filter only model, which contains the 300 top-ranked review words. The subset of features selected by the wrapper enters the final hybrid model.

[Figure 1 flow: Baseline ZeroR Model; BOW Full Model (use all review words as features for text regression); select features with filter (Regressional ReliefF); Filter Only Model; select features with wrapper (support vector regression); Hybrid (Filter + Wrapper) Model]

Figure 1. Hybrid Feature Selection Approach for Text Regression

Empirical Study and Results

For our empirical study, we use 2718 online book reviews collected from Amazon.com. We develop text regression models for predicting review helpfulness and empirically evaluate their performance. Each observation in the datasets is a unique review.

Preprocessing of Review Text for BOW-based Datasets

To generate datasets for the BOWFull model, we go through the following process. First, we tokenize the collected review text by removing all non-alphabetic characters. Next, we apply the standard English stop-words filter to eliminate common stop-words, which do not carry substantial meaning. Then we cast all the terms to lower case. Finally, we employ the popular Porter Stemmer algorithm to reduce terms to their basic form. The stemmed words that appear in our review text collection are used as independent variables for the BOWFull model.

We use two types of index weighting schemes for BOW-based datasets. In the binary occurrence (BinaOccu) scheme, a word that appears in a review text is coded as 1, and a word that is absent from the review text is coded as 0. In the term occurrence (TermOccu) scheme, the raw word count in a document is used as the term weight. In Table 1, we present the models, instantiated datasets, and the number of variables.

Table 1. Instantiated Models and Number of Variables (including target class)

Index Weighting   Conceptual Model            Dataset                      Number of Variables
BinaOccu          BOWFull                     A2718BinaOccu                7543
BinaOccu          Filter Only                 A2718BinaOccuRel
BinaOccu          Hybrid (Filter + Wrapper)   A2718BinaOccuRel300WRPGRF    98
TermOccu          BOWFull                     A2718TermOccu                7543
TermOccu          Filter Only                 A2718TermOccuRel
TermOccu          Hybrid (Filter + Wrapper)   A2718TermOccuRel300WRPGRF    83

Five Regression Performance Measures

To gauge the regression performance of different models, we consider five measures (see Table 2). Four of them are error-based measures, where a smaller realized value implies better performance. The additional measure is the correlation coefficient, for which a larger realized value indicates better performance.

Table 2. Five Regression Performance Measures

MAE    Mean absolute error
RMSE   Root mean squared error
RAE    Relative absolute error
RRSE   Root relative squared error
CORR   Correlation coefficient

Regression Algorithm and Experiment Configuration

For the BOW-based models, we adopt the state-of-the-art support vector regression (SVR) algorithm, LibSVM's epsilon-SVR (ε-SVR) with a linear kernel, for our text regression problem. For the wrapper method, we use SVR to evaluate the features based on five-fold cross validation. For the search method of the wrapper, we experiment with Greedy Stepwise.

To achieve a reliable estimate of model performance, we conduct 10 runs, and within each run we use 10-fold cross validation (Witten et al. 2011). We report the overall average across the 10 runs and 10 folds as the performance measure for a model. To compare two models, we match the corresponding 100 observations and run paired t-tests.

Results of Text Mining Experiments

To examine the efficacy of the proposed hybrid model, we compare it with the filter only model, the BOWFull model, and the ZeroR model. We set up the following comparison scenarios. First, we have two index weighting schemes (BinaOccu and TermOccu). Second, we have five regression performance measures. Therefore, we have a total of 2 × 5 = 10 unique scenarios. In Table 3, we report the regression performance of all models. We highlight the best one in each scenario in bold.
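The five measures in Table 2 can be written out as a short Python sketch. The definitions follow the usual textbook forms, in which RAE and RRSE divide the model's error by the error of always predicting the mean; as a simplifying assumption, the mean of the supplied actual values is used as that reference, and the function name and toy numbers are illustrative only.

```python
from math import sqrt

def regression_measures(actual, predicted):
    """Return MAE, RMSE, RAE, RRSE and CORR (assumes the actual values vary)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_base = sum(abs(a - mean_a) for a in actual)    # error of a mean predictor
    sq_base = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # A constant prediction has no variance; report 0.0 in that case (a convention).
    corr = cov / sqrt(sq_base * var_p) if sq_base > 0 and var_p > 0 else 0.0
    return {
        "MAE": abs_err / n,
        "RMSE": sqrt(sq_err / n),
        "RAE": abs_err / abs_base,
        "RRSE": sqrt(sq_err / sq_base),
        "CORR": corr,
    }

# Toy numbers (an assumption), plus a mean-only predictor in the spirit of ZeroR.
actual = [1.0, 2.0, 3.0, 4.0]
model = regression_measures(actual, [1.0, 2.0, 3.0, 5.0])
zero_r = regression_measures(actual, [sum(actual) / len(actual)] * len(actual))
print(model)
print(zero_r)
```

Under these definitions a mean-only predictor scores RAE = 1 and RRSE = 1, so values below 1 indicate an improvement over that kind of baseline, while MAE and RMSE stay in the units of the target variable.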
We also conduct pairwise comparisons between the hybrid model and the filter only model.

Table 3. Regression Performance of Hybrid, Filter, BOWFull, and ZeroR Models

Performance Measure   Index Weighting   ZeroR   BOWFull   Filter Only   Hybrid (Filter + Wrapper)
MAE                   BinaOccu                                          **
MAE                   TermOccu                                          **
RMSE                  BinaOccu                                          **
RMSE                  TermOccu                                          **
RAE                   BinaOccu                                          **
RAE                   TermOccu                                          **
RRSE                  BinaOccu                                          **
RRSE                  TermOccu                                          *
CORR                  BinaOccu                                          **
CORR                  TermOccu                                          *

From Table 3, we find that the hybrid models clearly outperform the filter only models, the BOWFull models, and the ZeroR models. We also conduct paired t-tests between the hybrid model and the filter only model. In all 10 scenarios, the differences are statistically significant (* denotes p < 0.05 and ** denotes p < 0.001). Such consistent results strongly indicate the influence of the wrapper. In other words, the hybrid model, which combines the filter and wrapper approaches, is better than the filter only model.

In summary, among all the models examined in this study, the hybrid model is the most accurate one for predicting review helpfulness. The results also show that both the BOWFull model and the filter only model outperform the ZeroR model. Moreover, the filter only model performs better than the BOWFull model. Taken together, the empirical findings suggest that the conceptual framework presented in Figure 1 is a useful methodology for improving review helpfulness prediction incrementally.

Next, we report the run times of the BOW-based models in Table 4.

Table 4. Run Times for Hybrid, Filter, and BOWFull Models (in seconds)

Index Weighting   BOWFull   Filter Only   Hybrid (Filter + Wrapper)
BinaOccu
TermOccu

Table 4 shows that the run time decreases dramatically from the BOWFull model to the filter only model, and again from the filter only model to the hybrid model. In real-world applications, after the features are determined by the feature selection process, the hybrid model has a strong computational advantage over the filter only model. The reason is that the wrapper process selects a much smaller feature set for the hybrid model (see Table 1 for the number of variables in the different BOW-based models). However, we need to point out that the improved accuracy and run time of the final hybrid model come with the additional computation cost of applying a wrapper method to select an optimal subset of features. But the computation cost of applying a wrapper is a one-time investment.
After a subset of features for a review domain is determined, it can be reused over and over again. The significantly higher runtime efficiency of the hybrid approach makes it a very attractive option.

Conclusion and Future Directions

One of the main objectives of our study, if realized, would potentially provide a major benefit to business websites within a BI 2.0 context. That objective is to facilitate the instant estimation of the helpfulness of new reviews, so that businesses can dynamically adjust the presentation of those reviews. Toward that end, the proposed hybrid approach provides an innovative and practical framework for BI 2.0 applications. The empirical results demonstrate the applicability of the hybrid approach to the domain of predicting online review helpfulness.

In this ongoing study, we are currently undertaking several experiments. In addition to the Greedy Stepwise search method, we are experimenting with Best First and Genetic Search. We employed SVR both as the learning algorithm and within the wrapper. Although SVR represents the state of the art, we are currently examining other competitive algorithms. Since the target variable, the number of helpful votes, is a positive integer and SVR assumes the dependent variable to be a real number, we are considering other algorithms such as Poisson regression, which may be more suitable for the problem of predicting an integer outcome variable. For the dimension reduction techniques, we focused on Regressional ReliefF for the filter-based approach. We are also examining other feature selection techniques, such as correlation-based feature selection, that have been reported to be effective. We empirically tested our proposed framework against book review data collected from Amazon.com. To make the findings more generalizable, we are evaluating other types of online reviews. We expect to have most of these results at the time of the BI Congress.
Our current work differs from that of Chou et al. (2010) in several ways. Their focus was on Internet abuse detection, whereas we focus on an important BI problem, that of predicting online review helpfulness. The major difference is that while they developed a hybrid text classification approach, we develop and test a hybrid text regression approach. We expand upon Ngo-Ye and Sinha's (2012) work by applying the hybrid approach to enhance their filter only model.

This ongoing study makes important contributions to the literature. To the best of our knowledge, this is the first study to develop a hybrid feature selection approach for predicting online review helpfulness. It is also the first study to develop and apply a hybrid approach in the context of text regression. We build on the work of Chou et al. (2010), who developed a hybrid text classification approach for detecting Internet abuse. The initial findings of our study indicate that the hybrid approach is also attractive for text regression problems. The proposed hybrid framework, which makes use of both the filter method and the wrapper method, is intuitively appealing and conceptually meaningful. The results from the empirical experiments we conducted demonstrate the viability and effectiveness of the proposed approach.

With many interesting and challenging research issues ahead, the field of online social media is fertile ground for research in the information systems discipline. Because our proposed framework is fairly general, it can be applied to different online social media studies. Given the increasing trend of exploiting Facebook and Twitter data for business intelligence, it would be interesting to conduct new empirical studies testing and extending our proposed framework in the context of tweets and Facebook posts and comments. The proposed framework could prove to be an effective and efficient method for businesses to harness market intelligence from online social media.

References

Abbasi, A., and Chen, H. 2008. "CyberGate: A System and Design for Text Analysis of Computer Mediated Communications," MIS Quarterly (32:4).
Abbasi, A., Chen, H., and Salem, A. 2008. "Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums," ACM Transactions on Information Systems (26:3), pp. 12:1-12:34.
Chou, C.-H., Sinha, A. P., and Zhao, H. 2010. "A Hybrid Attribute Selection Approach for Text Classification," Journal of the Association for Information Systems (11:9).
Duan, W., Cao, Q., and Gan, Q. 2010. "Investigating Determinants of Voting for the 'Helpfulness' of Online Consumer Reviews: A Text Mining Approach," in Proceedings of the Sixteenth Americas Conference on Information Systems, M. Santana, J. Luftman, A. Vinzé (eds.), Lima, Peru.
Ghose, A., and Ipeirotis, P. G. 2007. "Designing Novel Review Ranking Systems: Predicting Usefulness and Impact of Reviews," in Proceedings of the International Conference on Electronic Commerce (ICEC), Minneapolis, Minnesota.
Hall, M. A., and Holmes, G. 2003. "Benchmarking Attribute Selection Techniques for Discrete Class Data Mining," IEEE Transactions on Knowledge and Data Engineering (15:3).
Kim, S.-M., Pantel, P., Chklovski, T., and Pennacchiotti, M. 2006. "Automatically Assessing Review Helpfulness," in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia.
Kohavi, R., and John, G. H. 1997. "Wrappers for Feature Subset Selection," Artificial Intelligence (97:1-2).
Liu, B. 2010. "Sentiment Analysis and Subjectivity," in Handbook of Natural Language Processing, N. Indurkhya, and F. J. Damerau (eds.), Second Edition.
Liu, J., Cao, Y., Lin, C.-Y., Huang, Y., and Zhou, M. 2007. "Low-quality Product Review Detection in Opinion Summarization," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague.
Mudambi, S. M., and Schuff, D. 2010. "What Makes A Helpful Online Review? A Study Of Customer Reviews On Amazon.com," MIS Quarterly (34:1).
Ngo-Ye, T. L., and Sinha, A. P. 2012. "Analyzing Online Review Helpfulness Using a Regressional ReliefF-Enhanced Text Mining Method," ACM Transactions on Management Information Systems (3:2), pp. 10:1-10:20.
Pang, B., and Lee, L. 2008. "Opinion Mining and Sentiment Analysis," in Foundations and Trends in Information Retrieval, Vol. 2.
Robnik-Sikonja, M., and Kononenko, I. 2003. "Theoretical and Empirical Analysis of ReliefF and RReliefF," Machine Learning (53:1/2).
Sebastiani, F. 2002. "Machine Learning in Automated Text Categorization," ACM Computing Surveys (34:1).
Turban, E., Sharda, R., Delen, D., and King, D. 2010. Business Intelligence: A Managerial Approach, Second Edition, Upper Saddle River, NJ: Prentice Hall.
Turney, P. D., and Pantel, P. 2010. "From Frequency to Meaning: Vector Space Models of Semantics," Journal of Artificial Intelligence Research (37:January-April).
Watson, H. J., and Wixom, W. H. 2007. "The Current State of Business Intelligence," IEEE Computer (40:9).
Witten, I. H., Frank, E., and Hall, M. A. 2011. Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, Burlington, MA, USA: Morgan Kaufmann.
Zhang, Z. 2008. "Weighing Stars: Aggregating Online Product Reviews for Intelligent E-commerce Applications," IEEE Intelligent Systems (September/October).
Zhang, Z., and Varadarajan, B. 2006. "Utility Scoring of Product Reviews," in Proceedings of the ACM SIGIR Conference on Information and Knowledge Management (CIKM), Arlington, Virginia.


More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

A Brief Tutorial on Database Queries, Data Mining, and OLAP

A Brief Tutorial on Database Queries, Data Mining, and OLAP A Brief Tutorial on Database Queries, Data Mining, and OLAP Lutz Hamel Department of Computer Science and Statistics University of Rhode Island Tyler Hall Kingston, RI 02881 Tel: (401) 480-9499 Fax: (401)

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics

Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics Boullé Orange Labs avenue Pierre Marzin 3 Lannion, France marc.boulle@orange.com ABSTRACT We describe our submission

More information
