Statistical Feature Selection Techniques for Arabic Text Categorization
Rehab M. Duwairi
Department of Computer Information Systems
Jordan University of Science and Technology
Irbid, Jordan

ABSTRACT

This paper compares a few statistical feature selection techniques for Arabic text. Feature selection is especially important for text classification because, when dealing with text, the number of features (words) increases rapidly. This makes the document-term matrix sparse, which affects the performance of classifiers in terms of both accuracy and processing time. The number of features is usually reduced during preprocessing by eliminating stopwords or by stemming. This paper adds one step to preprocessing: reducing the number of features based on their merit as determined by statistical measures such as correlation, Chi square, deviation, and uncertainty. The dataset used in this study consists of 4,000 Arabic documents that fall into two categories. Naïve Bayes with 3-fold cross validation is used in this study. Results show that weight by correlation gave very high accuracy in very little time. Still, the highest accuracy was achieved in the base case, when no filters were used, but it took the classifier twice the time required with weight by correlation.

Categories and Subject Descriptors
H.2.8. Data mining. I.2.7 Natural language processing. I.5.2 Classifier design and evaluation, Feature evaluation and selection.

General Terms
Algorithms, Management, Performance, Design, Experimentation.

Keywords
Text categorization, text classification, feature selection, Chi square, weight by correlation

1. INTRODUCTION

Text classification or categorization is the process of assigning a document to one or more labels or categories [12]. Assigning exactly one label is known as single-label classification; assigning more than one is addressed in the literature as multi-label classification [15].
Text categorization is a supervised process since the set of labels is known a priori. It has many applications such as spam filtering, topic identification, and tailored news or advertisement delivery. Rapid growth in the number of features is common in text categorization [12], since every unique token in every document appears as a dimension in the document-term matrix. Most text categorization techniques reduce this large number of features by eliminating stopwords or by stemming. This is effective to a certain extent, but the remaining number of features is still huge. More sophisticated feature selection techniques have been reported in the literature [1, 2, 4, 9, 16, 17, 18].

One needs to distinguish between feature selection and feature reduction. In feature selection, a subset of the features in the document-term matrix is selected based on their merit in the classification process [14]. Feature selection techniques fall into filter and wrapper approaches. Filters rely on a measure (most often a statistical one) to calculate the merit of a given feature. Only features whose weight is greater than a threshold are kept for classification. Filters are independent of the classifier and are therefore inexpensive to use. Examples of filters include weight by correlation, chi square, information gain [8], mutual information [10], Gini index [13] and many more. Wrappers [1], by comparison, evaluate features by training a classifier; that is, the classifier's accuracy is calculated over several subsets of features determined by greedy algorithms. Wrappers yield better results but they are expensive and may suffer from overfitting.

Feature reduction, on the other hand, transforms the original set of features into new features by applying some transformation function. The new feature set contains far fewer features or dimensions than the original set [2].
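The filter idea described above can be illustrated with a minimal Python sketch. It is not the procedure used in this paper: the function names, the toy English tokens, and the weights passed to the filter are all hypothetical and only show how every unique token becomes a column of the document-term matrix and how a threshold on per-feature weights drops columns without involving the classifier.

```python
from collections import Counter


def build_document_term_matrix(docs):
    """Build a count-based document-term matrix from tokenized documents.

    Every unique token across all documents becomes one column.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            row[index[tok]] = count
        matrix.append(row)
    return vocab, matrix


def filter_features(vocab, matrix, weights, threshold):
    """Keep only the columns whose weight is strictly greater than the threshold.

    The weights are assumed to come from some statistical measure computed
    beforehand; the classifier is never consulted.
    """
    keep = [j for j, w in enumerate(weights) if w > threshold]
    kept_vocab = [vocab[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_vocab, kept_matrix
```

With two toy documents, `[["oil", "price", "oil"], ["vote", "price"]]`, the matrix has three columns (`oil`, `price`, `vote`); passing illustrative weights `[0.9, 0.1, 0.8]` and a threshold of `0.5` keeps only `oil` and `vote`.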
Common feature reduction techniques include term clustering [3], latent semantic indexing [7, 9], and principal component analysis [2, 19]. These are also expensive to use but they yield good results over a reduced-dimension feature space.

In this paper, we investigate the effects of the weight by correlation, chi square, deviation and uncertainty filters on classification accuracy. The Naïve Bayes classifier was used in this study. The dataset that we experimented with consists of 4000 Arabic documents collected from the Internet and labeled manually. The results indicate that weight by correlation achieves high classification accuracy and the minimum time to complete the classification. Working on the set of all features, after removing stopwords and stemming, gave the maximum accuracy but at the expense of processing time.

This paper is organized as follows. Section 1 has introduced this work. Section 2 explains the necessary background and related work. Section 3 explains the research methodology followed in this work. Section 4 discusses the experiments and the obtained results. Finally, Section 5 draws the conclusions of this work.

The Fourth International Conference on Information and Communication Systems (ICICS 2013), Irbid, Jordan, April 23-25, 2013.

2. BACKGROUND AND RELATED WORK

Feature selection and feature reduction are well-studied problems in the literature. Several sophisticated algorithms have been introduced and used [1, 2, 4, 9, 16, 17, 18]. The bulk of these works addressed structured data. Text classification or categorization aims at assigning a label to a document. Since text is unstructured, words or tokens which appear in the collection of documents are used as features or attributes. The importance of a word or term is commonly determined by using TFIDF (term frequency inverse document frequency). Usually, the number of features or terms is huge, which can reduce accuracy through overfitting and increases processing time. Thus feature selection is commonly used with text classification.

Feature selection/reduction for Arabic text has been investigated in the literature. Duwairi, Al-Refai and Khasawneh [3] investigated three heuristics for feature reduction in text classification. Specifically, stemming, light stemming and word clustering were used. The dataset consists of documents. Naïve Bayes and KNN were used for classification. Their results show that light stemming outperforms the other two methods.
Mesleh [10] compared the effects of 17 feature selection techniques on the accuracy of SVM when applied to a corpus that consists of 7842 documents written in Arabic. He concluded that Chi-square leads to the best precision and recall values. Mesleh did not apply stemming to the text but he applied normalization of certain letters: hamza ( ), alef mad ( ), alef with hamza on top (>), alef with hamza on bottom (<), and hamza on ya (}) are normalized to alef without hamza (A). Stopwords were also removed from the text documents. We are using Buckwalter's transliteration system for writing Arabic letters.

Harrag and El-Qawasmeh [6] have used SVD (Singular Value Decomposition) with Neural Networks to classify documents. Their work falls under feature reduction rather than selection.

Harrag, El-Qawasmeh & Pichappan [5] applied a decision tree for classifying Arabic documents on two rather small datasets collected manually. The accuracy of the classifier was calculated once without feature selection and once with feature selection. The feature selection measures that were used are term frequency (TF), document frequency (DF) and their ratio (TF/DF). The best filter is determined for each dataset and then the decision tree was used to classify documents. Their results demonstrate that feature selection improves accuracy. The features that they used are well known and well used in the literature.

3. RESEARCH METHODOLOGY

3.1 Data Set

The dataset consists of 4,000 Arabic documents which fall into two categories. The first category contains documents that are related to economics and is therefore labeled Economics. This category consists of 2,000 documents. The second category contains documents that are related to political activities and is therefore labeled Politics. The Politics category consists of 2,000 documents as well. This dataset was created manually by collecting documents from the Internet. It was labeled manually as well.
3.2 Preprocessing

Rapidminer [11] was used in the preprocessing and classification tasks. Rapidminer is an open source data mining tool that supports many data mining tasks and is readily usable for text. Preprocessing, in this work, consists of the following steps carried out in order on every document in the dataset:

1. Replace Tokens: every non-Arabic character is replaced by a whitespace character. The regular expression [a-zA-Z] was replaced by \s.
2. Tokenize: the words or tokens of the document are extracted. A token means any sequence of Arabic letters.
3. Filter Stopwords: all stopwords are removed from the documents. We used Rapidminer's built-in Arabic stopword list.
4. Stemming: we utilized Rapidminer's built-in light stemming algorithm. Stemming in Arabic falls into two classes: light stemming and root extraction (often referred to as stemming). In light stemming, a word is not reduced to its three-letter root but common prefixes and suffixes are removed. In root extraction, a word is reduced to its three-letter root. A very small percentage of Arabic words have quadriliteral or pentaliteral roots.

3.3 Classification Model

Naïve Bayes is used in this research. After preprocessing, the document-term matrix is generated and the importance of a term is expressed using TF-IDF. The accuracy of a classification task is based on Precision and Recall. The precision and recall for every class are calculated. The overall performance of the classifier is determined by averaging precision and recall over the two classes. Because the dataset is relatively small, we utilized 3-fold cross validation.

3.4 Feature Selection Model
In order to assess the impact of feature selection on text classification, we first ran the classification task on the dataset without any feature selection apart from the preprocessing that was explained in Section 3.2. We calculated and stored the precision and recall. After that, the classification task was run several more times. At each run, we used a specific feature selection method, and accuracy, in terms of precision and recall, was calculated. The idea is to assess the suitability of feature selection for text classification. The following paragraphs explain each of the filters that we have used in this study:

Weight by Correlation
This measure calculates the weight (degree of association) of attributes with respect to the class label. The weight belongs to [-1, 1] and could be normalized to [0, 1]. From a classification perspective, the higher the correlation between a feature and a class label, the better the discrimination power of the feature (i.e. the feature is an important one for classification).

Weight by Chi Square
This filter computes the lack of independence between a feature and a class. It determines whether the distribution of observed frequencies differs from the expected distribution. The higher the Chi Square value, the better the feature.

Weight by Deviation
Computes weights based on the (normalized) standard deviation of the attributes.

Weight by Uncertainty
This operator calculates the importance of a feature by measuring the symmetrical uncertainty with respect to the class.

4. EXPERIMENTATION AND RESULT ANALYSIS

The first series of experiments aimed at comparing the performance of Weight by Correlation, Chi Square, Deviation and Uncertainty. Figure 1 shows the average precision and recall for these filters compared against the base case. We used the Naïve Bayes classifier with 3-fold cross-validation. The dataset was preprocessed by removing stopwords and light-stemming the remaining words.
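Two of the filters described in Section 3.4 can be sketched in Python with NumPy. This is only an illustration under stated assumptions: the paper relied on Rapidminer's operators, whose exact formulas are not given here, so the function names below are hypothetical, the correlation weight is taken as the absolute Pearson correlation with a 0/1 class label, and the chi-square weight is computed on term presence versus class.

```python
import numpy as np


def weight_by_correlation(X, y):
    """Absolute Pearson correlation of each feature column with a 0/1 label.

    Assumption: taking the absolute value maps the [-1, 1] range to [0, 1].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; give them weight 0 instead of NaN.
    safe = np.where(denom > 0, denom, 1.0)
    corr = np.where(denom > 0, (Xc * yc[:, None]).sum(axis=0) / safe, 0.0)
    return np.abs(corr)


def weight_by_chi_square(X, y):
    """Chi-square statistic of term presence versus class, one value per feature."""
    present = np.asarray(X) > 0
    y = np.asarray(y)
    n = len(y)
    weights = np.zeros(present.shape[1])
    for j in range(present.shape[1]):
        stat = 0.0
        for term_value in (True, False):
            term_mask = present[:, j] == term_value
            for cls in np.unique(y):
                cls_mask = y == cls
                observed = np.sum(term_mask & cls_mask)
                expected = np.sum(term_mask) * np.sum(cls_mask) / n
                if expected > 0:  # skip empty rows of the contingency table
                    stat += (observed - expected) ** 2 / expected
        weights[j] = stat
    return weights


def select_top_k(weights, k):
    """Return the column indices of the k highest-weighted features, in ascending order."""
    return np.sort(np.argsort(weights)[::-1][:k])
```

On a toy matrix where the first term occurs only in one class, the second only in the other, and the third in every document, both measures rank the first two terms above the third, whose weight is zero. The selected columns would then be passed to the classifier in place of the full document-term matrix.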
We used a personal computer with 16GB RAM and 3.4GHz i Intel Core CPU running Windows 7 Enterprise 64 bit. As Figure 1 shows, the base case gave really good results and the subsequent filters did not enhance the classifier's accuracy. Feature selection using Correlation and Uncertainty gave much better accuracy when compared with Chi Square and Deviation.

Figure 1: Precision and Recall for Various Filters with Stemming

Table 1 shows the time, in seconds, that was necessary to complete the classification task under different filters. As can be seen from the table, weight by correlation took very little time when compared with the chi-square statistic.

Table 1: Classification Time in Seconds

Filter Name    Time in Seconds
Base Case      583
Correlation    266
Chi Square     1192
Deviation      505
Uncertainty    1399

Figure 2 plots precision, recall and the time required to complete the classification task. The figure indicates that using weight by correlation gave slightly lower accuracy but needed only 266 seconds to complete. By comparison, the base case with no filters gave the highest accuracy but at the expense of time. Feature selection is an optimization problem that takes into consideration several factors, including the classifier's accuracy and the time necessary to complete the classification task.

Figure 2: Precision, Recall and Time against Several Filters

5. CONCLUSIONS
The work reported in this paper has compared the effects of statistical feature selection techniques on the accuracy of the Naïve Bayes classifier. Specifically, the effects of weight by correlation, chi square, deviation and uncertainty were investigated. The dataset consists of 4000 Arabic documents which are uniformly distributed over two classes. The results of the experiments demonstrate that weight by correlation gives the highest accuracy when compared with the other filters. They also show that correlation and deviation take much less time to compute. A slight decrease in accuracy is sometimes acceptable in exchange for faster computation. Recall that feature selection runs a classification task with a subset of the features, aiming to avoid overfitting (and thus increase performance) and to reduce time.

REFERENCES

[1] Ahmadizar F., Hemmati M. & Rabanimotlagh A., Two-Stage Text Feature Selection Method Using Fuzzy Entropy Measure and Ant Colony Optimization. Proceedings of the 20th Iranian Conference on Electrical Engineering (ICEE), Pages , May 15-17, Tehran, Iran,
[2] Anghelescu A. & Muchnik I., Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization), Proceedings of the IEEE KIMAS Conference, Boston, MA, USA, Oct. 1-3, pages ,
[3] Duwairi R., Al-Refai M. & Khasawneh N., Feature Reduction Techniques for Arabic Text Categorization. Journal of the American Society for Information Science and Technology (JASIST), Volume 60, Issue 11, pages ,
[4] Figueiredo F. et al, Word Co-occurrence Features for Text Classification, Journal of Information Systems, Vol. 36, Pages ,
[5] Harrag F., El-Qawasmeh E. & Pichappan P., Improving Arabic Text Categorization using Decision Trees, Proceedings of the First International Conference on Networked Digital Technologies, July 28-31, Pages ,
[6] Harrag F. & El-Qawasmeh E., Improving Arabic Text Categorization Using Neural Network with SVD, Journal of Digital Information Management, Vol. 8, No. 2, pages ,
[7] Harrag F., El-Qawasmeh E. & Al-Salman A., A Comparative Study of Statistical Feature Reduction Methods for Arabic Text Categorization, Networked Digital Technologies (NDT 2010), Part II, Communications in Computer and Information Science, Vol. 88, Pages ,
[8] Lee C. & Lee G. G., Information Gain and Divergence-based Feature Selection for Machine Learning-based Text Categorization, Journal of Information Processing and Management, Volume 42, Pages ,
[9] Meng J. & Lin H., A Two-Stage Feature Selection Method for Text Categorization, Proceedings of the 7th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages , August 10-12, Yantai, China,
[10] Mesleh A., Feature Sub-set Selection Metrics for Arabic Text Classification, Journal of Pattern Recognition Letters, Vol. 32, pages ,
[11] Rapidminer, last accessed 16-Jan
[12] Sebastiani F., Text Categorization. In A. Zanasi (Ed.), Text Mining and Its Applications to Intelligence, CRM and Knowledge Management. Pages , Southampton, UK: WIT Press,
[13] Sing S. R., Murthy H. A. & Gonsalves T. A., Feature Selection for Text Classification based on Gini Coefficient of Inequality, Proceedings of the 4th JMLR International Workshop on Feature Selection in Data Mining, pages 76-85, June 21, Hyderabad, India,
[14] Seo Y., Ankolekar A. & Sycara K., Feature Selection for Extracting Semantically Rich Words. Technical Report CMU-RI-TR-04-18, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA,
[15] Tsoumakas G. & Katakis I., Multi-Label Classification: An Overview. Journal of Data Warehousing & Mining, Vol. 3, No. 3, Pages 1-13,
[16] Uysal A. K. & Gunal S., A Novel Probabilistic Feature Selection Method for Text Classification, Journal of Knowledge-Based Systems, Vol. 36, Pages ,
[17] Wang et al, Feature Selection with Maximum Information Metric in Text Categorization, Proceedings of the 1st International Conference on Information Science and Engineering (ICISE), pages , Dec 26-28, Nanjing, China,
[18] Wang S. et al, A Feature Selection Method based on Improved Fisher's Discriminant Ratio for Text Sentiment Classification, Journal of Expert Systems with Applications, Vol. 38, pages ,
[19] Yang J. et al, A New Feature Selection based on Comprehensive Measurement both in Inter-Category and Intra-Category for Text Categorization, Journal of Information Processing and Management, Vol. 48, Pages , 2012.
More informationData Mining Approach For Subscription-Fraud. Detection in Telecommunication Sector
Contemporary Engineering Sciences, Vol. 7, 2014, no. 11, 515-522 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.4431 Data Mining Approach For Subscription-Fraud Detection in Telecommunication
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationSURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2
International Journal of Computer Engineering and Applications, Volume IX, Issue I, January 15 SURVEY PAPER ON INTELLIGENT SYSTEM FOR TEXT AND IMAGE SPAM FILTERING Amol H. Malge 1, Dr. S. M. Chaware 2
More informationSentiment analysis on tweets in a financial domain
Sentiment analysis on tweets in a financial domain Jasmina Smailović 1,2, Miha Grčar 1, Martin Žnidaršič 1 1 Dept of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia 2 Jožef Stefan International
More informationSpam Filtering using Naïve Bayesian Classification
Spam Filtering using Naïve Bayesian Classification Presented by: Samer Younes Outline What is spam anyway? Some statistics Why is Spam a Problem Major Techniques for Classifying Spam Transport Level Filtering
More informationCLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationFinal Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
More informationFRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,
More informationData Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent
More informationProjektgruppe. Categorization of text documents via classification
Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction
More informationAuthor Gender Identification of English Novels
Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in
More informationThe Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
More informationOverview. Background. Data Mining Analytics for Business Intelligence and Decision Support
Mining Analytics for Business Intelligence and Decision Support Chid Apte, PhD Manager, Abstraction Research Group IBM TJ Watson Research Center apte@us.ibm.com http://www.research.ibm.com/dar Overview
More informationUniversité de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr
Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationResearch on Sentiment Classification of Chinese Micro Blog Based on
Research on Sentiment Classification of Chinese Micro Blog Based on Machine Learning School of Economics and Management, Shenyang Ligong University, Shenyang, 110159, China E-mail: 8e8@163.com Abstract
More informationLasso-based Spam Filtering with Chinese Emails
Journal of Computational Information Systems 8: 8 (2012) 3315 3322 Available at http://www.jofcis.com Lasso-based Spam Filtering with Chinese Emails Zunxiong LIU 1, Xianlong ZHANG 1,, Shujuan ZHENG 2 1
More informationSentiment analysis of Twitter microblogging posts. Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies
Sentiment analysis of Twitter microblogging posts Jasmina Smailović Jožef Stefan Institute Department of Knowledge Technologies Introduction Popularity of microblogging services Twitter microblogging posts
More informationDECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com
More informationHow To Create A Text Classification System For Spam Filtering
Term Discrimination Based Robust Text Classification with Application to Email Spam Filtering PhD Thesis Khurum Nazir Junejo 2004-03-0018 Advisor: Dr. Asim Karim Department of Computer Science Syed Babar
More informationIntroduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
More informationComparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
More informationAnalysis of Tweets for Prediction of Indian Stock Markets
Analysis of Tweets for Prediction of Indian Stock Markets Phillip Tichaona Sumbureru Department of Computer Science and Engineering, JNTU College of Engineering Hyderabad, Kukatpally, Hyderabad-500 085,
More informationData quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
More informationAnalysis Tools and Libraries for BigData
+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I
More informationHow To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
More informationBug Report, Feature Request, or Simply Praise? On Automatically Classifying App Reviews
Bug Report, Feature Request, or Simply Praise? On Automatically Classifying App Reviews Walid Maalej University of Hamburg Hamburg, Germany maalej@informatik.uni-hamburg.de Hadeer Nabil University of Hamburg
More informationPrediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
More informationNeural Networks for Sentiment Detection in Financial Text
Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.
More informationUsing News Articles to Predict Stock Price Movements
Using News Articles to Predict Stock Price Movements Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 9237 gyozo@cs.ucsd.edu 21, June 15,
More informationA Proposed Algorithm for Spam Filtering Emails by Hash Table Approach
International Research Journal of Applied and Basic Sciences 2013 Available online at www.irjabs.com ISSN 2251-838X / Vol, 4 (9): 2436-2441 Science Explorer Publications A Proposed Algorithm for Spam Filtering
More informationIntroduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007
Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007 Naïve Bayes Components ML vs. MAP Benefits Feature Preparation Filtering Decay Extended Examples
More informationA Two-Pass Statistical Approach for Automatic Personalized Spam Filtering
A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationA Survey on Product Aspect Ranking
A Survey on Product Aspect Ranking Charushila Patil 1, Prof. P. M. Chawan 2, Priyamvada Chauhan 3, Sonali Wankhede 4 M. Tech Student, Department of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra,
More informationBig Data Text Mining and Visualization. Anton Heijs
Copyright 2007 by Treparel Information Solutions BV. This report nor any part of it may be copied, circulated, quoted without prior written approval from Treparel7 Treparel Information Solutions BV Delftechpark
More informationHomework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class
Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationIdentifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECML-PKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
More informationIntroduction. A. Bellaachia Page: 1
Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.
More information