# Automatic Web Page Classification

Save this PDF as:

Size: px
Start display at page:

## Transcription

3 3 Dataset For evaluating the proposed methods and algorithms I used the WebKB [5] dataset. This dataset contains 8, 221 webpages from several universities, labeled with whether they are student, faculty, staff, department, course, project, or other pages. Figure 1 shows the distribution of documents on categories in this dataset. Figure 1: Distribution of documents on Categories in WebKB dataset 4 Classifiers Used For my experiments, I used four different classifiers. 4.1 k NN In k NN classifier, a document is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors. Different measures of similarity can be used for finding the k nearest neighbors of each document. One popular measure is the cosine similarity which is the cosine of the angle between document vectors: cos θ = d 1.d 2 d 1 d 2 Since documents are normalized, this would be reduced to the dot product of the vectors (d 1.d 2 ). 4.2 Multinomial Naive Bayes A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes theorem with strong (naive) independence assumptions. In this classifier, new examples are assigned to the class that is most likely to have generated the example. While the naive assumption is clearly false in most real world tasks, naive Bayes often performs classiffication very well. Recent approaches to text classiffication have used two different first order probabilistic models for classiffication, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian Network with no dependencies between

4 words and binary word features. Others use a multinomial model [4], that is, a uni-gram language model with integer word counts. It is shown that the Multinomial model usually performs better on larger vocabulary sizes. In this model, prior probability of each class is calculated based on the fraction of the documents in the training set that belong to this class: p(c) = Dc D. In addition, probability of each word given each class is also calculated based on the number of occurences of word w in documents of class c to the total number of words in documents of class c: p(w c) = n(w,c) n(c). But it is very common that some words have not be seen in training examples of a class. We don t want to assign zero probabilites to these words. Therefore an smoothing parameter, α, is used: p(w c) = where W is the size of the vocabulary. n(w, c) + α n(c) + W α Probability of document d which is a concatenation of words w 1,..., w m, being generated by class c can be calculated as p(c) p(w 1 c)... p(w m c). Since we only want to compare these probabilites and the probabilites would approach to zero for large values of m, we can compare their log values: c d = argmax{log p(c) + log p(w c)} c w d Note that α is a parameter between 0 and 1 which can be tuned by training on the training set. 4.3 SVM As an SVM classifer, I used SV M light [6] package. It is an implementation of the support vector machines with a fast optimization algorithm and has built in support for standard kernel functions. But the point is that an SVM classifier can be used in binary classification problems, while the web page classification problem is a multi-class classification problem. For building a k class SVM classifier, I combined k 1 versus others classifiers. Each classifier determines whether the document is in this class or the other remaining classes with a confident degree (Larger positive values indicate document belonging to this class, while larger negative values indicate document belonging to other classes). Then for each test document it would be classified based on the degree of confidence in these classifiers (Classifier with the largest positive value determines class of document). 4.4 Decision Tree Classifier: J48 As another classifier, I was interested in using a decision tree classifier. For this purpose, I used Weka [10] which is an open source collection of data mining software written in Java. I used J48 which is Weka s implementation of the popular C4.5 decision tree algorithm [7]. This algorithm is based on the concept of Information Entropy. C4.5 uses the fact that each attribute of the data can be used to make a decision that splits the data into smaller subsets. It examines the normalized Information Gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is the one used to make the decision. The algorithm then recurs on the smaller sublists.

5 k NN Text Text + Anchor Text + Anchor Text + Anchor + Style + Style + URL k = k = k = k = Multinomial Naive Bayes SVM J Table 2: Classification accuracies for different classifiers on different feature sets 5 Evaluations For parsing the Web pages and extracting text and styles form them, I used Cobra [12] which is a Java HTML Renderer & Parser. The reason for choosing this parser among the various available parsers was its support for HTML 4, Javascript, and CSS 2. This means that by using this parser, in addition to extracting the text of the page, I could also find the font and style (bold, italic, underline) of every chunk of text. For evaluating classifiers, we need to split the dataset into a training set and a test set. In my experiments, I randomly split the dataset into a training set covering %80 of documents and a test set covering %20 of documents. For more accurate results, I repeated this random selection of train and test sets several times and reported average results. Table 2 shows average of classification accuracies portion of correctly classified test documents of different classifiers on different feature sets. As can be seen in this table, SVM performs better than the other classifiers on all feature sets. In addition, most of the classifers perform better when more features are considered for documents. Figure 2 is a better visualization of results. Note that the α parameter of the multinomial naive bayes model is trained by splitting the training set into train eval sets. Figure 3 is a sample training of this parameter on a log scale which shows how different values of this parameter can affect classification accuracy. It is worth mentioning that these classifiers belong to four different categories of classifiers with very different properties. For example, k NN is an instance based classifier, meaning that the training instances are needed in the testing phase. J48 is the slowest in the training phase while it is fastest in the testing phase. When considering the sum of training time and testing time required, Multinomial is the fastest classifier and J48 is the slowest one. 6 Related Work In the literature, different authors have tried different classifiers to the task of web page classification. In [1], authors have used SVM for classifying web pages. In [2] authors have used PCA for feature selection and then applied Neural Networks for classification of pages. In [3] authors have proposed a summarization based approach for reducing noise in classifying web pages. In [11], authors have used separate features to train individual classifiers, whose outputs

6 Figure 2: Trends of classification accuracies of different classifiers are combined to give a final category for each page. 7 Conclusion In this project, I studied the problem of automatic Web page categorization. Using experiments performed on a standard dataset, I studied how adding several features can improve the classification accuracy. I used several different classifiers on the dataset and results show that all of these classifiers perform better in the hybrid approach where all of the text, anchors, style and URL features are present. In addition, results show that on this dataset, the SVM classifier outperforms other classifiers with a good margin on all of the feature sets. References [1] Sun A., Lim E., Ng W., (2002) Web Classification Using Support Vector Machines, in Proceedings of the fourth international workshop on Web information and data management, pp ACM Press. [2] Selamat A., Omatu S., (2004) Web page feature selection and classification using neural networks, Information Sciences-Informatics and Computer Science, Volume 158, Issue 1, pp [3] Shen. D., et al, (2004) Web page Classification through Summarization, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp , ACM Press. [4] McCallum A., Nigam K. (1998) A Comparison of Event Models for Naive Bayes Text Classiffication. [5] World Wide Knowledge Base Project, Available online at: webkb/ [6] SVM Light Support Vector Machines,

7 Accuracy α Figure 3: Training α for Multinomial Naive Bayes Model [7] Quinlan J. R., (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers. [8] Porter M. F., (1997) An Algorithm for suffix stripping, Readings in information retrieval, pp , Morgan Kaufmann Publishers Inc. [9] Ghani R., et al, (2001) Hypertext Categorization using Hyperlink patterns and Meta data, In Proc. of 18th International Conference on Machine Learning, pp , Morgan Kaufmann Publishers. [10] Weka 3: Data Mining with Open Source Machine Learning Software in Java, [11] Bennett P., Dumais S., Horvitz E., (2002) Probabilistic combination of text classifiers using reliability indicators: Models and results, In Proc. of SIGIR 02, pp [12] Cobra: Pure Java HTML Renderer & Parser,

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

### Web Document Clustering

Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

### Content-Based Recommendation

Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

### Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

### A Comparison of Event Models for Naive Bayes Text Classification

A Comparison of Event Models for Naive Bayes Text Classification Andrew McCallum mccallum@justresearch.com Just Research 4616 Henry Street Pittsburgh, PA 15213 Kamal Nigam knigam@cs.cmu.edu School of Computer

### Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 2 February, 2014 Page No. 3951-3961 Bagged Ensemble Classifiers for Sentiment Classification of Movie

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Bayesian Spam Filtering

Bayesian Spam Filtering Ahmed Obied Department of Computer Science University of Calgary amaobied@ucalgary.ca http://www.cpsc.ucalgary.ca/~amaobied Abstract. With the enormous amount of spam messages propagating

### A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

### Simple Language Models for Spam Detection

Simple Language Models for Spam Detection Egidio Terra Faculty of Informatics PUC/RS - Brazil Abstract For this year s Spam track we used classifiers based on language models. These models are used to

### Abstract. Find out if your mortgage rate is too high, NOW. Free Search

Statistics and The War on Spam David Madigan Rutgers University Abstract Text categorization algorithms assign texts to predefined categories. The study of such algorithms has a rich history dating back

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### MACHINE LEARNING. Introduction. Alessandro Moschitti

MACHINE LEARNING Introduction Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it Course Schedule Lectures Tuesday, 14:00-16:00

### Predicting Student Performance by Using Data Mining Methods for Classification

BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance

### International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

### Feature Selection with Decision Tree Criterion

Feature Selection with Decision Tree Criterion Krzysztof Grąbczewski and Norbert Jankowski Department of Computer Methods Nicolaus Copernicus University Toruń, Poland kgrabcze,norbert@phys.uni.torun.pl

### Representation of Electronic Mail Filtering Profiles: A User Study

Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 +1 949 824 5888 pazzani@ics.uci.edu

### Data mining knowledge representation

Data mining knowledge representation 1 What Defines a Data Mining Task? Task relevant data: where and how to retrieve the data to be used for mining Background knowledge: Concept hierarchies Interestingness

### Impact of Boolean factorization as preprocessing methods for classification of Boolean data

Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,

### TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

### Music Classification by Composer

Music Classification by Composer Janice Lan janlan@stanford.edu CS 229, Andrew Ng December 14, 2012 Armon Saied armons@stanford.edu Abstract Music classification by a computer has been an interesting subject

### Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Engineering the input and output Attribute selection Scheme independent, scheme

### CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

### Classification with Hybrid Generative/Discriminative Models

Classification with Hybrid Generative/Discriminative Models Rajat Raina, Yirong Shen, Andrew Y. Ng Computer Science Department Stanford University Stanford, CA 94305 Andrew McCallum Department of Computer

### CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

### An Introduction to Data Mining

An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

### A New Approach for Evaluation of Data Mining Techniques

181 A New Approach for Evaluation of Data Mining s Moawia Elfaki Yahia 1, Murtada El-mukashfi El-taher 2 1 College of Computer Science and IT King Faisal University Saudi Arabia, Alhasa 31982 2 Faculty

### AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM

AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo 71251911@mackenzie.br,nizam.omar@mackenzie.br

### Big Data Analytics and Optimization

Big Data Analytics and Optimization C e r t i f i c a t e P r o g r a m i n E n g i n e e r i n g E x c e l l e n c e C e r t i f i c a t e P r o g r a m s i n A c c e l e r a t e d E n g i n e e r i n

### T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

### Data Mining Techniques for Prognosis in Pancreatic Cancer

Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Investigation of Support Vector Machines for Email Classification

Investigation of Support Vector Machines for Email Classification by Andrew Farrugia Thesis Submitted by Andrew Farrugia in partial fulfillment of the Requirements for the Degree of Bachelor of Software

### Data Mining Part 5. Prediction

Data Mining Part 5. Prediction 5.1 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Classification vs. Numeric Prediction Prediction Process Data Preparation Comparing Prediction Methods References Classification

### Semi-Supervised Learning for Blog Classification

Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Semi-Supervised Learning for Blog Classification Daisuke Ikeda Department of Computational Intelligence and Systems Science,

### Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

### Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

### Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com

### A SURVEY OF TEXT CLASSIFICATION ALGORITHMS

Chapter 6 A SURVEY OF TEXT CLASSIFICATION ALGORITHMS Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com ChengXiang Zhai University of Illinois at Urbana-Champaign

### Customer Classification And Prediction Based On Data Mining Technique

Customer Classification And Prediction Based On Data Mining Technique Ms. Neethu Baby 1, Mrs. Priyanka L.T 2 1 M.E CSE, Sri Shakthi Institute of Engineering and Technology, Coimbatore 2 Assistant Professor

### An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

### Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

### Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

### Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status

Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour?

Seventh IEEE International Conference on Data Mining Can the Content of Public News be used to Forecast Abnormal Stock Market Behaviour? Calum Robertson Information Research Group Queensland University

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data

### DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Training Set Properties and Decision-Tree Taggers: A Closer Look

Training Set Properties and Decision-Tree Taggers: A Closer Look Brian R. Hirshman Department of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 hirshman@cs.cmu.edu Abstract This paper

### Face Recognition using SIFT Features

Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Projektgruppe. Categorization of text documents via classification

Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Mining Text Data: An Introduction

Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Dr. D. Y. Patil College of Engineering, Ambi,. University of Pune, M.S, India University of Pune, M.S, India

Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Effective Email

### Email Classification Using Data Reduction Method

Email Classification Using Data Reduction Method Rafiqul Islam and Yang Xiang, member IEEE School of Information Technology Deakin University, Burwood 3125, Victoria, Australia Abstract Classifying user

### C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

### A Systematic Cross-Comparison of Sequence Classifiers

A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel grurgrur@gmail.com, feldman@cs.biu.ac.il,

### Improving spam mail filtering using classification algorithms with discretization Filter

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) International Journal of Emerging Technologies in Computational

### Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

### Inner Classification of Clusters for Online News

Inner Classification of Clusters for Online News Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS

### Search and Information Retrieval

Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

### A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters

2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology A Personalized Spam Filtering Approach Utilizing Two Separately Trained Filters Wei-Lun Teng, Wei-Chung Teng

### Project 2: Term Clouds (HOF) Implementation Report. Members: Nicole Sparks (project leader), Charlie Greenbacker

CS-889 Spring 2011 Project 2: Term Clouds (HOF) Implementation Report Members: Nicole Sparks (project leader), Charlie Greenbacker Abstract: This report describes the methods used in our implementation

### Applying Machine Learning to Stock Market Trading Bryce Taylor

Applying Machine Learning to Stock Market Trading Bryce Taylor Abstract: In an effort to emulate human investors who read publicly available materials in order to make decisions about their investments,

### II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

### Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Ensemble Data Mining Methods

Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods

### Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

### Author Gender Identification of English Novels

Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

### Clustering on Large Numeric Data Sets Using Hierarchical Approach Birch

Global Journal of Computer Science and Technology Software & Data Engineering Volume 12 Issue 12 Version 1.0 Year 2012 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global

### Web Content Classification: A Survey

Web Content Classification: A Survey Prabhjot Kaur Dept. of Computer Science and Engg, Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India ABSTARCT: As the information contained within

### Tracking and Recognition in Sports Videos

Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey mustafa.teke@gmail.com b Department of Computer

### Machine Learning for NLP

Natural Language Processing SoSe 2015 Machine Learning for NLP Dr. Mariana Neves May 4th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Introduction Field of study that gives computers the ability

### Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

### Automated Content Analysis of Discussion Transcripts

Automated Content Analysis of Discussion Transcripts Vitomir Kovanović v.kovanovic@ed.ac.uk Dragan Gašević dgasevic@acm.org School of Informatics, University of Edinburgh Edinburgh, United Kingdom v.kovanovic@ed.ac.uk

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING

I J I T E ISSN: 2229-7367 3(1-2), 2012, pp. 233-237 SURVEY OF TEXT CLASSIFICATION ALGORITHMS FOR SPAM FILTERING K. SARULADHA 1 AND L. SASIREKA 2 1 Assistant Professor, Department of Computer Science and

### Guest Editors Introduction: Machine Learning in Speech and Language Technologies

Guest Editors Introduction: Machine Learning in Speech and Language Technologies Pascale Fung (pascale@ee.ust.hk) Department of Electrical and Electronic Engineering Hong Kong University of Science and

### Personalized Hierarchical Clustering

Personalized Hierarchical Clustering Korinna Bade, Andreas Nürnberger Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, D-39106 Magdeburg, Germany {kbade,nuernb}@iws.cs.uni-magdeburg.de

### Forecasting stock markets with Twitter

Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,

### Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

### WEKA Explorer User Guide for Version 3-4-3

WEKA Explorer User Guide for Version 3-4-3 Richard Kirkby Eibe Frank November 9, 2004 c 2002, 2004 University of Waikato Contents 1 Launching WEKA 2 2 The WEKA Explorer 2 Section Tabs................................

### Towards Effective Recommendation of Social Data across Social Networking Sites

Towards Effective Recommendation of Social Data across Social Networking Sites Yuan Wang 1,JieZhang 2, and Julita Vassileva 1 1 Department of Computer Science, University of Saskatchewan, Canada {yuw193,jiv}@cs.usask.ca

### Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

### 8. Machine Learning Applied Artificial Intelligence

8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name

### A Perspective Analysis of Traffic Accident using Data Mining Techniques

A Perspective Analysis of Traffic Accident using Data Mining Techniques S.Krishnaveni Ph.D (CS) Research Scholar, Karpagam University, Coimbatore, India 641 021 Dr.M.Hemalatha Asst. Professor & Head, Dept