Classification of Documents using Text Mining Package tm
|
|
|
- Kenneth Evans
- 10 years ago
- Views:
Transcription
1 Classification of Documents using Text Mining Package tm Pavel Brazdil LIAAD - INESC Porto LA FEP, Univ. of Porto Escola de verão Aspectos de processamento da LN F. Letras, UP, 4th June 2009 Overview 1.Introduction 2.Preprocessing document collection in tm 2.1 The dataset 20Newsgroups 2.2 Creating a directory with a corpus for 2 subgroups 2.3 Preprocessing operations 2.4 Creating Document-Term matrix 2.5 Converting the Matrix into Data Frame 2.6 Including class information 3. Classification of documents 3.1 Using a knn classifier 3.2 Using a Decision Tree classifier 3.3 Using a Neural Net 2
2 1. Introduction Package tm of R permits to process text documents in an effective manner The work with this package can be initiated using a command > library(tm) It permits to: - Create a corpus a collection of text documents - Provide various preprocessing operations - Create a Document-Term matrix - Inspect / manipulate the Document-Term matrix (e.g. convert into a data frame needed by classifiers) - Train a classifier on pre-classified Document-Term data frame - Apply the trained classifier on new text documents to obtain class predictions and evaluate performance 3 2 Classification of documents 2.1 The dataset 20Newsgroups This data is available from There are two directores: - 20news-bydate-train (for training a classifier) - 20news-bydate-test (for applying a classifier / testing) Each contains 20 directories, each containing the text documents belonging to one newsgroup The data can be copied to a PC you are working on (it is more convenient) 4
3 20Newsgroups Subgroup comp comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x Subgroup misc misc.forsale Subgroup rec rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey Subgroup sci sci.crypt sci.electronics sci.med sci.space Subgroup talk.politics talk.politics.guns talk.politics.mideast talk.politics.misc <- chosen here Subgroup religion talk.religion.misc <- chosen here alt.atheism soc.religion.christian Creating a Corpus This involves : - Selecting one of the newsgroups (e.g.sci.electronics) - Invoking Change directory in R to 20news-bydate-train - Invoking the instruction Corpus(): sci.electr.train <- Corpus( DirSource ( sci.electronics ), readercontrol=list(reader=readnewsgroup, language= en_us ) ) If we type : > sci.electr.train or length(sci.electr.train) the system responds A corpus with 591 documents Similarly, we can obtain documents from another class (e.g. talk.religion.misc): talk.religion.train (377 documents) Similarly, we can obtain the test data: sci.electr.test (393 documents) talk.religion.test (251 documents) 6
4 Example of one document sci.electr.train[[1]] An object of class NewsgroupDocument [1] "In article writes:" [2] ">[...]" [3] ">There are a variety of water-proof housings I could use but the real meat" [4] ">of the problem is the electronics...hence this posting. What kind of" [5] ">transmission would be reliable underwater, in murky or even night-time" [6] ">conditions? I'm not sure if sound is feasible given the distortion under-" [7] ">water...obviously direction would have to be accurate but range could be" [8] ">relatively short (I imagine 2 or 3 hundred yards would be more than enough)" [9] ">" [10] ">Jim McDonald" [35] " ET \"Tesla was 100 years ahead of his time. Perhaps now his time comes.\"" [36] "----" 7 Metadata accompanying the document Slot "Newsgroup": character(0) Slot "URI": file("sci.electronics/52434", encoding = "UTF-8") Slot "Cached": [1] TRUE Slot "Author": [1] "[email protected] (Eric H. Taylor)" Slot "DateTimeStamp": character(0) Slot "Description": [1] "" Slot "ID": [1] "1" Slot "Origin": character(0) Slot "Heading": [1] "Re: HELP_WITH_TRACKING_DEVICE" Slot "Language": [1] "en_us" Slot "LocalMetaData": list() See Annex 1 for more details on metadata (slots) etc. 8
5 2.3 Preprocessing The Objective of Preprocessing: Documents are normally represented using words, terms or concepts. Considering all possible words as potential indicators of a class can create problems in training a given classifier. It is desirable to avoid building a classifier using dependencies based on too few cases (spurious regularities). The aim of preprocessing is to help to do this. The function tmmap (available in tm ) can be used to carry out various preprocessing steps. The operation is applied to the whole corpus (there is not need to program this using a loop). 9 Preprocessing using tmmap The format of this function is as follows: tmmap(corpus, Function) The second argument Function determines what is to be done: asplain removes XML from the document, removesignature removes the author of the message, removewords, stopwords(language= english ) removes stopwords for the language specified stripwhitespace removes extra spaces, tmtolower transforms all upper case letters to lower case, removepunctuation removes punctuation symbols, removenumbers removes numbers, Example of use: > sci.electr.train <- tmmap(sci.electr.train, asplain) > sci.electr.train <- tmmap (sci.electr.train, removesignature) etc. This can be repeated for the other 3 collections of documents 10
6 Merging document collections Instead of repeating this for all 4 documents collections we can merge the four document collections and perform the preprocessing on the resulting large collection only once. This can be done using the function c(): > sci.rel.tr.ts <- c(sci.electr.train, talk.religion.train, sci.electr.test, talk.religion.test) > sci.rel.tr.ts A text document collection with 1612 text documents. We need to remember the indices of each document sub-collection to be able to separate the document collections later. sci.electr.train documents talk.religion.train documents (377 docs) sci.electr.test documents (393 docs) talk.religion.test documents (251 docs) One single collection is important for the next step (document-term matrix) 11 Preprocessing the entire document collection > sci.rel.tr.ts.p <- tmmap(sci.rel.tr.ts, asplain) > sci.rel.tr.ts.p <- tmmap(sci.rel.tr.ts.p, removesignature) > sci.rel.tr.ts.p <- tmmap(sci.rel.tr.ts.p, removewords, stopwords(language="english")) > sci.rel.tr.ts.p <- tmmap(sci.rel.tr.ts.p, stripwhitespace) > sci.rel.tr.ts.p <- tmmap(sci.rel.tr.ts.p, tmtolower) > sci.rel.tr.ts.p <- tmmap(sci.rel.tr.ts.p, removepunctuation) > sci.rel.tr.ts.p <- tmmap(sci.rel.tr.ts.p, removenumbers) 12
7 Result of preprocessing of one document Original document: > sci.rel.tr.ts[[1]] (=sci.electr.train[[1]]) An object of class NewsgroupDocument [1] "In article writes:" [2] ">[...]" [3] ">There are a variety of water-proof housings I could use but the real meat" [4] ">of the problem is the electronics...hence this posting. What kind of" [5] ">transmission would be reliable underwater, in murky or even night-time" [6] ">conditions? I'm not sure if sound is feasible given the distortion under-" Pre-processed document: undesirable > sci.rel.tr.ts.p[[1]] [1] in article fbaeffaesoprutgersedu mcdonaldaesoprutgersedu writes [2] [3] there variety waterproof housings real meat [4] of electronicshence posting [5] transmission reliable underwater murky nighttime [6] conditions sound feasible distortion under Creating Document-Term Matrix (DTM) Existing classifiers that exploit propositional representation, (such as knn, NaiveBayes, SVM etc.) require that data be represented in the form of a table, where: each row contains one case (here a document), each column represents a particular atribute / feature (here a word). The function DocumentTermMatrix(..) can be used to create such a table. The format of this function is: DocumentTermMatrix(<DocCollection>, control=list(<options>)) Simple Example: > DocumentTermMatrix(sci.rel.tr.ts.p) Note: This command is available only in R Version In R Version the function available is TermDocMatrix(<DocCollection>) 14
8 Creating Document-Term Matrix (DTM) Simple Example: > DocumentTermMatrix(sci.rel.tr.ts.p) A document-term matrix (1612 documents, terms) Non-/sparse entries: / Sparsity : 100% Maximal term length: 135 Weighting : term frequency (tf) problematic 15 Options of DTM Most important options of DTM: weighting=tfidf minwordlength=wl mindocfreq=nd weighting is Tf-Idf the minimum word length is WL each word must appear at least in ND docs Other options of DTM These are not really needed, if preprocessing has been carried out: stemming = TRUE stemming is applied stopwords=true stopwords are eliminated removenumbers=true numers are eliminated Improved example: > dtm.mx.sci.rel <- DocumentTermMatrix( sci.rel.tr.ts.p, control=list(weighting=weighttfidf, minwordlength=2, mindocfreq=5)) 16
9 Generating DTM with different options > dtm.mx.sci.rel <- DocumentTermMatrix( sci.rel.tr.ts.p, control=list(minwordlength=2, mindocfreq=5)) > dtm.mx.sci.rel A document-term matrix (1612 documents, 1169 terms) Non-/sparse entries: 2.237/ Sparsity : 100% Maximal term length: 26 Weighting : term frequency (tf) better > dtm.mx.sci.rel.tfidf <- DocumentTermMatrix( sci.rel.tr.ts.p, control=list(weighing=weighttfidf, minwordlength=2, mindocfreq=5)) > dtm.mx.sci.rel.tfidf A document-term matrix (1612 documents, 1169 terms) Non-/sparse entries: 2.237/ Sparsity : 100% Maximal term length: 26 Weighting : term frequency - inverse document frequency (tf-idf) 17 Inspecting the DTM Function dim(dtm) permits to obtain the dimensions of the DTM matrix. Ex. > dim(dtm.mx.sci.rel) Inspecting some of the column names: (ex. 10 columns / words starting with column / word 101) > colnames(dtm.mx.sci.rel) [101:110] [1] "blinker" "blood" "blue" "board" "boards" "body" "bonding" "book" "books" [10] "born" 18
10 Inspecting the DTM Inspecting a part of the DTM matrix: (ex. the first 10 documents and 20 columns) > inspect( dtm.mx.sci.rel )[1:10,101:110] Docs blinker blood blue board boards body bonding book books born As we can see, the matrix is very sparse. By chance there are no values other than 0s. Note: The DTM is not and ordinary matrix, as it exploits object-oriented representation (includes meta-data). The function inspect(..) converts this into an ordinary matrix which can be inspected. 19 Finding Frequent Terms The function findfreqterms(dtm, ND) permits to find all the terms that appear in at least ND documents. Ex. > freqterms100 <- findfreqterms( dtm.mx.sci.rel, 100) > freqterms100 [1] "wire" "elohim" "god" "jehovah" "lord" From talk.religion > freqterms40 <- findfreqterms(dtm.mx.sci.rel, 40) > freqterms40 [1] "cable" "circuit" "ground" "neutral" "outlets" "subject" "wire" "wiring" [9] "judas" "ra" "christ" "elohim" "father" "god" "gods" "jehovah" [17] "jesus" "lord" "mcconkie" "ps" "son" "unto" > freqterms10 <- findfreqterms(dtm.mx.sci.rel, 10) > length(freqterms10) [1]
11 2.5 Converting TDM into a Data Frame Existing classifiers in R (such as knn, NaiveBayes, SVM etc.) require that data be represented in the form of a data frame (particular representation of tables). So, we need to convert DT matrix into a DT data frame: > dtm.sci.rel <- as.data.frame(inspect( dtm.mx.sci.rel )) > rownames(dtm.sci.rel)<- 1:nrow(dtm.mx.sci.rel) > dtm.sci.rel$wire[180:200] [1] > dtm.sci.rel$god[180:200] [1] sci.electr portion talk.religion portion > dtm.sci.rel$wire[( ):( )] [1] > dtm.sci.rel$god[( ):( )] [1] sci.electr portion talk.religion portion 21 Converting TDM into a Data Frame We repeat this also for the tf.idf version: So, we need to convert DT matrix into a DT data frame: > dtm.sci.rel.tfidf <- as.data.frame(inspect(dtm.mx.sci.rel.tfidf)) sci.electr portion > round(dtm.sci.rel.tfidf$wire[180:200],1) [1] > round(dtm.sci.rel.tfidf$god[180:200],1) talk.religion portion [1] > round(dtm.sci.rel.tfidf $wire[( ):( )],1) [1] > round(dtm.sci.rel.tfidf$god[( ):( )],1) sci.electr portion talk.religion portion 22
12 2.6 Appending class information This includes two steps: - Generate a vector with class information, - Append the vector as the last column to the data frame. Step 1. Generate a vector with class values (e.g. sci, rel ) We know that (see slide 6): sci.electr.train 591 docs talk.religion.train 377 docs sci.electr.test 393 docs talk.religion.test docs So: > class <- c(rep( sci,591), rep( rel,377), rep( sci,393), rep( rel,251)) > class.tr <- c(rep("sci",591), rep("rel",377)) > class.ts <- c(rep("sci",393), rep("rel",251)) Step2. Append the class vector as the last column to the data frame > dtm.sci.rel <- cbind( dtm.sci.rel, class) > ncol(dtm.sci.rel) [1] 1170 (the number of columns has increased by 1) 23 3 Classification of Documents 3.1 Using a KNN classifier Preparing the Data The classifier knn in R requires that the training data and test data have the same size. So: sci.electr.train 591 docs : Use the first 393 docs talk.religion.train 377 docs: Use the first 251 docs sci.electr.test 393 docs: Use all talk.religion.test docs: Use all > l1 <- length(sci.electron.train) (591 docs) > l2 <- length(talk.religion.train) (377 docs) > l3 <- length(sci.electron.test) (393 docs) > l4 <- length(talk.religion.test) (251 docs) > m1 <- min(l1,l3) (393 docs) > m2 <- min(l2,l4) (251 docs) > m3 <- min(l1,l3) (393 docs) > m4 <- min(l2,l4) (251 docs) 24
13 Preparing the training data Generating the training data: Calculate the last column of data (excluding the class): > last.col <- ncol( dtm.sci.rel )-1 Generate the first part of the training data with lines of sci.electr.train > sci.rel.tr <- dtm.sci.rel[1: m1, 1:last.col] Generate a vector with the class values: > class.tr <- dtm.sci.rel[1: m1, last.col+1] Extend the training data by adding lines from talk.religion.train > sci.rel.k.tr[(m1+1):(m1+m2),] <- dtm.sci.rel[(l1+1):(l1+m2),1:last.col] > nrow(sci.rel.k.tr) 644 Add class values at the end of the class vector: > class.k.tr[(m1+1):(m1+m2)] <- dtm.sci.rel[(l1+1):(l1+m2), last.col+1] > length(class.tr) Preparing the test data Generate the test data using the appropriate lines of sci.electr.test and talk.religion.test > sci.rel.k.ts <- dtm.sci.rel[(l1+l2+1):(l1+l2+m3+m4),1:last.col] > nrow(sci.rel.k.ts) 644 > class.k.ts <- dtm.sci.rel[(l1+l2+1):(l1+l2+m3+m4),last.col+1] 26
14 3.2 Classification of Docs using a Decision Tree Here we will use the same training / test data as in > library(rpart) Here we will use just 20 frequent terms as attributes > freqterms40 [1] "cable" "circuit" "ground" "neutral" "outlets" "subject" "wire" [8] "wiring" "judas" "ra" "christ" "elohim" "father" "god" [15] "gods" "jehovah" "jesus" "lord" "mcconkie" "ps" "son" [22] "unto" > dt <- rpart(class ~ cable + circuit + ground + neutral + outlets + subject + wire + wiring + judas + ra + christ + elohim + father + god + gods + jehovah + jesus + lord + mcconkie + ps + son + unto, sci.rel.tr) 27 Inspecting the Decision Tree > dt n= 968 node), split, n, loss, yval, (yprob) * denotes terminal node 1) root sci ( ) 2) god>= rel ( ) * 3) god< sci ( ) 6) jesus>= rel ( ) * 7) jesus< sci ( ) * 28
15 Evaluating the Decision Tree Decision Tree Test data > dt.predictions.ts <- predict(dt, sci.rel.ts, type="class") > table(class.ts, dt.predictions.ts) dt.predictions.ts class.ts rel sci rel sci Classification of Documents using a Neural Net Acknowledgements: Rui Pedrosa, M.Sc. Student, M. ADSAD, 2009 > nnet.classifier <- nnet(class ~., data=sci.rel.tr, size=2, rang=0.1, decay=5e-4, maxit=200) > predictions <- predict(nnet.classifier, sci.rel.ts, type= class ) The errors reported on a similar task were quite good - about 17% 30
16 5. Calculating the evaluation measures The basis for all calculations is the confusion matrix, such as: > conf.mx <- table(class.ts, predictions.ts) > conf.mx class.ts rel sci rel sci > error.rate <- (sum(conf.mx) diag(conf.mx)) / sum(conf.mx) > tp <- conf.mx[1,1] (true positives) > fp <- conf.mx[2,1] (false positives) > tn <- conf.mx[2,2] (true negatives) > fn <- conf.mx[1,2] (false negatives) > error.rate <- (tp + tn) / (tp + tn + fp + fn) 31 Evaluation measures Recall = TP / (TP + FN) * 100% + - +^ TP FP -^ FN TN Precision = TP / (TP + FP) * 100% + +^ TP -^ FN - FP TN > recall = tp / (tp + fn) > precision = tp / (tp + fp) > f1 = 2 * precision * recall / (precision + recall) 32
Active Learning for Multi-Class Logistic Regression
Active Learning for Multi-Class Logistic Regression Andrew I. Schein and Lyle H. Ungar The University of Pennsylvania Department of Computer and Information Science 3330 Walnut Street Philadelphia, PA
Text Mining with R Twitter Data Analysis 1
Text Mining with R Twitter Data Analysis 1 Yanchang Zhao http://www.rdatamining.com R and Data Mining Workshop for the Master of Business Analytics course, Deakin University, Melbourne 28 May 2015 1 Presented
Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class
Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents
Document Clustering hrough Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational ime Reduction of Large Scale Documents Bishnu Prasad Gautam, Dipesh Shrestha, Members IAENG 1 Abstract
Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
How To Write A Blog Post In R
R and Data Mining: Examples and Case Studies 1 Yanchang Zhao [email protected] http://www.rdatamining.com April 26, 2013 1 2012-2013 Yanchang Zhao. Published by Elsevier in December 2012. All rights
Data Mining with R. Text Mining. Hugh Murrell
Data Mining with R Text Mining Hugh Murrell reference books These slides are based on a book by Yanchang Zhao: R and Data Mining: Examples and Case Studies. http://www.rdatamining.com for further background
Mining a Corpus of Job Ads
Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
Introduction to basic Text Mining in R.
Introduction to basic Text Mining in R. As published in Benchmarks RSS Matters, January 2014 http://web3.unt.edu/benchmarks/issues/2014/01/rss-matters Jon Starkweather, PhD 1 Jon Starkweather, PhD [email protected]
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
TEXT MINING WITH LUCENE AND HADOOP: DOCUMENT CLUSTERING WITH FEATURE EXTRACTION DIPESH SHRESTHA A THESIS SUBMITTED FOR THE FULFILLMENT RESEARCH DEGREE
TEXT MINING WITH LUCENE AND HADOOP: DOCUMENT CLUSTERING WITH FEATURE EXTRACTION BY DIPESH SHRESTHA A THESIS SUBMITTED FOR THE FULFILLMENT OF RESEARCH DEGREE TO WAKHOK UNIVERSITY 2009 ABSTRACT In this
Distributed Text Mining with tm
Distributed Text Mining with tm Stefan Theußl 1 Ingo Feinerer 2 Kurt Hornik 1 Institute for Statistics and Mathematics, WU Vienna 1 Institute of Information Systems, DBAI Group Technische Universität Wien
Projektgruppe. Categorization of text documents via classification
Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction
Statistical Feature Selection Techniques for Arabic Text Categorization
Statistical Feature Selection Techniques for Arabic Text Categorization Rehab M. Duwairi Department of Computer Information Systems Jordan University of Science and Technology Irbid 22110 Jordan Tel. +962-2-7201000
W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015
W. Heath Rushing Adsurgo LLC Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare Session H-1 JTCC: October 23, 2015 Outline Demonstration: Recent article on cnn.com Introduction
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
Feature Subset Selection in E-mail Spam Detection
Feature Subset Selection in E-mail Spam Detection Amir Rajabi Behjat, Universiti Technology MARA, Malaysia IT Security for the Next Generation Asia Pacific & MEA Cup, Hong Kong 14-16 March, 2012 Feature
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected]
Université de Montpellier 2 Hugo Alatrista-Salas : [email protected] WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.
International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Technical Report. The KNIME Text Processing Feature:
Technical Report The KNIME Text Processing Feature: An Introduction Dr. Killian Thiel Dr. Michael Berthold [email protected] [email protected] Copyright 2012 by KNIME.com AG
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter
VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
Text Mining Handbook
Louise Francis, FCAS, MAAA, and Matt Flynn, PhD Abstract Motivation. Provide a guide to open source tools that can be used as a reference to do text mining Method. We apply the text processing language
Content-Based Recommendation
Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches
Statistical text mining using R. Tom Liptrot The Christie Hospital
Statistical text mining using R Tom Liptrot The Christie Hospital Motivation Example 1: Example 2: Dickens to matrix Electronic patient records Dickens to Matrix: a bag of words IT WAS the best of times,
Programming Exercise 3: Multi-class Classification and Neural Networks
Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks
Email Spam Detection A Machine Learning Approach
Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn
Connecting the dots between
Connecting the dots between Research Team: Carla Abreu, Jorge Teixeira, Prof. Eugénio Oliveira Domain: News Research Keywords: Natural Language Processing, Information Extraction, Machine Learning. Objective
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India [email protected] 2 School of Computer
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Detecting E-mail Spam Using Spam Word Associations
Detecting E-mail Spam Using Spam Word Associations N.S. Kumar 1, D.P. Rana 2, R.G.Mehta 3 Sardar Vallabhbhai National Institute of Technology, Surat, India 1 [email protected] 2 [email protected]
Data Mining with R. Decision Trees and Random Forests. Hugh Murrell
Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
Text Analytics Illustrated with a Simple Data Set
CSC 594 Text Mining More on SAS Enterprise Miner Text Analytics Illustrated with a Simple Data Set This demonstration illustrates some text analytic results using a simple data set that is designed to
Machine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
An Approach to Detect Spam Emails by Using Majority Voting
An Approach to Detect Spam Emails by Using Majority Voting Roohi Hussain Department of Computer Engineering, National University of Science and Technology, H-12 Islamabad, Pakistan Usman Qamar Faculty,
Hands-On Data Science with R Text Mining. [email protected]. 5th November 2014 DRAFT
Hands-On Data Science with R Text Mining [email protected] 5th November 2014 Visit http://handsondatascience.com/ for more Chapters. Text Mining or Text Analytics applies analytic tools to learn
Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers
Data Mining as a tool to Predict the Churn Behaviour among Indian bank customers Manjit Kaur Department of Computer Science Punjabi University Patiala, India [email protected] Dr. Kawaljeet Singh University
A Primer on Text Analytics
A Primer on Text Analytics From the Perspective of a Statistics Graduate Student Bradley Turnbull Statistical Learning Group North Carolina State University March 27, 2015 Bradley Turnbull Text Analytics
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract- Product Review System
COC131 Data Mining - Clustering
COC131 Data Mining - Clustering Martin D. Sykora [email protected] Tutorial 05, Friday 20th March 2009 1. Fire up Weka (Waikako Environment for Knowledge Analysis) software, launch the explorer window
TF-IDF. David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt
TF-IDF David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture6-tfidf.ppt Administrative Homework 3 available soon Assignment 2 available soon Popular media article
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Application of Linear Algebra in. Electrical Circuits
Application of Linear Algebra in Electrical Circuits Seamleng Taing Math 308 Autumn 2001 December 2, 2001 Table of Contents Abstract..3 Applications of Linear Algebra in Electrical Circuits Explanation..
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS
FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS Breno C. Costa, Bruno. L. A. Alberto, André M. Portela, W. Maduro, Esdras O. Eler PDITec, Belo Horizonte,
A Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT
Analysis of Tweets for Prediction of Indian Stock Markets
Analysis of Tweets for Prediction of Indian Stock Markets Phillip Tichaona Sumbureru Department of Computer Science and Engineering, JNTU College of Engineering Hyderabad, Kukatpally, Hyderabad-500 085,
6. Cholesky factorization
6. Cholesky factorization EE103 (Fall 2011-12) triangular matrices forward and backward substitution the Cholesky factorization solving Ax = b with A positive definite inverse of a positive definite matrix
Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data
Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 2 nd, 2014 Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition
SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum
Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?
Clustering through Decision Tree Construction in Geology
Nonlinear Analysis: Modelling and Control, 2001, v. 6, No. 2, 29-41 Clustering through Decision Tree Construction in Geology Received: 22.10.2001 Accepted: 31.10.2001 A. Juozapavičius, V. Rapševičius Faculty
Mining the Software Change Repository of a Legacy Telephony System
Mining the Software Change Repository of a Legacy Telephony System Jelber Sayyad Shirabad, Timothy C. Lethbridge, Stan Matwin School of Information Technology and Engineering University of Ottawa, Ottawa,
USING MAGENTO TRANSLATION TOOLS
Magento Translation Tools 1 USING MAGENTO TRANSLATION TOOLS Magento translation tools make the implementation of multi-language Magento stores significantly easier. They allow you to fetch all translatable
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Levent EREN [email protected] A-306 Office Phone:488-9882 INTRODUCTION TO DIGITAL LOGIC
Levent EREN [email protected] A-306 Office Phone:488-9882 1 Number Systems Representation Positive radix, positional number systems A number with radix r is represented by a string of digits: A n
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Lecture 15 - ROC, AUC & Lift Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk [email protected] Tom Kelsey ID5059-17-AUC
1 Topic. 2 Scilab. 2.1 What is Scilab?
1 Topic Data Mining with Scilab. I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model
AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different
Operating Systems. and Windows
Operating Systems and Windows What is an Operating System? The most important program that runs on your computer. It manages all other programs on the machine. Every PC has to have one to run other applications
Practical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS
DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.
Lowering social cost of car accidents by predicting high-risk drivers
Lowering social cost of car accidents by predicting high-risk drivers Vannessa Peng Davin Tsai Shu-Min Yeh Why we do this? Traffic accident happened every day. In order to decrease the number of traffic
Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object
Web Forensic Evidence of SQL Injection Analysis
International Journal of Science and Engineering Vol.5 No.1(2015):157-162 157 Web Forensic Evidence of SQL Injection Analysis 針 對 SQL Injection 攻 擊 鑑 識 之 分 析 Chinyang Henry Tseng 1 National Taipei University
Statistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
2+2 Just type and press enter and the answer comes up ans = 4
Demonstration Red text = commands entered in the command window Black text = Matlab responses Blue text = comments 2+2 Just type and press enter and the answer comes up 4 sin(4)^2.5728 The elementary functions
Characterizing and Predicting Blocking Bugs in Open Source Projects
Characterizing and Predicting Blocking Bugs in Open Source Projects Harold Valdivia Garcia and Emad Shihab Department of Software Engineering Rochester Institute of Technology Rochester, NY, USA {hv1710,
Impact of Feature Selection Technique on Email Classification
Impact of Feature Selection Technique on Email Classification Aakanksha Sharaff, Naresh Kumar Nagwani, and Kunal Swami Abstract Being one of the most powerful and fastest way of communication, the popularity
IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION
http:// IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION Harinder Kaur 1, Raveen Bajwa 2 1 PG Student., CSE., Baba Banda Singh Bahadur Engg. College, Fatehgarh Sahib, (India) 2 Asstt. Prof.,
Automated News Item Categorization
Automated News Item Categorization Hrvoje Bacan, Igor S. Pandzic* Department of Telecommunications, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia {Hrvoje.Bacan,Igor.Pandzic}@fer.hr
Databases in Organizations
The following is an excerpt from a draft chapter of a new enterprise architecture text book that is currently under development entitled Enterprise Architecture: Principles and Practice by Brian Cameron
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
Performance Metrics. number of mistakes total number of observations. err = p.1/1
p.1/1 Performance Metrics The simplest performance metric is the model error defined as the number of mistakes the model makes on a data set divided by the number of observations in the data set, err =
SVM Ensemble Model for Investment Prediction
19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of
Private Record Linkage with Bloom Filters
To appear in: Proceedings of Statistics Canada Symposium 2010 Social Statistics: The Interplay among Censuses, Surveys and Administrative Data Private Record Linkage with Bloom Filters Rainer Schnell,
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du [email protected] University of British Columbia
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
Data Mining is the process of knowledge discovery involving finding
using analytic services data mining framework for classification predicting the enrollment of students at a university a case study Data Mining is the process of knowledge discovery involving finding hidden
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
