Connecting the dots between

1 Connecting the dots between Research Team: Carla Abreu, Jorge Teixeira, Prof. Eugénio Oliveira Domain: News Research Keywords: Natural Language Processing, Information Extraction, Machine Learning.

2 Objective
"... larger and larger amounts of news content is published every day. With this much data, it is often easy to miss the big picture." (Shahaf and Guestrin, 2010)
Objective: automatically aggregate similar news and build news chains.
Reference: (Shahaf and Guestrin, 2010): Connecting the Dots Between News Articles

3 How to do this?
Similarity; Keywords extraction; News group; News group / Keywords; Arch; News chains

4 Similarity
Aim: cluster similar news.
Challenges:
- Which news data are important for the similarity process?
- How can we use that data?
- Which methods can we use in this process?
- How can we evaluate this process?

5 Similarity
Filter (examples, translated from Portuguese):
- Press review: highlights from "O Jogo". Today's newspapers. Mourinho says his Brazilians played very well. They tried to embarrass him with the 6-2 thrashing suffered by Portugal.
- Press review: highlights from "Jornal de Notícias". Today's newspapers. Government pressures school boards. Ministry considers evaluating executive councils under the public-sector system.
Normalization: remove punctuation marks; remove patterns; remove stop-words (snowball); word stemming (ptstemmer)
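The normalization steps listed on this slide can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the stop-word list, the pattern list, and `naive_stem` are hypothetical stand-ins for the Snowball stop-word list, the unspecified "patterns", and ptstemmer.

```python
import re

# Hypothetical stop-word list for illustration; the slides use the Snowball list.
STOP_WORDS = {"a", "o", "as", "os", "de", "do", "da", "dos", "das", "e", "que", "com", "por"}

# Hypothetical pattern to strip; the slides do not say which patterns are removed.
PATTERNS = [re.compile(r"https?://\S+")]


def naive_stem(word):
    """Placeholder stemmer that strips a few common Portuguese suffixes.

    This is NOT ptstemmer; it only stands in for the stemming step.
    """
    for suffix in ("mente", "ções", "ção", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, strip patterns and punctuation, drop stop-words, stem."""
    text = text.lower()
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    # Replace every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(normalize("Governo pressiona direcções das escolas."))
```

Keeping each step as a separate operation makes it easy to swap the placeholder stemmer for a real Portuguese stemmer without touching the rest.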

6 Similarity
News comparison (Title, Teaser, Content):
Similarity: Title - ST*; Teaser (S) - STe*; Content - SC*.
Temporal window: T
* Values between 0 and 1

7 Similarity
First approach: Similar Tree (manual threshold assignment; empirical values)
Second approach: classification methods (provided by scikit-learn; automatic approach)
- Decision Tree
- Support Vector Classifier (SVC)
- SVC Linear
- Random Forest
- Gaussian

8 Similarity
Features: Title Similarity; Teaser Similarity; Content Similarity
Variables: S = 0,2; T = 1
Algorithm: Levenshtein
Stemmer: Porter Stemmer
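A minimal sketch of a Levenshtein-based similarity score for one feature (for example, two titles). The slides state only that the algorithm is Levenshtein and that similarity values lie between 0 and 1; dividing the edit distance by the longer string's length is an assumption made here to obtain that range.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]; the normalization by max length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(similarity("governo pressiona escolas", "governo pressiona as escolas"))
```

Applying the same function to normalized titles, teasers and contents would give the three feature values described above.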

9 Similarity
Dataset: 3 million Portuguese news articles published between 2008 and 2013
Training set:
- Select 100 news articles from each day (between 23 Dec 2012 and 22 Jan 2013)
- Randomly annotate 371 comparisons
Test sets:
- TS1: select 501 distinct news articles from 19 Nov; randomly annotate 5101 comparisons
- TS2: select 210 distinct news articles from 19 Nov; randomly annotate 1047 comparisons

10 Similarity Annotation Interface

11 Similarity
Experimental Setup
Precision (P): $P = \frac{TP}{TP + FP}$
Recall (R): $R = \frac{TP}{TP + FN}$
Accuracy (A): $A = \frac{TP + TN}{TP + TN + FP + FN}$
F-measure (F): $F = \frac{2 \cdot P \cdot R}{P + R}$
- True Positives (TP): number of similar news correctly identified;
- False Positives (FP): number of non-similar news identified as similar;
- True Negatives (TN): number of non-similar news correctly identified;
- False Negatives (FN): number of similar news identified as non-similar.
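The four metrics can be computed directly from the confusion counts. The sketch below is illustrative; the counts in the usage line are made-up numbers, not results from the study.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Return precision, recall, accuracy and F-measure from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"P": precision, "R": recall, "A": accuracy, "F": f_measure}


# Illustrative counts only (not taken from the slides).
print(evaluation_metrics(tp=6, fp=2, tn=10, fn=2))
```

The guards against zero denominators matter in this setting because most annotated comparisons are negatives, so a classifier can easily predict no positives at all.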

12 Similarity
Results and Analyses

Method        P      R      A      F
DecisionTree  0,958  0,932  0,985  0,945
SVC           0,993  0,963  0,994  0,978
SVC Linear    0,991  0,963  0,994  0,977
RandomForest  0,987  0,960  0,993  0,974
Gaussian      0,701  0,964  0,956  0,812
Similar Tree  0,999  0,839  0,974  0,912

- RandomForest: random behaviour
- Gaussian: worst performance
- SVC results are better than Decision Tree in all metrics
- The SVCs have similar results
- SVC: best combination of evaluation metrics

13 News Group

14 News Group
News 2014 (3 April to 20 June)
Number of news:
Cluster number:
Average amount of news per cluster: ~ 3,7
March 2014
Number of news:
Number of news in news group: 8278

15 Keywords extraction
Aim: extract relevant terms from text.
Challenges:
- Can any word be considered a keyword?
- Can a news item be described by a simple word, a compound word, or an entity?
- How can we extract useful keywords from the news?

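16 Keywords extraction
Approach
Explicit keywords: Simple (uni-grams); Compound (n-grams)
Implicit keywords
Entities
Example terms: Governo (government); rebeldes (rebels); busca (search); competição (competition); atentado à bomba (bomb attack); avião da Malaysia Airlines (Malaysia Airlines plane); fase de grupos (group stage); Bagdade (Baghdad); Malásia (Malaysia); Rui Patrício; Tribunal Constitucional (Constitutional Court); Presidente República (President of the Republic)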

17 Keywords extraction
Explicit Keywords
- POS Tagger (Pablo Gamallo) [n-grams]
- Normalization: remove patterns; stemmer [uni-grams]
- Term frequency - inverse document frequency (TF-IDF), where:
  o(w, DOC): number of occurrences of WORD in DOCUMENT;
  npalavras(doc): number of words in DOCUMENT;
  docs(all): number of documents in the document collection;
  docs(w, ALL): number of documents in the document collection which contain WORD
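The TF-IDF formula itself did not survive in the transcript. A common formulation that uses exactly the four quantities listed is tf-idf(w, doc) = (o(w, doc) / npalavras(doc)) * log(docs(all) / docs(w, all)); the sketch below assumes that form, and the slide's exact variant (for example, the log base or smoothing) may differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF scores for a list of tokenized documents.

    Assumed form: (occurrences / words in document) * log(N / document frequency).
    Returns one {word: score} dictionary per document.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        n_words = len(tokens)
        scores.append({
            word: (count / n_words) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


docs = [["governo", "escolas", "governo"], ["governo", "futebol"]]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A word that appears in every document gets a score of zero under this form, which is why very common terms drop out of the keyword ranking.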

18 Keywords extraction
Implicit Keywords
- Normalization
- Relation between words (Ventura, Silva 2013): Corr(A, B) is based on Pearson's correlation coefficient; |D| is the number of documents in corpus D; di is the i-th document in D; size(di) is its number of words and f(A, di) is the frequency of term A in di.
- Corr(A, B) ranges from -1 (no correlation) to +1 (strong correlation).
Reference: (Ventura, Silva 2013): Automatic Extraction of Explicit and Implicit Keywords to Build Document Descriptors
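The correlation formula is not preserved in the transcript. The sketch below is one reading consistent with the definitions given: each term is represented by its relative frequency f(A, di) / size(di) in every document, and Corr(A, B) is the Pearson correlation of those two series. The function name and the handling of constant series are assumptions.

```python
import math


def corr(term_a, term_b, documents):
    """Pearson correlation between the per-document relative frequencies of two terms.

    documents is a list of token lists; empty documents are skipped.
    Returns 0.0 when either term has constant relative frequency (an assumption).
    """
    docs = [d for d in documents if d]
    if not docs:
        return 0.0
    freq_a = [d.count(term_a) / len(d) for d in docs]
    freq_b = [d.count(term_b) / len(d) for d in docs]
    mean_a = sum(freq_a) / len(docs)
    mean_b = sum(freq_b) / len(docs)
    cov_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(freq_a, freq_b))
    var_a = sum((x - mean_a) ** 2 for x in freq_a)
    var_b = sum((y - mean_b) ** 2 for y in freq_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov_ab / math.sqrt(var_a * var_b)


docs = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"]]
print(corr("a", "b", docs), corr("a", "c", docs))
```

Under this reading, a term that never appears in a document can still be ranked as an implicit keyword for it if it correlates strongly with terms that do appear.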

19 Keywords extraction
Entities: Find Entities
Example text (translated from Portuguese): The average age of the respondents was 11 at the start of the study, and three quarters of them were boys. Young people who play video games are more prone to think and act aggressively, according to a study of more than [...] students in Singapore, released today. The study, published by the journal of the American Medical Association and based on three years of work with young people, concluded, based on the students' answers, that there was a link between frequent use of video games and high rates of aggressive behaviour and thoughts.

20 Keywords extraction
Dataset: 4789 news articles from January to December (2012)
Test set:
- select one day from each month of 2012
- select three hours of each day
- extract keywords
- select 10 news articles from each day
- check the keywords manually

21 Keywords extraction
Experimental Setup
- PalavrasChaveRepresentativas (representative keywords): number of words that represent the news item
- PalavrasChaveAtribuídas (assigned keywords): number of words attributed to the news item
- N: number of news items
Results

Keyword type         Evaluation
Explicit - Simple    0,732
Explicit - Compound  0,762
Implicit             ~0
Entity               0,804

22 News Group / Keywords
Aim: associate keywords with news groups according to their weight

23 Arch Aim: Connect groups of news Challenges: How can we aggregate news clusters? What fields need to be considered?

24 Arch
Approach (explicit simple keywords, entities and personalities)
Normalization: lowercase; explicit simple keywords - reduce words to their stem
Find personalities: from entities and explicit compound keywords, using Verbetes.
Distance:
- ka: number of words in news group a;
- kb: number of words in news group b;
- Wkja: weight of word j in news group a;
- Wkib: weight of word i in news group b;
- D1 and D2: range from 0 to 1

25 Arch
Approach (explicit compound keywords)
Normalization: lowercase; remove stop-words
All words have the same weight
Distance: edit distance algorithm - q-grams - q = 3
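A minimal sketch of a q-gram distance with q = 3. The slide says only "edit distance algorithm - qgrams - q=3"; interpreting it as the classic q-gram profile distance (the sum of absolute differences between q-gram counts) is an assumption, and the function names are illustrative.

```python
from collections import Counter


def qgrams(text, q=3):
    """Multiset of all overlapping substrings of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=3):
    """Sum of absolute differences between the q-gram counts of two strings."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    return sum(abs(grams_a[g] - grams_b[g]) for g in set(grams_a) | set(grams_b))


# Compound keywords are assumed to be lowercased and stop-word-free already.
print(qgram_distance("tribunal constitucional", "tribunal constitucional"))
print(qgram_distance("tribunal constitucional", "tribunal europeu"))
```

A distance of 0 means the two keywords share exactly the same trigrams; larger values indicate fewer shared trigrams, which tolerates small spelling variations better than exact matching.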

26 Arch
Gold standard: 1408 news articles (January 2012); 131 groups of news
Training set: 5671 comparisons between groups of news
- 277 connections
- 5394 non-connections
Test set: 300 comparisons between groups of news
- 26 connections
- 247 non-connections

27 Arch
Experiments: 6 experiments
Metrics to calculate distance (D1 and D2)
Constraints on comparisons:
- number of entities
- number of personalities
- similarity between explicit simple keywords

28 Arch
Experimental Setup
Precision (P): $P = \frac{TP}{TP + FP}$
Recall (R): $R = \frac{TP}{TP + FN}$
- True Positives (TP): number of connections correctly identified;
- False Positives (FP): number of non-connections identified as connections;
- True Negatives (TN): number of non-connections correctly identified;
- False Negatives (FN): number of connections identified as non-connections.

29 Arch
Results and Analyses
Experiments
Metrics:
a. Explicit simple keyword: D1
b. Personalities: D1
c. Entities: D2
Constraints:
a. Entities >= 3
b. Explicit simple keyword similarity >= 0,2
Best result: Gaussian
Precision: 0,941
Recall: 0,308

30 News Chains

31 Thanks!
Carla Abreu
Acknowledgement: Bruno Tavares
Connecting the dots between news


More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Supervised Feature Selection & Unsupervised Dimensionality Reduction Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or

More information

Car Insurance. Prvák, Tomi, Havri

Car Insurance. Prvák, Tomi, Havri Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms

Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms Selected Topics in Applied Machine Learning: An integrating view on data analysis and learning algorithms ESSLLI 2015 Barcelona, Spain http://ufal.mff.cuni.cz/esslli2015 Barbora Hladká hladka@ufal.mff.cuni.cz

More information

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model AI TERM PROJECT GROUP 14 1 Anti-Spam Filter Based on,, and model Yun-Nung Chen, Che-An Lu, Chao-Yu Huang Abstract spam email filters are a well-known and powerful type of filters. We construct different

More information

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015 Sentiment Analysis D. Skrepetos 1 1 Department of Computer Science University of Waterloo NLP Presenation, 06/17/2015 D. Skrepetos (University of Waterloo) Sentiment Analysis NLP Presenation, 06/17/2015

More information

arxiv:1301.4944v1 [stat.ml] 21 Jan 2013

arxiv:1301.4944v1 [stat.ml] 21 Jan 2013 Evaluation of a Supervised Learning Approach for Stock Market Operations Marcelo S. Lauretto 1, Bárbara B. C. Silva 1 and Pablo M. Andrade 2 1 EACH USP, 2 IME USP. 1 Introduction arxiv:1301.4944v1 [stat.ml]

More information

Automated Severity Assessment of Software Defect Reports

Automated Severity Assessment of Software Defect Reports Automated Severity Assessment of Software Defect Reports Tim Menzies Lane Department of Computer Science, West Virginia University PO Box 6109, Morgantown, WV, 26506 304 293 0405 tim@menzies.us Abstract

More information

Date : July 28, 2015

Date : July 28, 2015 Date : July 28, 2015 Awesome(Team( 2! Who"are"we?" Menish Gupta Lukas Osborne Founder!&!CEO! 9+!years!@!Amex!! 5!years!@!Startups!in!NYC! B.S.!/!M.S.!Comp!Sci.!NJIT! Data!Science! 7!PublicaIons! 5!years!@!CISMM!Labs!

More information

Prova escrita de conhecimentos específicos de Inglês

Prova escrita de conhecimentos específicos de Inglês Provas Especialmente Adequadas Destinadas a Avaliar a Capacidade para a Frequência dos Cursos Superiores do Instituto Politécnico de Leiria dos Maiores de 23 Anos - 2012 Instruções gerais Prova escrita

More information

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries Aida Mustapha *1, Farhana M. Fadzil #2 * Faculty of Computer Science and Information Technology, Universiti Tun Hussein

More information

Keywords Phishing Attack, phishing Email, Fraud, Identity Theft

Keywords Phishing Attack, phishing Email, Fraud, Identity Theft Volume 3, Issue 7, July 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Detection Phishing

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

Technical Presentations. Arian Pasquali, FEUP, REACTION Data Collection Plataform David Batista, INESC-ID, Sematic Relations Extraction REACTION

Technical Presentations. Arian Pasquali, FEUP, REACTION Data Collection Plataform David Batista, INESC-ID, Sematic Relations Extraction REACTION Agenda 11:30 Welcome + Quick progress report and status summary 11:45 Task leaders summarize ongoing activities (10 min each max) 12:30 Break. 14:00 Technical Presentations 15:00 Break 16:00 Short Technical

More information

Dissecting the Learning Behaviors in Hacker Forums

Dissecting the Learning Behaviors in Hacker Forums Dissecting the Learning Behaviors in Hacker Forums Alex Tsang Xiong Zhang Wei Thoo Yue Department of Information Systems, City University of Hong Kong, Hong Kong inuki.zx@gmail.com, xionzhang3@student.cityu.edu.hk,

More information

VERBATIM Automatic Extraction of Quotes and Topics from News Feeds

VERBATIM Automatic Extraction of Quotes and Topics from News Feeds VERBATIM Automatic Extraction of Quotes and Topics from News Feeds Luis Sarmento e Sérgio Nunes 4th Doctoral Symposium on Informatics Engineering Porto, Portugal, on February 5 6, 2009. Verbatim: Motivation

More information

EuroRec Repository. Translation Manual. January 2012

EuroRec Repository. Translation Manual. January 2012 EuroRec Repository Translation Manual January 2012 Added to Deliverable D6.3 for the EHR-Q TN project EuroRec Repository Translations Manual January 2012 1/21 Table of Content 1 Property of the document...

More information

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

MULTIDIMENSÃO E TERRITÓRIOS DE RISCO

MULTIDIMENSÃO E TERRITÓRIOS DE RISCO MULTIDIMENSÃO E TERRITÓRIOS DE RISCO III Congresso Internacional I Simpósio Ibero-Americano VIII Encontro Nacional de Riscos Guimarães 2014 MULTIDIMENSÃO E TERRITÓRIOS DE RISCO III Congresso Internacional

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Clustering of Documents for Forensic Analysis

Clustering of Documents for Forensic Analysis Clustering of Documents for Forensic Analysis Asst. Prof. Mrs. Mugdha Kirkire #1, Stanley George #2,RanaYogeeta #3,Vivek Shukla #4, Kumari Pinky #5 #1 GHRCEM, Wagholi, Pune,9975101287. #2,GHRCEM, Wagholi,

More information

Learning Similarity Metrics for Event Identification in Social Media

Learning Similarity Metrics for Event Identification in Social Media Learning Similarity Metrics for Event Identification in Social Media Hila Becker Columbia University hila@cs.columbia.edu Mor Naaman Rutgers University mor@rutgers.edu Luis Gravano Columbia University

More information

GUIDELINES AND FORMAT SPECIFICATIONS FOR PROPOSALS, THESES, AND DISSERTATIONS

GUIDELINES AND FORMAT SPECIFICATIONS FOR PROPOSALS, THESES, AND DISSERTATIONS UNIVERSIDADE FEDERAL DE SANTA CATARINA CENTRO DE COMUNICAÇÃO E EXPRESSÃO PÓS-GRADUAÇÃO EM INGLÊS: ESTUDOS LINGUÍSTICOS E LITERÁRIOS GUIDELINES AND FORMAT SPECIFICATIONS FOR PROPOSALS, THESES, AND DISSERTATIONS

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information