Connecting the dots between
- Kristian Bridges
- 8 years ago
Transcription
1 Connecting the dots between news. Research Team: Carla Abreu, Jorge Teixeira, Prof. Eugénio Oliveira. Domain: News. Research Keywords: Natural Language Processing, Information Extraction, Machine Learning.
2 Objective. "... larger and larger amounts of news content is published every day. With this much data, it is often easy to miss the big picture." (Shahaf and Guestrin, 2010). Objective: automatically aggregate similar news and build news chains. Reference: Shahaf and Guestrin (2010), Connecting the Dots Between News Articles.
3 How to do this? Pipeline stages: Similarity; Keywords Extraction; News group; News group / Keywords; Arch; News chains.
4 Similarity. Aim: clustering similar news. Challenges: What news data are important for the similarity process? How can we use that data? Which methods can we use in this process? How can we evaluate this process?
5 Similarity. Filter (examples of items filtered out): Press review: highlights from "O Jogo". Today's newspapers. Mourinho says his Brazilians played very well. They tried to embarrass him with the 6-2 thrashing suffered by Portugal. Press review: highlights from "Jornal de Notícias". Today's newspapers. Government pressures school boards. Ministry considers evaluating executive boards through the public-sector system. Normalization: remove punctuation marks; remove patterns; remove stop-words (Snowball); word stemming (PTStemmer).
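The normalization steps listed on this slide can be sketched roughly as follows. This is only an illustration: the stop-word list and the suffix list are tiny hypothetical stand-ins for the Snowball stop-words and PTStemmer named on the slide, and the "remove patterns" step is left out because the slides do not say which patterns were removed.

```python
import re
import string

# Hypothetical, tiny stop-word list for illustration only; the slides use
# the Snowball stop-word list.
STOP_WORDS = {"a", "o", "as", "os", "de", "do", "da", "dos", "das", "que", "e", "com", "por"}

# Hypothetical suffix list; a crude stand-in for a real stemmer such as PTStemmer.
SUFFIXES = ("ções", "mente", "ção", "ram", "es", "os", "as", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, remove punctuation, drop stop-words, then stem each token."""
    text = text.lower()
    # Replace ASCII punctuation and curly quotes with spaces.
    text = re.sub("[" + re.escape(string.punctuation) + "“”‘’]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(normalize("Governo pressiona direcções das escolas."))
```

A real stemmer handles far more cases than this suffix stripper; the point is only the order of the steps.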
6 Similarity. News comparison. Fields compared: Title, Teaser, Content. Similarity: Title - ST*; Teaser (S) - STe*; Content - SC*. Temporal window: T. * Values between 0 and 1.
7 Similarity. First approach: Similar Tree (manual threshold assignment; empirical values). Second approach: classification methods (provided by scikit-learn; automatic approach): Decision Tree; Support Vector Classifier (SVC); SVC Linear; Random Forest; Gaussian.
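A manually thresholded tree of the kind described as the first approach might look like the sketch below. The slides give neither the thresholds nor the shape of the tree, so every threshold value and the order of the tests here are hypothetical placeholders, not the authors' Similar Tree.

```python
def similar_tree(st, ste, sc, t_title=0.8, t_teaser=0.6, t_content=0.5):
    """Decide whether two news items are similar from three similarity scores.

    st, ste, sc: title, teaser and content similarity, each in [0, 1].
    t_title, t_teaser, t_content: hypothetical thresholds chosen by hand.
    """
    # Very similar titles are accepted directly.
    if st >= t_title:
        return True
    # Otherwise require a similar teaser backed by similar content.
    if ste >= t_teaser:
        return sc >= t_content
    return False


print(similar_tree(0.9, 0.0, 0.0))
print(similar_tree(0.5, 0.7, 0.1))
```

The second approach replaces these hand-picked thresholds with a classifier trained on the annotated comparisons.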
8 Similarity. Features: Title Similarity; Teaser Similarity; Content Similarity. Variables: S = 0.2; T = 1. Algorithm: Levenshtein. Stemmer: Porter Stemmer.
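To make the features concrete, here is a minimal sketch of a Levenshtein-based similarity scaled to the 0 to 1 range. Dividing the edit distance by the length of the longer string is an assumption; the slides do not say how the distance was normalized.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("governo pressiona escolas", "governo pressiona escola"))
```

The same function can be applied to the normalized title, teaser and content of each pair to obtain the three features.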
9 Similarity. Dataset: 3 million Portuguese news articles published between 2008 and 2013. Training set: select 100 news items from each day (between 23 Dec 2012 and 22 Jan 2013); randomly annotate 371 comparisons. Test sets: TS1: select 501 distinct news items from 19 Nov; randomly annotate 5101 comparisons. TS2: select 210 distinct news items from 19 Nov; randomly annotate 1047 comparisons.
10 Similarity Annotation Interface
11 Similarity. Experimental Setup. Precision (P): $P = \frac{TP}{TP + FP}$. Recall (R): $R = \frac{TP}{TP + FN}$. Accuracy (A): $A = \frac{TP + TN}{TP + TN + FP + FN}$. F-measure (F): $F = \frac{2 \cdot P \cdot R}{P + R}$. True Positives (TP): number of similar news pairs correctly identified; False Positives (FP): number of non-similar news pairs identified as similar; True Negatives (TN): number of non-similar news pairs correctly identified; False Negatives (FN): number of similar news pairs identified as non-similar.
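The four metrics can be computed directly from the confusion counts. The counts in the example call are invented for illustration and do not come from the experiments.

```python
def evaluate(tp, fp, tn, fn):
    """Return (precision, recall, accuracy, f_measure) from confusion counts.

    Assumes tp + fp > 0, tp + fn > 0 and tp > 0, so no division by zero occurs.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_measure


# Hypothetical counts, for illustration only.
p, r, a, f = evaluate(tp=80, fp=20, tn=880, fn=20)
print(p, r, a, f)
```

Because similar pairs are rare among all comparisons, accuracy is dominated by true negatives, which is why precision, recall and F are reported alongside it.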
12 Similarity. Results and Analyses.
Method         P      R      A      F
DecisionTree   0.958  0.932  0.985  0.945
SVC            0.993  0.963  0.994  0.978
SVC Linear     0.991  0.963  0.994  0.977
RandomForest   0.987  0.960  0.993  0.974
Gaussian       0.701  0.964  0.956  0.812
Similar Tree   0.999  0.839  0.974  0.912
Observations: RandomForest: random behaviour. Gaussian: worst overall performance (lowest P, A and F, despite the highest R). The SVC results are better than Decision Tree in all metrics. The two SVC variants have similar results. SVC: best combination of evaluation metrics.
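As a quick arithmetic check, the F column can be recomputed from the reported P and R values using F = 2PR / (P + R). The numbers below are copied from the results on this slide; the 0.001 tolerance is an assumption that allows for the rounding of P and R to three decimals.

```python
# (precision, recall, reported F) as listed on the results slide.
RESULTS = {
    "DecisionTree": (0.958, 0.932, 0.945),
    "SVC": (0.993, 0.963, 0.978),
    "SVC Linear": (0.991, 0.963, 0.977),
    "RandomForest": (0.987, 0.960, 0.974),
    "Gaussian": (0.701, 0.964, 0.812),
    "Similar Tree": (0.999, 0.839, 0.912),
}


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


for name, (p, r, reported) in RESULTS.items():
    recomputed = f_measure(p, r)
    # Report whether the recomputed value agrees with the slide within 0.001.
    print(name, round(recomputed, 3), reported, abs(recomputed - reported) < 0.001)
```

Every row agrees within that tolerance, so the reported F values are consistent with the reported P and R.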
13 News Group
14 News Group News 2014 (3 April to 20 June) Number of news: Cluster number: Average amount of news per cluster: ~ 3,7 March 2014, Number of news: Number of news in news group: 8278
15 Keywords extraction. Aim: extract relevant terms from text. Challenges: Can any word be considered a keyword? Can a news item be described by a simple word, a compound word, or an entity? How can we extract useful keywords from the news?
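16 Keywords extraction. Approach. Explicit keywords: simple (uni-grams) and compound (n-grams). Implicit keywords. Entities. Example keywords from the slide: Governo (government), rebeldes (rebels), busca (search), competição (competition), atentado à bomba (bomb attack), avião da Malaysia Airlines (Malaysia Airlines plane), fase de grupos (group stage), Bagdade (Baghdad), Malásia (Malaysia), Rui Patrício, Tribunal Constitucional (Constitutional Court), Presidente República (President of the Republic).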
17 Keywords extraction. Explicit Keywords. POS tagger (Pablo Gamallo) [n-grams]. Normalization: remove patterns; stemmer [uni-grams]. Term frequency - inverse document frequency (TF-IDF), where o(w, doc) is the number of occurrences of word w in document doc; npalavras(doc) is the number of words in doc; docs(all) is the number of documents in the document collection; docs(w, all) is the number of documents in the collection which contain w.
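The TF-IDF formula itself did not survive in the transcript. The sketch below uses the common textbook form, (o(w, doc) / npalavras(doc)) * log(docs(all) / docs(w, all)), with the natural logarithm; both the exact form and the log base are assumptions. Documents are represented as lists of tokens, and the sample collection is invented.

```python
import math


def tf_idf(word, doc, collection):
    """TF-IDF of `word` in `doc`, where `collection` is a list of token lists."""
    o = doc.count(word)                                 # o(w, doc)
    npalavras = len(doc)                                # npalavras(doc)
    docs_all = len(collection)                          # docs(all)
    docs_w = sum(1 for d in collection if word in d)    # docs(w, all)
    if o == 0 or docs_w == 0:
        return 0.0
    return (o / npalavras) * math.log(docs_all / docs_w)


# Hypothetical three-document collection, for illustration only.
collection = [
    ["governo", "escolas", "governo"],
    ["mourinho", "portugal"],
    ["governo", "portugal"],
]
print(tf_idf("escolas", collection[0], collection))
print(tf_idf("governo", collection[0], collection))
```

In this toy collection "escolas" scores higher than "governo" in the first document even though it occurs less often, because it appears in only one document.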
18 Keywords extraction. Implicit Keywords. Normalization. Relation between words (Ventura, Silva 2013): Corr(A, B) is based on Pearson's correlation coefficient; |D| is the number of documents in corpus D; di is the i-th document in D; size(di) is its number of words and f(A, di) is the frequency of term A in di. Corr(A, B) ranges from -1 (strong negative correlation) to +1 (strong positive correlation); values near 0 indicate no correlation. Reference: Ventura and Silva (2013), Automatic Extraction of Explicit and Implicit Keywords to Build Document Descriptors.
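The correlation formula is missing from the transcript, so the following is a sketch under stated assumptions: a Pearson correlation over the relative frequencies f(A, di) / size(di) across the documents, using the sample covariance. The helper names `rel_freq`, `cov` and `corr` and the sample documents are hypothetical.

```python
import math


def rel_freq(term, doc):
    """Relative frequency of `term` in `doc`: f(term, di) / size(di)."""
    return doc.count(term) / len(doc)


def cov(a, b, docs):
    """Sample covariance of the relative frequencies of terms a and b over docs."""
    pa = [rel_freq(a, d) for d in docs]
    pb = [rel_freq(b, d) for d in docs]
    mean_a = sum(pa) / len(docs)
    mean_b = sum(pb) / len(docs)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(pa, pb)) / (len(docs) - 1)


def corr(a, b, docs):
    """Pearson-style correlation in [-1, 1]; 0.0 when either term has no variance."""
    denom = math.sqrt(cov(a, a, docs)) * math.sqrt(cov(b, b, docs))
    if denom == 0:
        return 0.0
    return cov(a, b, docs) / denom


# Hypothetical documents (non-empty token lists), for illustration only.
together = [["a", "b", "x", "x"], ["x", "x", "x", "x"], ["a", "a", "b", "b"]]
apart = [["a", "x"], ["b", "x"]]
print(corr("a", "b", together))
print(corr("a", "b", apart))
```

Terms that rise and fall together across documents score near +1, which is what lets a term be proposed as an implicit keyword for a document in which it does not literally occur.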
19 Keywords extraction. Entities. Find entities (example text): The average age of the respondents was 11 at the start of the study, and three quarters of them were boys. Young people who play video games are more prone to think and act aggressively, according to a study of more than [...] students in Singapore released today. The study, published by the journal of the American Medical Association and based on three years of work with young people, concluded from the students' answers that there was a link between frequent use of video games and high rates of aggressive behaviour and thoughts.
20 Keywords extraction. Dataset: 4789 news articles from January to December 2012. Test set: select one day from each month of 2012; select three hours of each day; extract keywords; select 10 news items from each day; check the keywords manually.
21 Keywords extraction. Experimental Setup. PalavrasChaveRepresentativas (representative keywords): number of words that represent the news item; PalavrasChaveAtribuídas (attributed keywords): number of words attributed to the news item; N: number of news items. Results:
Keyword type          Evaluation
Explicit - Simple     0.732
Explicit - Compound   0.762
Implicit              ~0
Entity                0.804
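The evaluation formula is not preserved in the transcript. One plausible reading of the three variables is the average, over the N news items, of PalavrasChaveRepresentativas / PalavrasChaveAtribuídas; the sketch below implements that reading as an assumption, with invented counts.

```python
def keyword_score(per_news):
    """Average ratio of representative to attributed keywords.

    per_news: list of (representative, attributed) pairs, one per news item,
    with attributed > 0. The averaging scheme is an assumed reading of the slide.
    """
    n = len(per_news)
    return sum(rep / attr for rep, attr in per_news) / n


# Hypothetical manual checks for three news items.
checks = [(3, 4), (2, 2), (1, 4)]
print(keyword_score(checks))
```

Under this reading, a score of 0.804 for entities would mean that about four in five attributed entities were judged representative.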
22 News Group / Keywords. Aim: associate keywords with news groups according to their weight.
23 Arch Aim: Connect groups of news Challenges: How can we aggregate news clusters? What fields need to be considered?
24 Arch. Approach (explicit simple keywords, entities and personalities). Normalization: lowercase; explicit simple keywords: reduce words to their stem. Find personalities: from entities and explicit compound keywords, using Verbetes. Distance: ka: number of words in news group a; kb: number of words in news group b; Wkja: weight of word j in news group a; Wkib: weight of word i in news group b; D1 and D2 range from 0 to 1.
25 Arch. Approach (explicit compound keywords). Normalization: lowercase; remove stop-words. All words have the same weight. Distance: edit distance algorithm; q-grams; q = 3.
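A q-gram distance with q = 3 can be sketched as the sum of absolute differences between the 3-gram counts of two strings. The slides name the technique but not its exact variant, so this particular definition, and the lowercasing inside the helper, are assumptions.

```python
from collections import Counter


def qgrams(text, q=3):
    """Count the overlapping q-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=3):
    """Sum of absolute differences between the q-gram counts of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ga[g] - gb[g]) for g in set(ga) | set(gb))


print(qgram_distance("Tribunal Constitucional", "tribunal constitucional"))
print(qgram_distance("abcd", "abce"))
```

Strings shorter than q produce no q-grams, so very short keywords all end up at distance 0 from each other; a real implementation would need padding or a fallback for that case.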
26 Arch. Gold standard: 1408 news items (January 2012); 131 groups of news. Training set: 5671 comparisons between groups of news (277 connections; 5394 non-connections). Test set: 300 comparisons between groups of news (26 connections; 247 non-connections).
27 Arch. Experiments. Experiments 1 to 6: metrics used to calculate distance (D1 and D2). Further experiments: constraints on comparisons: number of entities; number of personalities; similarity between explicit simple keywords.
28 Arch. Experimental Setup. Precision (P): $P = \frac{TP}{TP + FP}$. Recall (R): $R = \frac{TP}{TP + FN}$. True Positives (TP): number of connections correctly identified; False Positives (FP): number of non-connections identified as connections; True Negatives (TN): number of non-connections correctly identified; False Negatives (FN): number of connections identified as non-connections.
29 Arch. Results and Analyses. Experiments. Metrics: a. Explicit simple keyword: D1; b. Personalities: D1; c. Entities: D2. Constraints: a. Entities >= 3; b. Explicit simple keyword similarity >= 0.2. Best result (Gaussian): Precision 0.941; Recall 0.308.
30 News Chains
31 Thanks! Carla Abreu. Acknowledgement: Bruno Tavares. Connecting the dots between news.
IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons
More informationSVM Ensemble Model for Investment Prediction
19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of
More informationMULTIDIMENSÃO E TERRITÓRIOS DE RISCO
MULTIDIMENSÃO E TERRITÓRIOS DE RISCO III Congresso Internacional I Simpósio Ibero-Americano VIII Encontro Nacional de Riscos Guimarães 2014 MULTIDIMENSÃO E TERRITÓRIOS DE RISCO III Congresso Internacional
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationClustering of Documents for Forensic Analysis
Clustering of Documents for Forensic Analysis Asst. Prof. Mrs. Mugdha Kirkire #1, Stanley George #2,RanaYogeeta #3,Vivek Shukla #4, Kumari Pinky #5 #1 GHRCEM, Wagholi, Pune,9975101287. #2,GHRCEM, Wagholi,
More informationLearning Similarity Metrics for Event Identification in Social Media
Learning Similarity Metrics for Event Identification in Social Media Hila Becker Columbia University hila@cs.columbia.edu Mor Naaman Rutgers University mor@rutgers.edu Luis Gravano Columbia University
More informationGUIDELINES AND FORMAT SPECIFICATIONS FOR PROPOSALS, THESES, AND DISSERTATIONS
UNIVERSIDADE FEDERAL DE SANTA CATARINA CENTRO DE COMUNICAÇÃO E EXPRESSÃO PÓS-GRADUAÇÃO EM INGLÊS: ESTUDOS LINGUÍSTICOS E LITERÁRIOS GUIDELINES AND FORMAT SPECIFICATIONS FOR PROPOSALS, THESES, AND DISSERTATIONS
More informationArtificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
More information