Connecting the dots between


1 Connecting the dots between Research Team: Carla Abreu, Jorge Teixeira, Prof. Eugénio Oliveira Domain: News Research Keywords: Natural Language Processing, Information Extraction, Machine Learning.

2 Objective "larger and larger amounts of news content is published every day. With this much data, it is often easy to miss the big picture." (Shahaf and Guestrin, 2010) Objective: automatically aggregate similar news and build news chains. Reference: (Shahaf and Guestrin, 2010): Connecting the Dots Between News Articles

3 How to do this? Steps: Similarity; Keywords Extraction; News group; News group / Keywords; Arch; News chains

4 Similarity Aim: cluster similar news. Challenges: Which news data are important for the similarity process? How can we use that data? Which methods can we use in this process? How can we evaluate this process?

5 Similarity Filter (examples of items filtered out): Press review: highlights of "O Jogo". Newspapers of the day. Mourinho says his Brazilians played very well. They tried to embarrass him with the 6-2 thrashing suffered by Portugal. Press review: highlights of "Jornal de Notícias". Newspapers of the day. Government pressures school boards. Ministry considers evaluating executive councils through the public-sector system. Normalization: remove punctuation marks; remove patterns; remove stop-words (Snowball); stem words (PTStemmer)
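The normalization steps on this slide can be sketched as follows. This is a minimal illustration, not the team's implementation: the stop-word list and the boilerplate pattern are hypothetical stand-ins (the slide names the Snowball stop-word list and PTStemmer), and the stemmer is left as a pluggable function that defaults to the identity.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the slide refers
# to the (much larger) Snowball Portuguese list.
STOP_WORDS = {"a", "o", "as", "os", "de", "do", "da", "dos", "das", "e", "que", "com"}

# Hypothetical boilerplate pattern, modelled on the filtered examples above.
PATTERNS = [re.compile(r"revista de imprensa:?", re.IGNORECASE)]


def normalize(text, stem=lambda word: word):
    """Return the normalized tokens of a title, teaser or content field."""
    # 1. Remove known boilerplate patterns.
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    # 2. Lowercase and replace punctuation marks with spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # 3. Drop stop-words.
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    # 4. Stem what is left (identity unless a real stemmer is passed in).
    return [stem(token) for token in tokens]


print(normalize('Revista de imprensa: destaques de "O Jogo"'))
```

A real stemmer can be supplied through the `stem` argument without changing the rest of the function.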

6 Similarity News comparison: Similarity: Title - ST*; Teaser (S) - STe*; Content - SC*. Temporal window: T. * Values between 0 and 1

7 Similarity First approach: Similar Tree (manual threshold assignment; empirical values). Second approach: classification methods (provided by scikit-learn; automatic approach): Decision Tree; Support Vector Classifier (SVC); SVC Linear; Random Forest; Gaussian
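To make the first approach concrete, a hand-thresholded tree over the three similarity scores might look like the sketch below. The tree shape and every threshold value are assumptions chosen for illustration; the slides say only that the thresholds were assigned manually from empirical values.

```python
def similar_tree(st, ste, sc, title_min=0.8, teaser_min=0.6, content_min=0.5):
    """Decide whether two news items are similar.

    st, ste, sc are the title, teaser and content similarities in [0, 1].
    The default thresholds are hypothetical, not the values used in the study.
    """
    # A very close title match is accepted on its own.
    if st >= title_min:
        return True
    # Otherwise require both the teaser and the content to agree.
    if ste >= teaser_min:
        return sc >= content_min
    return False


print(similar_tree(0.9, 0.1, 0.1))  # title alone is enough
print(similar_tree(0.5, 0.7, 0.1))  # teaser matches but content does not
```

The second approach replaces these hand-written rules with a classifier trained on the same three features.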

8 Similarity Features: Title Similarity; Teaser Similarity; Content Similarity. Variables: S = 0.2; T = 1. Algorithm: Levenshtein. Stemmer: Porter Stemmer
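A Levenshtein-based similarity with values between 0 and 1 can be sketched as below. Dividing the edit distance by the length of the longer string is an assumed normalization; the slides do not say how the distance was mapped to the 0 to 1 range.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("governo pressiona escolas", "governo pressiona escola"))
```

The same function can be applied to the normalized title, teaser and content to obtain the three features.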

9 Similarity Dataset: 3 million Portuguese news items published between 2008 and 2013. Training set: select 100 news items from each day (between 23 Dec 2012 and 22 Jan 2013); annotate 371 randomly chosen comparisons. Test set: TS1: select 501 distinct news items from 19 Nov; annotate 5101 randomly chosen comparisons. TS2: select 210 distinct news items from 19 Nov; annotate 1047 randomly chosen comparisons

10 Similarity Annotation Interface

11 Similarity Experimental Setup
Precision (P): $P = \frac{TP}{TP + FP}$
Recall (R): $R = \frac{TP}{TP + FN}$
Accuracy (A): $A = \frac{TP + TN}{TP + TN + FP + FN}$
F-measure (F): $F = \frac{2 \cdot P \cdot R}{P + R}$
True Positives (TP): number of similar news items correctly identified; False Positives (FP): number of non-similar news items identified as similar; True Negatives (TN): number of non-similar news items correctly identified; False Negatives (FN): number of similar news items identified as non-similar.
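The four measures follow directly from the confusion counts. The helper below is a small sketch; returning 0.0 when a denominator is zero is a convention chosen here, and the counts in the usage line are made up for illustration rather than taken from the study.

```python
def evaluate(tp, fp, tn, fn):
    """Return precision, recall, accuracy and F-measure from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, accuracy, f_measure


# Illustrative counts only.
print(evaluate(tp=8, fp=2, tn=85, fn=5))
```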

12 Similarity Results and Analyses
Method        P      R      A      F
DecisionTree  0.958  0.932  0.985  0.945
SVC           0.993  0.963  0.994  0.978
SVC Linear    0.991  0.963  0.994  0.977
RandomForest  0.987  0.960  0.993  0.974
Gaussian      0.701  0.964  0.956  0.812
Similar Tree  0.999  0.839  0.974  0.912
RandomForest: random behaviour. Gaussian: worst performance. The SVCs' results are better than the Decision Tree's in all metrics. The SVCs have similar results. SVC: best combination of evaluation metrics.

13 News Group

14 News Group News from 2014 (3 April to 20 June): Number of news: Cluster number: Average amount of news per cluster: ~3.7. March 2014: Number of news: Number of news in news groups: 8278

15 Keywords extraction Aim: extract relevant terms from text. Challenges: Can any word be considered a keyword? Can a news item be described by a simple word, a compound word, or an entity? How can we extract useful keywords from the news?

16 Keywords extraction Approach: explicit keywords, either simple (uni-grams) or compound (n-grams); implicit keywords; entities. Example keywords shown on the slide: Governo; rebeldes; busca; competição; atentado à bomba; avião da Malaysia Airlines; fase de grupos; Bagdade; Malásia; Rui Patrício; Tribunal Constitucional; Presidente República


17 Keywords extraction Explicit Keywords: POS tagger (Pablo Gamallo) [n-grams]. Normalization: remove patterns; stemmer [uni-grams]. Term frequency - inverse document frequency (TF-IDF): o(w, DOC): number of occurrences of WORD in DOCUMENT; npalavras(doc): number of words in DOCUMENT; docs(ALL): number of documents in the document collection; docs(w, ALL): number of documents in the document collection which contain WORD
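The four quantities defined on this slide are the ingredients of the usual TF-IDF weight. The sketch below combines them in the textbook form, tf = o(w, DOC) / npalavras(doc) and idf = log(docs(ALL) / docs(w, ALL)); the exact formula on the slide did not survive extraction, so the natural logarithm and the lack of smoothing are assumptions. Documents are represented as lists of tokens.

```python
import math


def tf_idf(word, doc, collection):
    """TF-IDF weight of `word` in `doc`, relative to `collection`.

    doc is a list of tokens; collection is a list of such documents.
    """
    occurrences = doc.count(word)                                   # o(w, DOC)
    n_words = len(doc)                                              # npalavras(doc)
    n_docs = len(collection)                                        # docs(ALL)
    n_docs_with_word = sum(1 for d in collection if word in d)      # docs(w, ALL)
    if n_words == 0 or n_docs_with_word == 0:
        return 0.0
    return (occurrences / n_words) * math.log(n_docs / n_docs_with_word)


collection = [
    ["governo", "escolas", "governo"],
    ["jogo", "governo"],
    ["jogo", "mourinho"],
]
print(tf_idf("governo", collection[0], collection))
print(tf_idf("mourinho", collection[2], collection))
```

A word that appears in every document gets weight 0 under this form, which is why very common terms drop out of the keyword ranking.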

18 Keywords extraction Implicit Keywords: normalization; relation between words (Ventura, Silva 2013). Corr(A, B) is based on Pearson's correlation coefficient; |D| is the number of documents in corpus D; di is the i-th document in D; size(di) is its number of words and f(A, di) is the frequency of term A in di. Corr(A, B) ranges from -1 (strong negative correlation) through 0 (no correlation) to +1 (strong positive correlation). Reference: (Ventura, Silva 2013): Automatic Extraction of Explicit and Implicit Keywords to Build Document Descriptors
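One way to read the definitions above is as a Pearson correlation between the relative frequencies f(A, di) / size(di) and f(B, di) / size(di) across the documents of the corpus. The sketch below implements that reading; it is an assumption based on the symbols listed on the slide, since the formula itself is not in the extracted text, and it returns 0.0 when either term has zero variance.

```python
import math


def corr(term_a, term_b, docs):
    """Pearson correlation of two terms' relative frequencies over `docs`.

    docs is a list of token lists; empty documents are ignored.
    """
    docs = [d for d in docs if d]
    n = len(docs)
    if n < 2:
        return 0.0
    # Relative frequency of each term in each document: f(term, di) / size(di).
    p_a = [d.count(term_a) / len(d) for d in docs]
    p_b = [d.count(term_b) / len(d) for d in docs]
    mean_a, mean_b = sum(p_a) / n, sum(p_b) / n
    cov_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(p_a, p_b)) / (n - 1)
    var_a = sum((x - mean_a) ** 2 for x in p_a) / (n - 1)
    var_b = sum((y - mean_b) ** 2 for y in p_b) / (n - 1)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov_ab / math.sqrt(var_a * var_b)


docs = [["presidente", "república"],
        ["presidente", "república", "jogo", "jogo"],
        ["jogo", "jogo"]]
print(corr("presidente", "república", docs))
```

Terms that tend to rise and fall together across documents score close to +1, which is what lets a term act as an implicit keyword for documents that never mention it.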

19 Keywords extraction Entities: find entities. Example text: The average age of the interviewees was 11 at the start of the study, and three quarters of them were boys. Young people who play video games are more likely to think and act aggressively, according to a study of more than students in Singapore released today. The study, published by the journal of the American Medical Association and based on three years of work with young people, concluded from the students' answers that there was a link between frequent use of video games and high rates of aggressive behaviour and thoughts.

20 Keywords extraction Dataset: 4789 news articles from January to December 2012. Test set: select one day from each month of 2012; select three hours of each day; extract keywords; select 10 news items from each day; check the keywords manually


21 Keywords extraction Experimental Setup: PalavrasChaveRepresentativas (representative keywords): number of keywords that represent the news item; PalavrasChaveAtribuídas (assigned keywords): number of keywords assigned to the news item; N: number of news items.
Results
Keyword type         Evaluation
Explicit - Simple    0.732
Explicit - Compound  0.762
Implicit             ~0
Entity               0.804
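The three quantities in this setup suggest a per-item ratio averaged over the N news items. The sketch below assumes exactly that: score = (1 / N) * sum(representative / assigned). The formula itself is not in the extracted slide, so both the averaging and the choice to count an item with no assigned keywords as 0 are assumptions.

```python
def keyword_score(per_news):
    """Average share of assigned keywords that were judged representative.

    per_news is a list of (representative, assigned) counts, one pair per
    news item. Items with no assigned keywords contribute 0 to the average.
    """
    n = len(per_news)
    if n == 0:
        return 0.0
    total = 0.0
    for representative, assigned in per_news:
        if assigned > 0:
            total += representative / assigned
    return total / n


# Illustrative counts only: 3 of 4 and 1 of 2 keywords judged representative.
print(keyword_score([(3, 4), (1, 2)]))
```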

22 News Group / Keywords Aim: associate keywords with news groups according to their weight

23 Arch Aim: Connect groups of news Challenges: How can we aggregate news clusters? What fields need to be considered?

24 Arch Approach (explicit simple keywords, entities and personalities). Normalization: lowercase; for explicit simple keywords, reduce words to their stem. Find personalities: from entities and explicit compound keywords, using Verbetes. Distance: ka: number of words in news group a; kb: number of words in news group b; Wkja: weight of word j in news group a; Wkib: weight of word i in news group b; D1 and D2 range from 0 to 1

25 Arch Approach (explicit compound keywords). Normalization: lowercase; remove stop-words. All words have the same weight. Distance: edit distance algorithm, q-grams, q = 3
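A q-gram distance with q = 3 can be sketched as below. It uses one common definition, the sum of absolute differences between the two strings' q-gram counts; the slides do not say which variant or normalization was used, so treat this as an assumed instance of the technique.

```python
from collections import Counter


def qgrams(text, q=3):
    """Count every substring of length q in `text`."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=3):
    """Sum of absolute differences between the q-gram counts of a and b."""
    profile_a, profile_b = qgrams(a, q), qgrams(b, q)
    # Counter returns 0 for q-grams that are missing from one profile.
    return sum(abs(profile_a[g] - profile_b[g])
               for g in set(profile_a) | set(profile_b))


print(qgram_distance("tribunal constitucional", "tribunal constitucional"))
print(qgram_distance("tribunal constitucional", "tribunal de contas"))
```

Strings shorter than q have no q-grams, so very short keywords would need separate handling.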

26 Arch Gold standard: 1408 news items (January 2012); 131 groups of news. Training set: 5671 comparisons between groups of news; 277 connections; 5394 non-connections. Test set: 300 comparisons between groups of news; 26 connections; 247 non-connections

27 Arch Experiments 1. 6 Experiments: metrics to calculate distance (D1 and D2). Experiments: constraints on comparisons: number of entities; number of personalities; similarity between explicit simple keywords

28 Arch Experimental Setup
Precision (P): $P = \frac{TP}{TP + FP}$
Recall (R): $R = \frac{TP}{TP + FN}$
True Positives (TP): number of connections correctly identified; False Positives (FP): number of non-connections identified as connections; True Negatives (TN): number of non-connections correctly identified; False Negatives (FN): number of connections identified as non-connections.

29 Arch Results and Analyses Experiments. Metrics: a. Explicit simple keyword: D1; b. Personalities: D1; c. Entities: D2. Constraints: a. Entities >= 3; b. Explicit simple keyword similarity >= 0.2. Best result: Gaussian; Precision 0.941; Recall 0.308

30 News Chains

31 Thanks! Carla Abreu. Acknowledgement: Bruno Tavares. Connecting the dots between news
