VERBATIM: Automatic Extraction of Quotes and Topics from News Feeds
Luís Sarmento and Sérgio Nunes
4th Doctoral Symposium on Informatics Engineering, Porto, Portugal, February 5-6, 2009
Verbatim: Motivation
- The growth in information production poses increasing challenges to consumers (information overflow), creating demand for tools that act as personal information butlers
- Verbatim acquires information from live news feeds, extracts quotes and topics, and presents them in a web interface
- It can act as an automatic watchdog by confronting quotes by the same entities on the same topics over time
Related Work (I)
- NewsExplorer [6] extracts quotations in multilingual news: it extracts the quote, the name of the entity making it, and the other entities mentioned
- Krestel et al. [5] describe the development of a reported-speech extension to the GATE framework, for English
- In [4], the authors propose the TF*PDF algorithm for extracting terms to be used as descriptive tags; however, most such tags are quite uninformative and inappropriate as high-level topic tags
Related Work (II)
- In Quotes, from Google, presents a web-based interface structured in issues (i.e. topics) and displays side-by-side quotes from two actors at a time; however, no implementation details are known
- Our work is different: it is focused on a single language (Portuguese), and it addresses the problem of topic extraction and distillation, while most related work assumes that news topics have been previously identified
System Overview
- Data Acquisition and Parsing
- Quote Extraction
- Removal of Duplicates
- Topic Classification: Topic Identification + Generation of Training Set; Training the Topic Classifiers; Topic Classification Procedure
- Web Interface
- Update Routines
Data Acquisition and Parsing
- News gathering uses a fixed number of data feeds from major Portuguese mainstream media sources; only generic mainstream sources were included in this initial selection
- This avoids the major challenges faced in web crawling
- Content decoding routines were customized for each individual source
- Fetching is performed periodically, every hour, on all sources
- Content is stored on the server in UTF-8 encoding
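The slides do not name the tools used for fetching; purely as an illustration, the hourly fetch step could look like the sketch below (the feedparser library and the FEED_URLS list are assumptions, not the authors' implementation).

```python
# Illustrative sketch only: library choice and feed URLs are assumptions.
import feedparser

FEED_URLS = [
    "http://news-source-1.example.pt/rss",  # hypothetical feed URLs
    "http://news-source-2.example.pt/rss",
]

def fetch_items(urls=FEED_URLS):
    """Fetch all configured feeds once; called every hour by the update routine."""
    items = []
    for url in urls:
        feed = feedparser.parse(url)            # per-source decoding handled here
        for entry in feed.entries:
            items.append({
                "source": url,
                "title": entry.get("title", ""),
                "body": entry.get("summary", ""),
            })
    return items                                # stored downstream as UTF-8 text
```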
Quote Extraction
- There is a large variety of ways in which quotes can be expressed; we only address quotes that explicitly mention the name of the speaker, to avoid anaphora resolution
- More specifically, we look for sentences in the body of the news feed that match the pattern: Name of Speaker, Optional Ergonym, Speech-Act, Modifier, Quote
- Example: "O Primeiro-ministro, José Sócrates, anunciou esta terça-feira que o Itinerário Principal 4 (IP4), que liga Vila Real a Bragança, será transformado em auto-estrada daqui a três anos..." (The Prime Minister, José Sócrates, announced this Tuesday that the Itinerário Principal 4 (IP4), which links Vila Real to Bragança, will be turned into a motorway three years from now...)
- 19 matching patterns and 35 speech-act verbs, matching roughly 5% of news items
- Recall is low at this stage, but precision is high
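As a concrete illustration of this kind of pattern, the sketch below shows one regular expression over a small verb list; the actual 19 patterns and 35 speech-act verbs are not listed in the slides, so everything here is an assumption.

```python
import re

# Small illustrative subset of speech-act verbs (the system uses 35).
SPEECH_ACTS = r"(?:anunciou|afirmou|disse|declarou|garantiu)"

# Name of Speaker, optional ergonym, speech-act, optional modifier, quote.
# Loose sketch: in practice either captured group may hold the name or the ergonym.
QUOTE_PATTERN = re.compile(
    r"(?P<speaker>[A-ZÀ-Ü][\w-]+(?:\s+[A-ZÀ-Ü][\w-]+)*)"   # capitalized name
    r"(?:\s*,\s*(?P<ergonym>[^,]+)\s*,)?"                   # optional ergonym
    r"\s+" + SPEECH_ACTS +                                  # speech-act verb
    r"(?P<modifier>[^,.]*?)"                                # optional modifier
    r"\s+que\s+(?P<quote>.+)"                               # quoted content
)

sentence = ("O Primeiro-ministro, José Sócrates, anunciou esta terça-feira que o "
            "Itinerário Principal 4 será transformado em auto-estrada daqui a três anos")
match = QUOTE_PATTERN.search(sentence)
if match:
    print(match.group("speaker"), "->", match.group("quote"))
```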
Removal of Duplicates (I)
- It is usual to find duplicate or near-duplicate news items, from which duplicate quotes will be extracted
- We try to aggregate the most similar quotes into quote groups Q_1, Q_2, ..., Q_last
- Each new quote, q_new, is compared with the k most recent quote groups: Q_last, Q_{last-1}, Q_{last-2}, ..., Q_{last-k+1}
- If the similarity between q_new and any of these groups is higher than a given threshold, s_min, then q_new is added to the most similar group
Removal of Duplicates (II)
- Otherwise, a new group, Q_new, is created, containing only q_new
- Comparison is made between the new quote q_new and the longest quote of each group
- First, check whether the speakers are the same
- Then, content similarity is computed: quotes are represented as vectors using a binary bag-of-words approach (stop words removed), and the vectors are compared with the Jaccard coefficient
- If the similarity is above 0.25 (s_min), the quotes are considered duplicates
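A minimal sketch of this grouping step, using the thresholds given on the slides; the value of k, the stop-word list, and the data structures are assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tokenize(text, stop_words=frozenset()):
    """Binary bag-of-words: the set of non-stop-word tokens."""
    return {w for w in text.lower().split() if w not in stop_words}

def add_quote(q_new, groups, k=50, s_min=0.25):
    """Add q_new (dict with 'speaker' and 'text') to the most similar of the
    k most recent groups, or start a new group if no similarity exceeds s_min."""
    best, best_sim = None, 0.0
    for group in groups[-k:]:                        # k most recent groups only
        if group["speaker"] != q_new["speaker"]:     # speakers must match first
            continue
        sim = jaccard(tokenize(group["longest"]), tokenize(q_new["text"]))
        if sim > best_sim:
            best, best_sim = group, sim
    if best is not None and best_sim > s_min:        # duplicate: join the group
        best["quotes"].append(q_new)
        if len(q_new["text"]) > len(best["longest"]):
            best["longest"] = q_new["text"]          # keep the longest quote
        return best
    new_group = {"speaker": q_new["speaker"], "longest": q_new["text"],
                 "quotes": [q_new]}
    groups.append(new_group)                         # otherwise, create Q_new
    return new_group
```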
Topic Classification
- Verbatim assigns a topic tag to each quote
- There is a wide variety of topics in the news, and new, unseen topics appear as more news is collected
- Efficient topic classification of news therefore requires: dynamically identifying new topic tags as they appear in the news; automatically generating a training set using the new topic tags; re-training the topic classification procedure
Identification of Topics & Generation of Training Set
- Topic tags are identified by mining a common structure in titles: "topic tag: headline"
- Example: Literatura: "A viagem do elefante", de José Saramago, tem lançamento mundial quinta-feira em São Paulo... (Literature: "The Elephant's Journey", by José Saramago, has its worldwide launch on Thursday in São Paulo...)
- From about 26,000 news items, we found 783 different topic tags (occurring in 2+ titles)
- Generation of a training set: for every tag t_i in the set of topic tags found, T, group the set of news items for that topic, I_i = (i_i1, i_i2, ..., i_in)
- We denote the complete training set by T_I
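A rough sketch of the title-mining step; the exact heuristics (tag length limits, normalization) are not given in the slides and are assumed here.

```python
import re
from collections import defaultdict

# "topic tag: headline" structure in titles, e.g.
# Literatura: "A viagem do elefante", de José Saramago, ...
TAGGED_TITLE = re.compile(r"^(?P<tag>[^:]{2,40}):\s*(?P<headline>.+)$")

def build_training_set(news_items, min_titles=2):
    """Group news items by mined topic tag; keep tags seen in at least min_titles titles."""
    by_tag = defaultdict(list)
    for item in news_items:                      # item: dict with "title" and "body"
        m = TAGGED_TITLE.match(item["title"])
        if m:
            by_tag[m.group("tag").strip()].append(item)
    # T_I: the complete training set, one list of items I_i per topic tag t_i
    return {tag: items for tag, items in by_tag.items() if len(items) >= min_titles}
```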
Training the Topic Classifiers
- Two different text classification approaches: Rocchio classification [7] and SVM [2]
- Both involve representing news items as vectors of features
- We use a bag-of-words approach for vectorizing news feed items into (word, frequency) pairs
- Information about the location of each word - title or body of the news item - is kept
- Stop words are removed
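A minimal sketch of this vectorization, assuming simple whitespace tokenization; features are keyed by (location, word) so that title and body occurrences stay distinguishable.

```python
from collections import Counter

def vectorize(item, stop_words=frozenset()):
    """Bag-of-words vector of a news item, keeping the location of each word."""
    vec = Counter()
    for field in ("title", "body"):
        for word in item[field].lower().split():
            if word not in stop_words:
                vec[(field, word)] += 1          # (location, word) -> frequency
    return vec
```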
Rocchio Classification
- Rocchio classification is a straightforward way to classify items using a nearest-neighbour strategy
- For each topic t_i of the set of T topics, we need to obtain [c_i], a vector representing the topic class
- [c_i] is computed by aggregating the vectors of the news items [i_ij] for that topic (TF-IDF weighting of features is applied)
- Classification is made by comparing [i_new] against the class descriptions of the T classes, using the cosine similarity, i.e. cos([i_new], [c_i])
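A small sketch of the Rocchio step over the vectors produced above; TF-IDF weighting is left out for brevity, so this is only an approximation of the described setup.

```python
import math
from collections import Counter

def centroid(vectors):
    """Aggregate the vectors of a topic's news items into a class vector [c_i]."""
    c = Counter()
    for v in vectors:
        c.update(v)
    return c

def cosine(a, b):
    """cos([a], [b]) over sparse dict/Counter vectors."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rocchio_classify(i_new, class_vectors):
    """Assign the class whose centroid [c_i] is closest to [i_new]."""
    return max(class_vectors, key=lambda t: cosine(i_new, class_vectors[t]))
```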
SVM Classification
- SVMs are effective for classifying items described in high-dimensional spaces, as in text classification
- SVMs are binary classifiers, so we need to train one SVM per topic t_k, using I_k as positive examples and I \ I_k as negative examples: svm_k = train_svm(I_k, I \ I_k)
- Then, for a given news item i_new: svm_k([i_new]) > 0 if i_new belongs to topic t_k; svm_k([i_new]) < 0 if i_new does not belong to topic t_k
- We used SVM-light [3] with default parameters
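The slides say SVM-light with default parameters was used; the sketch below substitutes scikit-learn's LinearSVC purely to illustrate the one-SVM-per-topic setup, so the library and feature pipeline are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_topic_svms(training_set):
    """training_set: {topic_tag: [news text, ...]}. Trains one binary SVM per topic,
    with I_k as positives and I \\ I_k as negatives."""
    texts, labels = [], []
    for tag, items in training_set.items():
        texts.extend(items)
        labels.extend([tag] * len(items))
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    svms = {}
    for tag in training_set:
        y = [1 if label == tag else -1 for label in labels]   # I_k vs. I \ I_k
        svms[tag] = LinearSVC().fit(X, y)
    return vectorizer, svms
```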
Topic Classification Procedure
- Let T = (t_1, t_2, ..., t_k) be the set of topic tags, let i_qt be the news item to classify, and let [i_qt] be its vector representation. Then:
- find svm_max, the maximum of svm_k([i_qt]) over all k, attained at k = k_svm_max
- find roc_max, the maximum of cos([c_k], [i_qt]) over all k, attained at k = k_roc_max
- if svm_max >= min_svm, then i_qt ~ t_{k_svm_max}
- else if roc_max >= min_roc, then i_qt ~ t_{k_roc_max}
- else do not classify i_qt (23% of cases)
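A sketch of the combined decision, reusing vectorize() and cosine() from the earlier sketches; the thresholds min_svm and min_roc are not given on the slides, so the defaults below are placeholders.

```python
def classify_item(item, vectorizer, svms, class_vectors, min_svm=0.0, min_roc=0.2):
    """Combine the SVM and Rocchio decisions as on this slide.
    Reuses vectorize() and cosine() from the sketches above; thresholds are placeholders."""
    text = item["title"] + " " + item["body"]
    x = vectorizer.transform([text])

    # SVM side: svm_max = max_k svm_k([i_qt])
    t_svm, svm_max = max(((t, clf.decision_function(x)[0]) for t, clf in svms.items()),
                         key=lambda pair: pair[1])

    # Rocchio side: roc_max = max_k cos([c_k], [i_qt])
    i_vec = vectorize(item)
    t_roc, roc_max = max(((t, cosine(i_vec, c)) for t, c in class_vectors.items()),
                         key=lambda pair: pair[1])

    if svm_max >= min_svm:
        return t_svm
    if roc_max >= min_roc:
        return t_roc
    return None                                   # left unclassified (~23% of cases)
```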
Web Interface (I)
Web Interface (II)
Update Routines
- Quote extraction routine (every hour): read the available web feeds; run the quote extraction procedure; store the extracted information in the database; run the quote duplicate detection routine; run the classification procedure; store the classification in the database
- Topic identification + classifier re-training (every 24 hours): run the topic detection procedure on all news items in the database to build the training set T_I; vectorize all news items; train Rocchio; train the SVMs; store the Rocchio class descriptions and the SVM models
- A rough scheduling skeleton is sketched below
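As an illustration only, the skeleton below runs an hourly step and a daily step; the routine bodies are passed in as callables because the slides only list their sub-steps.

```python
import time

def run_update_loops(hourly_step, daily_step, retrain_every_hours=24):
    """Run hourly_step every hour and daily_step every retrain_every_hours hours.
    hourly_step: fetch feeds, extract quotes, deduplicate, classify, store.
    daily_step: rebuild the training set T_I, retrain Rocchio and the SVMs, store models."""
    hours_since_retrain = retrain_every_hours      # force a first retraining pass
    while True:
        if hours_since_retrain >= retrain_every_hours:
            daily_step()
            hours_since_retrain = 0
        hourly_step()
        time.sleep(3600)                           # wait one hour
        hours_since_retrain += 1
```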
Results and Error Analysis
- Statistics from early January 2009, after 47 days of operation:
- 26,266 news items
- 570 extracted quotes, of which 68 were not actual quotes (11.9%)
- 337 distinct named entities, of which 6 were incorrect (1.8%)
- Over 197 different topics (1 incorrect topic identification)
- Most of the errors have no impact on readability
- Classification: 42 topics misattributed (7.4%)
- No recall figures yet
Conclusions and Future Work (I)
- A fully functional online service working over live data from the Portuguese mainstream media
- The overall feedback, both online and offline, has been positive
- Still much work ahead:
- Increase the number of news sources
- Improve quotation extraction: create new rules for other common patterns (both in the news body and in the title); resolve anaphoric references
Conclusions and Future Work (II)
- Improve topic extraction and classification: news sources are not consistent in the words used to describe topics, e.g. "Crise Financeira" vs. "Crise Económica" (Financial Crisis vs. Economic Crisis) or "Desporto" vs. "Futebol" (Sport vs. Football)
- Upgrade the end-user interface: an additional navigational axis based on temporal information (e.g. a filter-by-time feature)
- Evaluation: develop a reference collection for computing precision and recall at several stages
References
1. Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science, pages 359-366, 2000.
2. Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137-142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
3. Thorsten Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999. Software available at http://svmlight.joachims.org/.
4. Khoo Khyou Bun and Mitsuru Ishizuka. Topic extraction from news archive using TF*PDF algorithm. In Proceedings of the 3rd International Conference on Web Information Systems Engineering (WISE 2002), pages 73-82. IEEE Computer Society, 2002.
5. Ralf Krestel, Sabine Bergler, and René Witte. Minding the source: Automatic tagging of reported speech in newspaper articles. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), May 2008.
6. Bruno Pouliquen, Ralf Steinberger, and Clive Best. Automatic detection of quotations in multilingual news. In Proceedings of Recent Advances in Natural Language Processing 2007, Borovets, Bulgaria, 2007.
7. J. Rocchio. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System, pages 313-323, Englewood Cliffs, New Jersey, 1971. Prentice Hall.
Verbatim
Thank you!
http://irlab.fe.up.pt/p/verbatim/
Luís Sarmento: las@fe.up.pt
Sérgio Nunes: sergio.nunes@acm.org