VERBATIM Automatic Extraction of Quotes and Topics from News Feeds




Luís Sarmento and Sérgio Nunes
4th Doctoral Symposium on Informatics Engineering, Porto, Portugal, February 5-6, 2009

Verbatim: Motivation
Growth in information production poses increasing challenges to consumers: information overflow.
Tools that work as personal information butlers.
verbatim acquires information from live news feeds, extracts quotes and topics, and presents this information in a web interface.
Automatic watchdog: confronts quotes by the same entities on the same topics over time.

Related Work (I)
NewsExplorer [6] extracts quotations in multilingual news: the quotes, the name of the entity making the quote, and the entities mentioned.
Krestel et al. [5] describe the development of a reported speech extension to the GATE framework, for English.
In [4], the authors propose the TF*PDF algorithm for extracting terms to be used as descriptive tags; most tags are quite uninformative and inappropriate as high-level topic tags.

Related Work (II)
In-Quotes, from Google, presents a web-based interface structured in issues (i.e. topics) and displays side-by-side quotes from two actors at a time. However, no implementation details are known.
Our work is different:
  It is focused on a single language (Portuguese).
  It addresses the problem of topic extraction and distillation, while most related works assume that news topics have been previously identified.

System Overview
Data Acquisition and Parsing
Quote Extraction
Removal of Duplicates
Topic Classification
  Topic Identification + Generation of Training Set
  Training the Topic Classifiers
  Topic Classification Procedure
Web Interface
Update Routine

Data Acquisition and Parsing
News gathering uses a fixed number of data feeds from major Portuguese mainstream media sources.
Only generic mainstream sources are included in this initial selection.
This avoids the major challenges faced in web crawling.
We customized content decoding routines for each individual source.
Fetching is performed periodically, every hour, on all sources.
Content is stored in UTF-8 encoded format on the server.

Quote Extraction
There is a large variety of ways in which quotes can be expressed.
We only address quotes that explicitly mention the name of the speaker, to avoid anaphora resolution.
More specifically, we look for sentences in the body of the news feed that match the following pattern:
  Name of Speaker, Optional Ergonym, Speech-Act, Modifier, Quote
Example:
  O Primeiro-ministro, José Sócrates, anunciou esta terça-feira que o Itinerário Principal 4 (IP4), que liga Vila Real a Bragança, será transformado em auto-estrada daqui a três anos,...
  [English: The Prime Minister, José Sócrates, announced this Tuesday that the Itinerário Principal 4 (IP4), which links Vila Real to Bragança, will be converted into a motorway three years from now,...]
19 matching patterns and 35 speech acts: 5% of news.
We have low recall at this stage (but high precision).
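The pattern above can be illustrated with a single regular expression. This is only a minimal sketch of one possible pattern, not one of the 19 patterns used by verbatim: the `SPEECH_ACTS` list, the `NAME` expression, and the `extract_quote` function are assumptions made for illustration, and the sketch follows the order seen in the example sentence (ergonym before the name).

```python
import re

# Hypothetical, tiny subset of speech-act verbs (the slides mention 35).
SPEECH_ACTS = ["anunciou", "afirmou", "disse", "declarou"]

# Assumed shape of a speaker name: two or more capitalised words.
NAME = r"[A-ZÀ-Ý][\w-]+(?: [A-ZÀ-Ý][\w-]+)+"

PATTERN = re.compile(
    r"^(?:(?P<ergonym>[^,]+), )?"              # optional ergonym, e.g. "O Primeiro-ministro"
    r"(?P<speaker>" + NAME + r"),? "           # explicit speaker name
    r"(?P<act>" + "|".join(SPEECH_ACTS) + r")"  # speech act
    r"(?P<modifier>(?: (?!que\b)\S+)*?)"       # optional modifier, e.g. "esta terça-feira"
    r" que (?P<quote>.+)$"                     # reported speech introduced by "que"
)


def extract_quote(sentence):
    """Return the parts of a quote, or None if the sentence does not match."""
    m = PATTERN.match(sentence.strip())
    if m is None:
        return None
    return {
        "ergonym": m.group("ergonym"),
        "speaker": m.group("speaker"),
        "speech_act": m.group("act"),
        "modifier": m.group("modifier").strip(),
        "quote": m.group("quote").strip(),
    }
```

Requiring an explicit name and a known speech-act verb is what keeps precision high, and it is also why recall is low: any sentence phrased differently is simply not matched.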

Removal of Duplicates (I)
It is usual to find duplicate or near-duplicate news, from which duplicate quotes will be extracted.
We try to aggregate the most similar quotes in quote groups, Q_1, Q_2, ..., Q_last.
Each new quote, q_new, is compared with the k most recent quote groups: Q_last, Q_last-1, Q_last-2, ..., Q_last-k+1.
If the similarity between q_new and any of these groups is higher than a given threshold, s_min, then q_new is added to the most similar group.

Removal of Duplicates (II)
Otherwise, a new group, Q_new, is created, containing q_new only.
The new quote q_new is compared with the longest quote of each group:
  First, check if the speakers are the same.
  Then, content similarity is computed:
    vector representation using a binary bag-of-words approach (stop words are removed)
    vectors are compared using the Jaccard coefficient
  If Sim > 0.25, the quotes are considered duplicates.
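The grouping procedure described on the two slides above could be sketched as follows. The threshold 0.25 comes from the slides; the value of `k`, the tiny `STOP_WORDS` list, the tokenisation, and the data layout (a list of dictionaries) are assumptions made for illustration.

```python
import re

# Hypothetical, very small Portuguese stop word list.
STOP_WORDS = {"o", "a", "de", "que", "em", "e"}


def tokens(text):
    """Binary bag of words: the set of non-stop-word tokens."""
    return {w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard coefficient between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def add_quote(groups, speaker, text, k=10, s_min=0.25):
    """Add a quote to the most similar of the k most recent groups,
    or open a new group. Returns the index of the group used."""
    new_tokens = tokens(text)
    best_idx, best_sim = None, s_min
    for idx in range(max(0, len(groups) - k), len(groups)):
        group = groups[idx]
        if group["speaker"] != speaker:      # speakers must match first
            continue
        longest = max(group["quotes"], key=len)
        sim = jaccard(new_tokens, tokens(longest))
        if sim > best_sim:                   # strictly above the threshold
            best_idx, best_sim = idx, sim
    if best_idx is None:
        groups.append({"speaker": speaker, "quotes": [text]})
        return len(groups) - 1
    groups[best_idx]["quotes"].append(text)
    return best_idx
```

Comparing only against the k most recent groups bounds the cost per new quote, on the assumption that near-duplicate news items arrive close together in time.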

Topic Classification
verbatim assigns a topic tag to each quote.
There is a wide variety of topics in the news, and new, unseen topics can be added as more news is collected.
Efficient topic classification of news requires us to:
  dynamically identify new topic tags as they appear in the news
  automatically generate a training set using the new topic tags
  re-train the topic classification procedure

Identification of Topics & Generation of Training Set
Topic tags are identified by mining a common structure in titles: topic tag: title headline
  Literatura:"A viagem do elefante", de José Saramago, tem lançamento mundial quinta-feira em São Paulo...
  [English: Literature: "A viagem do elefante", by José Saramago, has its worldwide launch on Thursday in São Paulo...]
From about 26,000 news items, we found 783 different topic tags (occurring in 2+ titles).
Generation of a training set:
  For every tag t_i in the set of topic tags found, T, group the set of news items for that topic, I_i = (i_1i, i_2i, ..., i_ni).
  We will denote the complete training set as T I.
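A minimal sketch of this title-mining step, assuming that the tag is whatever precedes the first colon in a title. The three-word limit on tags is an assumed heuristic for discarding titles that merely contain a colon, and the function names are hypothetical; the "2+ titles" rule comes from the slide.

```python
from collections import defaultdict


def split_title(title, max_tag_words=3):
    """Split 'tag: headline' into (tag, headline); return None if no tag is found."""
    tag, sep, headline = title.partition(":")
    tag = tag.strip()
    if not sep or not tag or len(tag.split()) > max_tag_words:
        return None
    return tag, headline.strip()


def build_training_set(titles, min_titles=2):
    """Group headlines by topic tag, keeping tags seen in at least min_titles titles."""
    by_tag = defaultdict(list)
    for title in titles:
        parsed = split_title(title)
        if parsed is not None:
            tag, headline = parsed
            by_tag[tag].append(headline)
    return {tag: items for tag, items in by_tag.items() if len(items) >= min_titles}
```

In this sketch only the headlines are grouped; in the system described on the slides the grouped elements are the news items themselves.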

Training the Topic Classifiers
Two different text classification approaches: Rocchio classification [7] and SVM [2].
Both involve representing news as vectors of features.
We use a bag-of-words approach for vectorizing news feed items:
  (word, frequency)
  information about the location of each word (title or body of the news) is kept
  stop words are removed
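One way to keep the location information is to make it part of the feature key, so that a word in the title and the same word in the body count as different features. This encoding is an assumption (the slides do not say how location is stored), and the stop word list below is a hypothetical placeholder.

```python
import re
from collections import Counter

# Hypothetical, very small Portuguese stop word list.
STOP_WORDS = {"a", "o", "de", "da", "na", "e"}


def vectorize(title, body):
    """Bag of words with frequencies, keyed by (location, word)."""
    vector = Counter()
    for location, text in (("title", title), ("body", body)):
        for word in re.findall(r"\w+", text.lower()):
            if word not in STOP_WORDS:
                vector[(location, word)] += 1
    return vector
```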

Rocchio Classification
Rocchio classification is a straightforward way to classify items using a nearest-neighbour strategy.
For each topic t_i, of a set of T topics, we need to obtain [c_i], a vector representing the topic class.
[c_i] is computed by aggregating the vectors of the news items [i_ij] for that topic (TF-IDF weighting of features is performed).
Classification is made by comparing [i_new] against the class descriptions of the T classes.
We used the cosine, i.e. cos([i_new], [c_i]).
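The following sketch shows one way to realise this scheme. The slides only say that vectors are "aggregated"; averaging them into a centroid is an assumption, as is the particular IDF formula (log(N/df) + 1). All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def idf_weights(docs):
    """IDF per term; the '+ 1' keeps terms that occur in every document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Weight raw term frequencies; terms unseen in training get weight 0."""
    return {term: tf * idf.get(term, 0.0) for term, tf in doc.items()}


def centroid(vectors):
    """Average a list of sparse vectors (assumed aggregation)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_rocchio(training_set):
    """training_set maps each topic to a list of term-frequency dicts."""
    all_docs = [doc for docs in training_set.values() for doc in docs]
    idf = idf_weights(all_docs)
    classes = {topic: centroid([tfidf(doc, idf) for doc in docs])
               for topic, docs in training_set.items()}
    return idf, classes


def classify_rocchio(doc, idf, classes):
    """Return (best cosine, topic) for a term-frequency dict."""
    vec = tfidf(doc, idf)
    return max((cosine(vec, c), topic) for topic, c in classes.items())
```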

SVM Classification
SVMs are effective for classifying items described in high-dimensional spaces, as in text classification.
SVMs are binary classifiers, so we need to train one SVM for each topic t_k, using I_k as positive examples and I \ I_k as negative examples:
  svm_k = train_svm(I_k, I \ I_k)
Then, for a given news item, i_news:
  svm_k([i_news]) > 0 if i_news ~ topic t_k
  svm_k([i_news]) < 0 if i_news !~ topic t_k
We used SVM-light [3] with default parameters.
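The one-classifier-per-topic setup amounts to building, for each topic, a positive set and a negative set made of everything else. The sketch below shows that split, plus a helper that writes one example in the sparse text format read by SVM-light (target value followed by `feature:value` pairs in increasing feature order). The function names and the integer feature ids are assumptions; the training call itself is not shown.

```python
def one_vs_rest_sets(training_set):
    """For each topic k, return (I_k, I minus I_k) as (positives, negatives)."""
    splits = {}
    for topic, positives in training_set.items():
        negatives = [item
                     for other, items in training_set.items() if other != topic
                     for item in items]
        splits[topic] = (positives, negatives)
    return splits


def to_svmlight_line(label, features):
    """Format one example: label is +1 or -1, features maps int id (>= 1) to value."""
    pairs = " ".join(f"{fid}:{value:g}" for fid, value in sorted(features.items()))
    return f"{label:+d} {pairs}"
```

With many topics, each binary problem is heavily unbalanced, since the negatives are all the items of every other topic.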

Topic Classification Procedure
Let T = (t_1, t_2, ..., t_k) be the set of topic tags, i_qt be the news item to classify, and [i_qt] be its vector representation. Then:
  find svm_max, the maximum svm_k([i_qt]), corresponding to k = k_svmmax
  find roc_max, the maximum cos([c_k], [i_qt]), corresponding to k = k_rocmax
  if svm_max >= min_svm, i_qt ~ t_k_svmmax
  else if roc_max >= min_roc, i_qt ~ t_k_rocmax
  else do not classify i_qt (23% of cases)
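The cascade above can be sketched in a few lines. The comparison operators are missing from the slide transcript, so "greater than or equal to the threshold" is an assumption, and the threshold values used in the usage lines are invented for illustration only.

```python
def classify_topic(svm_scores, roc_scores, min_svm, min_roc):
    """Pick a topic from per-topic SVM outputs and Rocchio cosines.

    Both arguments map topic tag -> score. Returns a topic tag, or None
    when neither classifier is confident enough.
    """
    k_svm = max(svm_scores, key=svm_scores.get)
    k_roc = max(roc_scores, key=roc_scores.get)
    if svm_scores[k_svm] >= min_svm:     # SVM takes precedence
        return k_svm
    if roc_scores[k_roc] >= min_roc:     # fall back to Rocchio
        return k_roc
    return None                          # leave the item unclassified


# Usage with invented scores and thresholds:
svm = {"Desporto": -0.4, "Economia": 0.1}
roc = {"Desporto": 0.35, "Economia": 0.20}
print(classify_topic(svm, roc, min_svm=0.5, min_roc=0.3))  # Rocchio fallback
```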

Web Interface (I)

Web Interface (II)

Update Routines
Quote extraction routine (once per hour):
  Read the web feeds available.
  Run the quote extraction procedure.
  Store the extracted information in the DB.
  Run the quote duplicate detection routine.
  Run the classification procedure.
  Store the classification in the database.
Topic Identification + Classifier re-training (once every 24 hours):
  Run the topic detection procedure on all news in the DB to build the training set T I.
  Vectorize all news items.
  Train Rocchio.
  Train SVM.
  Store Rocchio classes and SVM descriptions.

Results and Error Analysis
Statistics from early January 2009 (47 days up):
  26,266 news items
  570 quotes (68 not quotes: 11.9%)
  337 distinct named entities (6 incorrect: 1.8%)
  Over 197 different topics (1 incorrect topic identification)
Most of the errors have no impact on readability.
Classification: 42 topics misattributed (7.4%).
No recall figures yet.
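The percentages above can be reproduced from the raw counts. The denominators are taken from the slide where stated (570 quotes, 337 entities); using 570 as the denominator for the misattributed topics is an assumption that happens to match the reported 7.4%. The `rate` helper is hypothetical.

```python
def rate(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


not_quotes = rate(68, 570)              # extracted items that are not quotes
bad_entities = rate(6, 337)             # incorrect named entities
misattributed = rate(42, 570)           # assumed denominator: all quotes
quote_precision = rate(570 - 68, 570)   # implied precision of quote extraction

print(not_quotes, bad_entities, misattributed, quote_precision)
```

The implied quote extraction precision of about 88% is consistent with the "high precision, low recall" remark made on the Quote Extraction slide.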

Conclusions and Future Work (I)
Fully functional online service working over live data from the Portuguese mainstream media.
The overall feedback, both online and offline, has been positive.
Still much work ahead:
  Increase the number of news sources.
  Improve quotation extraction by creating new rules for other common patterns (both in news body and title).
  Resolve anaphoric references.

Conclusions and Future Work (II)
Improve topic extraction and classification: news sources are not consistent about the words used to describe the topics, e.g. Crise Financeira vs. Crise Económica (Financial Crisis vs. Economic Crisis) or Desporto vs. Futebol (Sport vs. Football).
Upgrade the end-user interface: additional navigational axis based on temporal information (e.g. a filter-by-time feature).
Evaluation: develop a reference collection for computing Precision and Recall at several stages.

References
1. Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science, pages 359-366, 2000.
2. Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137-142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.
3. Thorsten Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999. Software available at http://svmlight.joachims.org/.
4. Khoo Khyou Bun and Mitsuru Ishizuka. Topic extraction from news archive using TF*PDF algorithm. In Proceedings of 3rd Int'l Conference on Web Information System Engineering (WISE 2002), IEEE Computer Soc, pages 73-82. WISE, 2002.
5. Ralf Krestel, Sabine Bergler, and René Witte. Minding the source: Automatic tagging of reported speech in newspaper articles. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). European Language Resources Association (ELRA), May 2008.
6. Bruno Pouliquen, Ralf Steinberger, and Clive Best. Automatic detection of quotations in multilingual news. In Proceedings of Recent Advances in Natural Language Processing 2007, Borovets, Bulgaria, 2007.
7. J. Rocchio. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System, pages 313-323, Englewood Cliffs, New Jersey, 1971. Prentice Hall.

Verbatim
Thank you!
http://irlab.fe.up.pt/p/verbatim/
Luís Sarmento: las@fe.up.pt
Sérgio Nunes: sergio.nunes@acm.org