Internet as corpus: automatic construction of a Swedish news corpus



Similar documents
Send the completed application form (pages 2-4) to the following address: The Ingmar Bergman Foundation P.O Box Stockholm Sweden

Research Portfolio. Beáta B. Megyesi January 8, 2007

WEB SITE DEVELOPMENT WORKSHEET

User (Student) Instruction Manual Local Document Archive Maintenance System (SOLAD)

Unifying Search for the Desktop, the Enterprise and the Web

Technical Specifications (Excerpt) TrendInfoWorld Web Site

Web Extras. Customer Service Description. Version 3.0. February 26, 2002

Content Management Using Rational Unified Process Part 1: Content Management Defined

How To Design The Web And The Internet

Diagnosis Code Assignment Support Using Random Indexing of Patient Records A Qualitative Feasibility Study

SRCSB General Web Development Policy Guidelines Jun. 2010

Developing Your School Website

Search Engine Optimization for a WebSphere Commerce System

Administrator s Guide

Lesson Overview. Getting Started. The Internet WWW

Personal home pages on the World Wide Web a simple version of a knowledge net?

Search and Information Retrieval

Frequently Asked Questions for the USA TODAY e-newspaper

Fig (1) (a) Server-side scripting with PHP. (b) Client-side scripting with JavaScript.

Introduction to Search Engine Marketing

Usage Analysis Tools in SharePoint Products and Technologies

Adobe Acrobat 6.0 Professional

Newspaper Preservation. by H.R. Mohan Associate VP (Systems) The Hindu Chennai

Technical Report. The KNIME Text Processing Feature:

Business Communication at the Transnational Level

Grails 1.1. Web Application. Development. Reclaiming Productivity for Faster. Java Web Development. Jon Dickinson PUBLISHING J MUMBAI BIRMINGHAM

Make search become the internal function of Internet

INTERNET MARKETING. SEO Course Syllabus Modules includes: COURSE BROCHURE

Subscribe to RSS in Outlook Find RSS Feeds. Exchange Outlook 2007 How To s / RSS Feeds 1of 7

iweb for Business Practical Search Engine Optimization A Step by Step Guide

Software User Guide. WordPress Plugin Version 1.0

Enterprise Archive Managed Archiving & ediscovery Services User Manual

Collecting Polish German Parallel Corpora in the Internet

An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them

Digital archiving of scientific information Czech experience

XML-BASED INTEGRATION: A CASE STUDY

Chapter 12 Creating Web Pages

Website Standards Association. Business Website Search Engine Optimization

A prototype infrastructure for D Spin Services based on a flexible multilayer architecture

OpenIMS 4.2. Document Management Server. User manual

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013

ANALYZING OF THE EVOLUTION OF WEB PAGES BY USING A DOMAIN BASED WEB CRAWLER

SAP BusinessObjects Business Intelligence Platform Document Version: 4.1 Support Package Business Intelligence Launch Pad User Guide

PDG Software. Site Design Guide

Policy Based Encryption E. Administrator Guide

Policy Based Encryption E. Administrator Guide

Text Analytics. Introduction to Information Retrieval. Ulf Leser

SEO Search Engine Optimization. ~ Certificate ~ For: Q MAR WDH By

Search Engine Optimization with Jahia

McCord Web Services LLC 2254 Ravenna Court Waldorf, MD Voice: Fax: The Tangible Benefits of Blogging

NEWSLETTERS FOR LEAD NURTURING LEADFORMIX BEST PRACTICES

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Dreamweaver CS5. Module 2: Website Modification

THE CONVERSION OF A LINEAR TEXT TO HYPERTEXT

Policy Based Encryption Z. Administrator Guide

Outline. CIW Web Design Specialist. Course Content

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

UH CMS Basics. Cascade CMS Basics Class. UH CMS Basics Updated: June,2011! Page 1

A Java proxy for MS SQL Server Reporting Services

Sonian Getting Started Guide October 2008

Further web design: HTML forms

Configuration Information

Appendix N INFORMATION TECHNOLOGY (IT) YOUTH APPRENTICESHIP WEB & DIGITAL COMMUNICATIONS PATHWAY WEB & DIGITAL MEDIA UNIT UNIT 6

SEO Optimization A Developer s Role

COMPLIANCE MATRIX of GIGW

Self-Service Portal Implementation Guide

Glossary of translation tool types

SYSTRAN v6 Quick Start Guide

IBM Unica emessage Version 8 Release 6 February 13, User's Guide

WebPublish. WebPublish is a modern and powerful publishing tool that can be customized to suit the needs of every organization.

IBM Campaign and IBM Silverpop Engage Version 1 Release 2 August 31, Integration Guide IBM

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

Report on Content Management Systems. University Communications Web Services Office March,

hite News & Social Media Naroclips Instant Intelligence

User Guide. Chapter 6. Teacher Pages

Data Deduplication in Slovak Corpora

LabVIEW Internet Toolkit User Guide

Open Source Content Management System for content development: a comparative study

SEO Search Engine Optimization. ~ Certificate ~ For: By. and

ONLINE BUSINESS PLATFORM. SEO Checklist: 7 Essential Checks Before Launching Your Sitefinity Website

Website Search Engine Optimization (SEO) Evaluation XXXXXXX

Website Communication

EBOX Digital Content Management System (CMS) User Guide For Site Owners & Administrators

Overall. Accessibility. Content. Marketing. Technology. Report for help.sitebeam.net. Page pages tested on 29th October 2012

Web Design Foundations ( )

DocuShare User Guide

Content Management Systems: Drupal Vs Jahia

The document may be freely distributed in its entirety, either digitally or in printed format, to all EPiServer Mail users.

Geo Targeting Server location, country-targeting, language declarations & hreflang

Kanzelhöhe Online Data Archive KODA

THE WHOLE INTERNET USER'S GUIDE & CATALOG EDKROL

Administrator's Guide

Website and Marketing Best Practices Guide*

Information Technology Studies

User Guide to the Content Analysis Tool

Question Answering for Dutch: Simple does it

Archive Attender. Version 3.2. White Paper. Archive Attender is a member of the Attender Utilities family.

Using LSI for Implementing Document Management Systems Turning unstructured data from a liability to an asset.

Sitecore InDesign Connector 1.1

Chapter 12 Creating Web Pages

Transcription:

Department of Numerical Analysis and Computing Science TRITA-NA-P0117 IPLab-195 Internet as corpus: automatic construction of a Swedish news corpus Martin Hassel Interaction and Presentation Laboratory (IPLab)

Martin Hassel Internet as corpus: automatic construction of a Swedish news corpus Report number: IPLab-195, TRITA-NA-P0117 Publication: August 2001 E-mail of author: xmartin@nada.kth.se Reports can be ordered from: Interaction and Presentation Laboratory (IPLab) Numerical Analysis and Computing Science (Nada) Royal Institure of Technology (KTH) S-100 44 STOCKHOLM, Sweden telephone; + 46 8 790 6280 fax: + 46 8 10 2477 e-mail: reports-iplab@nada.kth.se URL: http://www.nada.kth.se/nada/iplab/

Abstract Internet as Corpus Automatic Construction of a Swedish News Corpus Martin Hassel NADA-KTH Royal Institute of Technology 100 44 Stockholm, Sweden ph: +46 8 790 66 34 fax: +46 8 10 24 77 email: xmartin@nada.kth.se This paper describes the automatic building of a corpus of short Swedish news texts from the Internet, its application and possible future use. The corpus is aimed at research on Information Retrieval, Information Extraction, Named Entity Recognition and Multi Text Summarization. The corpus has been constructed by using an Internet agent, the so called newsagent, downloading Swedish news text from various sources. A small part of this corpus has then been manually tagged with keywords and named entities. The newsagent is also used as a workbench for processing the abundant flows of news texts for various users in a customized format in the application Nyhetsguiden. Keywords News text, Corpus, Swedish, Internet Introduction Two years ago we built an automatic text summarizer called SweSum (Dalianis, 2000) for Swedish text. We wanted to evaluate SweSum but there were no tagged Swedish corpus available to evaluate text summarizers or information retrieval tools processing Swedish as it is for the English speaking community, mainly through the TREC, (Vorhees & Tice 2000), MUC and TIPSTER-SUMMAC evaluation conferences (Mani et al. 1998, Krenn & Samuelsson 1997). The purpose of this project 1 was to construct test bed for new natural language technology tools, i.e. automatic text summarization, named entity tagging, stemming, information retrieval/extraction, etc. In the process of building this system, Nyhetsguiden (Hassel 2001), we also made it capable of gathering the news texts into a corpus, a corpus we have used to train and evaluate such tools as mentioned above. As this corpus is aimed at research on information and language technology applied on redundant text, the system does not, contrary to (Hofland 2000), remove duplicated concordance lines 1. Nyhetsguiden A user centred news delivery system The system has a modular design and consists of three parts, the user interface, the user database and the main application, newsagent. Being modular, the system can be run as a distributed system or on a single web server. When run as a distributed system, at least newsagent must be run on a computer with Internet access. The user interface (Nyhetsguiden) and the user database can reside on either an Internet or Intranet capable server depending on the desired public access to the system. newsagent is the core of the system and is basically a web spider that is run in a console window. The spider is implemented in Perl, which makes it platform independent, that is, it can run on any platform running Perl (Unix/Linux, Windows, Macintosh, BeOS, Amiga, etc). On intervals of 3-5 minutes newsagent searches the designated news sources (Appendix A) for new news texts, that is news texts not seen by the system before. When a new news text is encountered it is fetched, the actual news text and accompanying illustrations are extracted (by removing navigation panels, banners, tables of links, etc). 1 This project is supported by NUTEK (Swedish board for Industrial and Technical Development) FavorIT programme in cooperation with Euroseek AB.

The resulting document is then passed through the system and, depending on configuration; stored, summarized and routed to the end recipient. 2. Construction of a Corpus of Swedish News Texts Traditionally it has been hard work constructing a corpus of news text. In Sweden there are no newspapers that on a yearly basis offer their paper in digital form 2, as some foreign newspapers do (for example Wall Street Journal), meaning that obtaining this material has to be done on demand. Many Swedish newspapers are, when inquired, unwilling to release texts from their archives for research purposes, and even when they do, it is often the question of a small amount of news texts with an age of several years. This may potentially lead to the exclusion of contemporary words and giving unusually high, or low, occurrence frequencies to words related to phenomena limited to a certain period of time. In the past, the solution would be to collect newspapers in their paper form and type or scan (using a Optical Character Recognition program) them in order to convert them to a format manageable by computers. The World Wide Web is, on the other hand, today a large collection of texts written in different languages and thus giving an abundant resource for language studies already in a format, by necessity, manageable by computers. Many of the web pages are also frequently updated and thus give us a steady access to concurrent use of language in different fields. In this situation, neglecting the usability of Internet as a corpus would be foolish. In our case we used a tool called newsagent that is a set of Perl scripts designed for gathering news texts, news articles and press releases from the web and routing them by mail according to subscribers defined information needs. 3. KTH News Corpus The project with the KTH News Corpus was initiated in May 2000. We started out collecting news telegrams, articles and press releases from three sources but with the ease of adding new sources we settled for twelve steady news sources (Appendix D). The choice of these news sources was based partly on site and page layout, partly on the wish to somewhat balance the corpus over several types of news topics. Among the chosen news sources are both general news, daily press, and specialized news sources. The reason for this is the possibility of comparing how the same event is described depending on targeted reader (wording, level of detail, etc). As of February 2001 we have gathered more than 100.000 texts amounting to over 200Mb with an increase of over 10.000 new texts each month. The increase in word forms during March was almost 230.000. The lengths of the texts vary between 5 and 500 lines with a tendency towards the shorter and an average length of 193 words per text. The texts are stored in HTML tagged format but only the news heading and the body of the news text is preserved. All other page layout and all navigation tables and banners are removed. Each text is tagged with Meta tags storing the information on time and date of publication, source and source URL. We stored the news in different categories (Appendix A) and thus giving the possibility to study the difference in use of language in, for example, news on cultural respectively sports event. We did this using the news sources own categorization of their news texts (finance, sports, domestic, foreign, etc), instead of a reader based categorization, such as described in (Karlgren 2000). The corpus is structured into these categories by the use of catalogue structure, a Hypertext linked index and a search engine driven index thus giving several modes of orientation in the corpus. For the purpose of evaluating a Swedish stemmer in conjunction with a search engine (Carlberger et al 2001), we manually tagged 100 texts TREC style and constructed questions and answers central to each text. We also tagged each text with named entities (names, places, organisations and date/time) and the five most significant keywords for future evaluation purposes. 2 We have as yet only been able to aquire 1995 years issue of Svenska Dagbladet (SVD) and I know that the Scarrie Swedish News Corpus (Dahlqvist 1998) contains all articles published in SVD and Uppsala Nya Tidning (UNT) during the same period. - 2 -

Unfortunately copyright issues remain unsolved, we have no permission from the copyright holders except fair use, and so the corpus can only be used for research within our research group. The tool for gathering the corpus, newsagent, is on the other hand available for use outside our research group (with the exclusion of mail routing and FTP plug-ins). 4. Areas of Use So far the corpus has been used for evaluation and training purposes. Knutsson (2001) has employed the corpus for evaluating error detection rules for Granska (Domeij et al), a program for checking for grammatical errors in Swedish unrestricted text. The tagged texts have besides, as mentioned above, being used for evaluation of a Swedish stemmer also been utilized in the evaluation of SweSum (Dalianis & Hassel 2001), an automatic text summarizer that among other languages handles Swedish unrestricted HTML tagged or untagged ASCII text and for the training and evaluation of a Named Entity Tagger (Dalianis & Åström 2001). In the near future parts of the corpus will be used and for expanding SweSum with Multi Text Summarization. Other possible areas of use are for producing statistics and lexicons, and for developing a Topic Detection Tracking (Wayne 2000), system for Swedish news. 5. The Future of the Corpus I am now on the verge of rewriting the corpus tools since we now are more fully aware of its potential uses. Among planned improvements are: Internal representation in XML Automatic tagging of: o Parts-of-speech o Clause and sentence boundaries o Named Entities (persons, locations, etc.) Automatic summarization of each text Automatic running statistics o Average increase per month/week in number of: Texts Sentences Words Word forms o Average: Text length (in sentences, words & characters) Sentence length (in words & characters) Word length o Total number of: Texts Sentences Words Word forms Hopefully a solution to the current copyright issues A more balanced choice of channels/sources This will hopefully result in a tool that in a short period can build a corpus of plain, tagged and summarized versions of the same news text along with appropriate statistics. 6. Conclusions A concluding remark is that a small piece of programming has grown to a complete system which we had great use of in training and evaluation of various natural language tools and that the newsagent has been a incentive to push our research beyond foreseeable limits. As a part of our online service Nyhetsguiden we have also gained as much as fifty willing beta testers of our language technology tools. We are now on the verge to incorporate our new Named Entity Tagger into newsagent. We also believe that this proves that it is feasible to acquire a substantial corpus, over a short period of time, from the Internet. One may argue that as long as copyright issues - 3 -

are note solved, the corpus has no legal use outside our research group. While this is true, the corpus has been of great use to us in our research and the corpus tools still remain for public use. The tools have proven to be practically service free run without major problems. Since the same news reports are, potentially, repeated over news sources and time, the resulting corpus will be of much use for research on Information Extraction/Retrieval and Topic Detection Tracking. Acknowledgements I would like to thank Hercules Dalianis and Ola Knutsson for comments on early versions of this paper. 7. References J. Carlberger, H. Dalianis, M. Hassel, O. Knutsson 2001. Improving Precision in Information Retrieval for Swedish using Stemming. NODALIDA 01-13th Nordic Conference on Computational Linguistics, May 21-22 2001, Uppsala, Sweden. B. Dahlqvist 1998. The SCARRIE Swedish News Corpus, Reports from the SCARRIE project, editor A. Sågwall Hein. Uppsala University, Sweden H.Dalianis 2000. SweSum - A Text Summarizer for Swedish, IPlab-174, Technical report, NADA, KTH October. http://www.nada.kth.se/~hercules/textsumsummary.html H. Dalianis and M. Hassel 2001. Development of a Swedish Tagged Corpora for Evaluating Summarizers. NADA-KTH. Technical report for the SeaSum project, FavorIT programme, NUTEK. http://www.nada.kth.se/~hercules/papers/textsumdraft.pdf H. Dalianis and E. Åström 2001. A Swedish Named Entity Tagger Its Construction and Evaluation. NADA-KTH (forthcoming). R. Domeij, O. Knutsson, J. Carlberger, V. Kann 1999. Granska - an efficient hybrid system for Swedish grammar checking. NODALIDA 99, 9-10 December 1999, Trondheim, Norway. M. Hassel 2001. newsagent - A tool for automatic news surveillance and corpora building. Technical report for the SeaSum project, FavorIT programme, NUTEK. http://www.nada.kth.se/~xmartin/papers/nutek.pdf K. Hofland 2000. A self-expanding corpus based on newspapers on the Web. In Proceedings of Second Internation Conference on Language Resources and Evaluation. LREC-2000 Athens, Greece, 31 May - 2 June 2000. pp 1271-1272. J. Karlgren 2000. Assembling a Balanced Corpus from the Internet. In Stylistic Experiments for Information Retrieval. Dissertation for the Degree of Doctor of Philosophy, Stockholm University, Department of Linguistics. pp 99-104. O. Knutsson 2001. Automatisk språkgranskning av svensk text (in Swedish). Dissertation for the Degree of Licentiate of Philosophy, Kungliga Tekniska Högskolan, NADA. B. Krenn and C. Samuelsson 1997. The Linguist's Guide to Statistics (Chapter 3 Basic Corpus Linguistics, http://www.coli.uni-sb.de/~krenn/edu.html I. Mani, T. Firmin, D. House, M. Chrzanowski, M. Klein. G. Hirschman, B.Sundheim and L. Obrst 1998. The TIPSTER Text Summarization Evaluation. Final Report. Mitre Technical Report MTR 98W0000138, October 1998. E.M. Vorhees and D.M.Tice 2000. The TREC-8 Question Answering System Track, In the proceedings of Second International Conference on Language Resources and Evaluation. LREC-2000 Athens, Greece, 31 May-2 June 2000. pp. 1501-1508. C. Wayne 2000. Multilingual Topic Detection and Tracking: Successful Research Enabled by Corpora and Evaluation. In Proceedings of Second Internation Conference on Language Resources and Evaluation. LREC-2000 Athens, Greece, 31 May - 2 June 2000. - 4 -

Appendix A News sources and categories used by newsagent. Source: Aftonbladet Amnesty International BIT.se (Sifo Group) Dagens Industri Dagens Nyheter Homoplaneten (RFSL) Tidningen Mobil International Data Group Medströms Förlag Senaste Nytt.com Svenska Dagbladet Svenska Eko-nyheter Sveriges Riksdag Categories: - Economics, cultural, sports, domestic and foreign news - Press releases and news on human rights - Press releases from companies - News on the industrial market - Economics, cultural, sports, domestic and foreign news - News concerning rights of the homosexual community - News articles on mobile communication - News articles on computers - News articles on computers - News flashes (discontinued) - News flashes - News flashes - Press releases from the Swedish Parliament - 5 -