Automated News Item Categorization

Hrvoje Bacan, Igor S. Pandzic*
Department of Telecommunications, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
{Hrvoje.Bacan,Igor.Pandzic}@fer.hr
*currently JSPS Invitation Fellow, Kyoto University, Nishida-Sumi Lab

Darko Gulija
Croatian News Agency HINA, Zagreb, Croatia
Darko.Gulija@hina.hr

JSAI 2005

Text categorization
- The procedure of labeling a textual document with one or more predefined categories
- Usage:
  - Information retrieval systems
  - Web site classification
  - Spam filters
  - Categorization of news items

Importance of metadata in the news industry
- Dramatic increase in news quantity > information overload > decrease in information usability
- In the news industry, speed is essential, so recipients must rely on metadata
- For practical purposes, a news story without metadata or with wrong metadata does not exist

News industry standards
- The International Press Telecommunications Council (IPTC) develops international standards for news data interchange
  - NewsCodes™: standard coding of metadata
  - NewsML™: standard language for news exchange

NewsCodes™: standard for news metadata
- Genre, Confidence, Urgency, Format, etc.
- Subject Reference System (SRS)
  - Oldest and most used NewsCodes™ set
  - Defines approx. 1000 categories of news
  - 3 hierarchical levels
- SRS top level (17 categories):
  - Arts, Culture and Entertainment
  - Crime, Law and Justice
  - Disaster and Accident
  - Economy, Business and Finance
  - Education
  - Environmental Issue
  - Health
  - etc.

NewsML™: news exchange language
- Standard markup language for global news exchange
- Based on XML
- Intended for electronic production, delivery and archiving of news items
- Incorporates NewsCodes™ metadata
- Accepted by many major news agencies
- In the process of becoming a national standard in Japan

Text categorization in the news industry
- Necessary, but human categorization is not practical:
  - slow
  - inconsistent
- Many news providers use automatic tools
  - fast, consistent, reasonably good accuracy
- The business process allows human intervention

Text categorization process
- The task can be divided into two main parts:
  - Document indexing: represent a document as a numerical vector
  - Training and classification: actually classify the indexed document

Document indexing
- Each document is represented by a set of weights corresponding to representative keywords (terms)
- Feature selection
  - Which set of terms to use?
  - Selected once for the whole corpus of documents
- Weight assignment
  - Done for each document

Feature selection
- Choose a set of keywords (terms) that are useful in distinguishing documents from each other
- Not all terms are equally useful
  - Very frequent terms are too general (e.g. "and", "the")
  - Less frequent terms are likely to be more typical and representative of the document contents
  - Very infrequent terms are probably errors or special cases
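
The feature selection idea above can be illustrated with a small sketch. The slides do not say which selection criterion the system uses, so this example assumes a simple document-frequency filter: terms that occur in too many documents are dropped as too general, and terms that occur in too few are dropped as likely errors or special cases. The function names, the thresholds `min_df` and `max_df_ratio`, and the toy corpus are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def select_features(documents, min_df=2, max_df_ratio=0.5):
    """Return the sorted list of terms kept as features.

    A term is kept if it occurs in at least `min_df` documents and in at
    most `max_df_ratio` of all documents. Both thresholds are assumptions
    chosen for illustration, not values from the presentation.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(tokenize(text)))
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    )


# Toy corpus (hypothetical); "xqzt" stands in for a typo or rare token.
corpus = [
    "the government raised the tax rate",
    "the team won the football match and the cup",
    "the tax reform and the budget",
    "the football league and the xqzt final",
]

vocabulary = select_features(corpus)
print(vocabulary)
```

With these thresholds, "the" and "and" are rejected as too frequent, single-occurrence terms such as "xqzt" are rejected as too rare, and only the mid-frequency terms remain. The vocabulary is computed once for the whole corpus, as the slide describes.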

Weight assignment
- Convert a document into a vector of weights
- Each weight should represent the importance of the particular keyword to the meaning of the document
  - A keyword appearing more frequently in this document is more important
  - A keyword appearing more frequently in other documents is less important
- Term Frequency Inverse Document Frequency function (tf-idf)

Training and classification
- Index the whole training set of documents
  - 30+ manually classified training documents for each category
- K-Nearest Neighbors (k-NN) method
  - Index the unknown document
  - Find the k nearest neighbors among the training documents in terms of distance between vectors
  - Predict the category of the unknown news item by the majority label of its neighbors
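
The two steps above, tf-idf weighting and k-NN classification, can be sketched together in a few lines. The slides do not specify the exact tf-idf variant or the distance measure, so this example assumes the common form weight = tf × log(N / df) and cosine distance between vectors. All function names and the toy training set are hypothetical, and the training set is far smaller than the 30+ documents per category mentioned above.

```python
import math
from collections import Counter


def tfidf_vector(tokens, vocabulary, doc_freq, n_docs):
    """Weight each vocabulary term by tf * log(N / df) (assumed variant)."""
    counts = Counter(tokens)
    return [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(query, training, k):
    """Return the majority label among the k nearest training vectors."""
    nearest = sorted(training, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set of already tokenized, manually labeled documents.
training_docs = [
    (["football", "match", "goal"], "sport"),
    (["football", "league", "goal"], "sport"),
    (["tax", "budget", "reform"], "economy"),
    (["tax", "rate", "budget"], "economy"),
]

vocabulary = sorted({t for tokens, _ in training_docs for t in tokens})
doc_freq = Counter(t for tokens, _ in training_docs for t in set(tokens))
n_docs = len(training_docs)

# Index the training set, then index and classify an unknown document.
training = [
    (tfidf_vector(tokens, vocabulary, doc_freq, n_docs), label)
    for tokens, label in training_docs
]
query = tfidf_vector(["tax", "budget", "deficit"], vocabulary, doc_freq, n_docs)
prediction = knn_predict(query, training, k=3)
print(prediction)
```

Terms outside the vocabulary (here "deficit") are simply ignored when the unknown document is indexed. An odd k is used in the sketch to make a tied vote between two categories less likely.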

Implementation
- The system consists of three components:
  - XML parser for the NewsML™ news items
  - Training algorithm
  - Classification algorithm
- Implemented as a Java servlet on the web

Results
- Precision measurement
  - 476 manually classified test news items outside the training set
  - Measured the percentage of test items for which the system gave the same result as the manual classification
  - [Chart: precision for k = 5, 10, 15, 20; vertical axis from 0.805 to 0.85]
- Subjective test
  - A news professional used the system on 150 news items and scored the results
  - 137 (91.4%) results scored as correct
  - System judged as suitable for practical use
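
The precision measurement described above is the fraction of test items for which the system's category matches the manual one. A minimal sketch of that computation follows; the function name and the five-item label lists are hypothetical and are not the 476-item test set from the presentation.

```python
def agreement_rate(predicted, manual):
    """Fraction of items where the predicted label equals the manual label."""
    if len(predicted) != len(manual):
        raise ValueError("label lists must have the same length")
    if not manual:
        raise ValueError("label lists must not be empty")
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)


# Hypothetical labels for five test items.
manual = ["sport", "economy", "sport", "politics", "economy"]
predicted = ["sport", "economy", "politics", "politics", "economy"]

rate = agreement_rate(predicted, manual)
print(f"{rate:.1%}")
```

Repeating this measurement for each candidate value of k gives one point per k, which is the kind of comparison the chart on the results slide reports.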

Conclusions and ongoing work
- The system is good enough to be used in the news production process
  - Currently being installed as a web service at the Croatian News Agency
- Needs extension to the second IPTC category level and hierarchical classification
- Lessons learned may prove useful in connection with other research interests