MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts"

Transcription

1 MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS - Data, Decisions and Language, S.A. Abstract This paper describes the participation of MIRACLE research consortium at the VideoCLEF track at CLEF We took part in both the main mandatory Classification task that consists in classifying videos of television episodes using speech transcripts and metadata, and the Keyframe Extraction task, whose objective is to select keyframes that represent individual episodes from a set of supplied keyframes (one from each shot of the video source). For the first task, our system is composed of two main blocks, the first in charge of building the core system knowledge base, and then the set of operational elements that are needed to classify the speech transcripts of the topic episodes and generate the output in RSS format. For the second task, our approach is based on the assumption that the most representative fragment (shot) of each episode is the one whose distance to the whole episode is the lowest, considering a vector space model. 4 runs were submitted in all. Regarding the classification task, we ranked 3 rd (out of 6 participants) in terms of precision and 2 nd in terms of recall. Categories and Subject Descriptors H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.2 Information Storage; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital libraries. H.2 [Database Management]: H.2.5 Heterogeneous Databases; E.2 [Data Storage Representations]. Keywords Video retrieval, domain-specific vocabulary, thesaurus, linguistic engineering, information retrieval, indexing, relevance feedback, multilingual speech transcripts. VideoCLEF, VID2RSS, CLEF, Introduction MIRACLE team is a research consortium formed by research groups of three different universities in Madrid (Universidad Politécnica de Madrid, Universidad Autónoma de Madrid and Universidad Carlos III de Madrid) along with DAEDALUS, a small/medium size enterprise (SME) founded in 1998 as a spin-off of two of these groups and a leading company in the field of linguistic technologies in Spain. MIRACLE has taken part in CLEF since 2003 in many different tracks and tasks, including the main bilingual, monolingual and cross lingual tasks as well as in ImageCLEF [4] [8], Question Answering, WebCLEF, GeoCLEF and VideoCLEF tracks. This paper describes our participation in the VideoCLEF task, a new track for CLEF The goal of this track is to develop and evaluate tasks in processing video content in a multilingual environment. This track for 2008 is dedicated to Vid2RSS task which comprises a number of subtasks including topic classification performed on dual language videos. The main objective involves assigning topic class labels to videos of television episodes. Speech recognition transcripts, metadata records (containing title and description) and video keyframes (and shot boundaries) for each episode are supplied. The video data are Dutch television documentaries and contain Dutch as a dominant language, but also contain a high proportion of spoken English (i.e., interview guests often speak in English). The output format is a set of RSS-feeds, one for each topic class, created by concatenating the metadata records for the episodes assigned to a given topic class. We have participated in the main mandatory Classification task that consists in classifying videos of television episodes using speech transcripts and metadata, and in the Keyframe Extraction task, whose objective is to select

2 keyframes that represent individual episodes from a set of supplied keyframes (one from each shot of the video source). 2. Classification task The objective of the mandatory Classification task is to perform the speech recognition transcript-based topic classification (i.e., classify the videos of the television documentary episodes using the speech recognition output only). Videos include dual language (Dutch + English) episodes. Output is 10 topic-based feeds, each containing the episodes that have been classified in that topic category. The defined topic classes are Archeology, Architecture, Chemistry, Dance, Film, History, Music, Paintings, Scientific research and Visual arts. While classification task I is based on episode speech transcripts only, classification task II allows to use episode metadata for classification. Figure 1 shows the logical architecture of our system. The system is composed of two main blocks. The first block is in charge of building a corpus that can be used as the core system knowledge base. The second block includes the set of operational elements that are needed to classify the speech transcripts of the topic episodes and generate the output in RSS format. Those operational elements cover an information retrieval system and a classifier, as well as auxiliary modules for text extraction, filtering and RSS generation. Figure 1. Overview of the classification system The first step is to obtain the necessary training data for the classifier, as part of the task is, given the description of the subject class, to gather the necessary data to train the classifier. Our knowledge base for training the classifier was generated from Wikipedia articles. In order to do so, we first established a matching between the topic classes provided for the task and the classification topics that Wikipedia uses for articles, encoded in metadata. The next step was to obtain a list of Wikipedia articles belonging to each of the 10 topic class, for each task language, i.e. English and Dutch. Table 1 shows the number of articles for each topic class and language. Table 1. Articles in the training set Topic EN NL Archaeology Architecture Chemistry Dance Film History Music Paintings Scientific research Visual arts

3 Next, each document is processed through the following sequence of operations: 1. Text extraction: Ad-hoc scripts are run to obtain the actual text content of articles, filtering out Wikipedia tags. 2. Diacritics removal and conversion to lowercase: all terms are normalized by removing diacritics (in the case of Dutch) and changing all letters to lowercase. 3. Filtering: All words recognized as stopwords are filtered out. Stopwords in the two target languages were initially obtained from [6] and afterwards extended using several other sources [3] as well as our own knowledge and resources. 4. Stemming: This process is applied to each one of the terms to be indexed or used for retrieval. Standard stemmers from Porter [5] have been used. The processed corpus is indexed with Lucene retrieval engine [1] to allow a fast and efficient access to information needed for the classification. Two different indexes are built, one for each language. The classifier is based on the k-nearest Neighbour algorithm [7]. To find the class for a given episode, the content is first processed as explained before. Then the whole set of resulting terms is used to build a query that is given to the Lucene search engine to obtain the list of the top k most relevant (i.e., most similar) articles in the Wikipedia-based corpus. Finally, the class of the given episode is the most frequent class in the top k results. After some preliminary experiments, a value of k set to 10 was chosen for our runs. 3. Keyframe extraction task Our system is based on the assumption that, in the context of a vector space model representation [2], the most representative fragment (shot) of each episode (represented by a vector) is the one whose distance to the whole episode (also a vector) is the lowest. The contents of both each shot and the whole episode are first processed as explained before, after extracting the text from the speech transcription. Based on the vector space model, a weighted vector is built for each episode and set of shots, representing the term frequency of the main most significant terms in the given episode. Finally the keyframe extraction module selects the keyframe belonging to the most representative shot in the episode, which is the shot whose vector has the lowest distance from (i.e., is nearest) the vector of the whole episode. The metric used here was the cosine distance [2], although Euclidean distance could also be valid. An overview of the system architecture is shown in Figure 2. Figure 2. Overview of the keyframe extraction system 4. Experiments and results We have submitted different runs for each proposed subtask: three for the classification task and one for the keyframe extraction. Table 2 shows the list of submitted runs. Table 2. List of runs Run Identifier Language Task MIRACLE-CNL NL Classification I + Keyframe_Extraction I MIRACLE-CNLEN NL+EN Classification I MIRACLE-CNLMeta NL Classification II

4 In short, CNL run only uses the index for the Dutch corpus, CNLEN uses both indexes and gathers together results from any of them, and CNLMeta uses the Dutch index but also includes the episode metadata to build the query for the retrieval engine. Table 3 shows the values of precision and recall obtained by the different runs in the classification task. Classes marked in italics are those whose number of relevant documents is equal to 0, i.e., no episode belongs to this class, according to the task organizers. It can be observed that the best precision is achieved with the CNL run in which only the Dutch transcription is used. When the knowledge base and the transcription in English are involved, results are noticeably and significantly worse. This could be directly motivated by the fact that the dominant language of the episodes is Dutch. However, the best modelled class is Music, corresponding to one of the classes that own a higher number of Wikipedia articles in the training set. This may suggest other explanations, such as the fact that training set (the knowledge base) for English is much smaller than the one available for Dutch. Another possible explanation could be that the voice recognition system for English is not as good as for Dutch. Obviously, these issues have to be further studied. Table 3. Classification task results Precision Recall CNL CNLEN CNLMeta CNL CNLEN CNLMeta Archaeology Architecture Chemistry Dance Film History Music Paintings Scientific Research Visual Arts ALL (microaveraged 1 ) ALL (macroaveraged 1 ) By definition, macroaverage values are computed as the mean of values of all classes that are first individually calculated. In contrast, the microaverage measures first obtain the aggregate counts for all classes and then calculate the values of precision and recall. Figure 3. Summary of classification task results

5 Although results for the classification task II, i.e. including metadata, seem to be worse than results for classification task I, this is misleading. For all classes except Scientific research and Visual arts, precision values are higher if metadata is used, but the average value is worse specially due to the very low value for Visual arts. This issue still has to be analyzed. Comparing to other groups, we successfully ranked 3 rd out of 6 participants in terms of precision, 2 nd in terms of recall and also f-score (not shown in the table). Regarding the keyframe extraction task, MIRACLE was the only participant who submitted results. Thus, the evaluation has been manually made. Five native Dutch speakers (in the age range 20-50) were presented with the title and the description of each video episode along with two keyframes, one manually extracted and one automatically extracted provided by us. They were asked to choose which keyframe they preferred. Of the 40 videos, 1 did not have a keyframe. In two cases, the keyframe chosen manually and that chosen by the system was the same. Thus, each subject was asked about 37 different pairs of keyframes. On average, the subjects chose the automatic over the manually selected keyframe in 15.2 cases (41.08%) and the manually over the automatic in 21.8 cases (58.92%). Table 4 shows the results of the evaluation process. The Automatic keyframe and Manual keyframe column represent the number of episodes that the evaluator has selected as more adequate between the automatically or manually extracted keyframe. The last column tries to show the number of correctly extracted keyframes, assuming that the evaluator has chosen correctly. Table 4. Keyframe extraction task results Automatic keyframe Manual Keyframe Well selected? Subject Subject Subject Subject Subject Average 41.08% 58.92% 44.10% These promising figures indicate that the automatically extracted keyframes may be strong competitors with the manual ones in the short- or middle-term future. 5. Conclusions and Future Work After a preliminary analysis of results obtained in the classification task, we can conclude that there seems to be a direct relationship between the knowledge base associated to a given class and results achieved in it. This probably indicates that the architecture of the system and provided algorithms are useful, but more effort must be invested to improve the knowledge base, both in its volume (coverage) and the pre-processing activities. Despite the subjectivity of the keyframe extraction task and lack of any reference experiment to which compare our own system, we can say that these results are promising and encourage us to keep on this line of research for future participations. Acknowledgements This work has been partially supported by the Spanish R+D National Plan, by means of the project BRAVO (Multilingual and Multimodal Answers Advanced Search Information Retrieval), TIN C03-03 and by Madrid R+D Regional Plan, by means of the project MAVIR (Enhancing the Access and the Visibility of Networked Multilingual Information for the Community of Madrid), S-0505/TIC/ References [1] Apache Lucene project. On line [Visited 10/08/2008]. [2] Baeza-Yates, R., Ribeiro-Prieto B.: Modern Information Retrieval. Addison Wesley (1999).

6 [3] CLEF 2005 Multilingual Information Retrieval resources page. On line ~gjones/clef2005/multi-8/ [Visited 10/08/2008]. [4] Martínez-Fernández, J.L.; Villena-Román, Julio; García-Serrano, Ana M.; González-Cristóbal, José Carlos. Combining Textual and Visual Features for Image Retrieval. Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Vienna, Austria, Revised Selected Papers. Carol Peters et al (Eds.). Lecture Notes in Computer Science, Vol. 4022, ISSN: [5] Porter, Martin. Snowball stemmers and resources page. On line [Visited 10/08/2008]. [6] University of Neuchatel. Page of resources for CLEF (Stopwords, transliteration, stemmers ). On line [Visited 10/08/2008]. [7] Witten, Ian H.; Frank, Eibe. Data Mining: Practical machine learning tools and techniques, 2nd Edition, Morgan Kaufmann, San Francisco, [8] Villena-Román, Julio; Lana-Serrano, Sara; Martínez-Fernández, José Luis; González-Cristóbal, José Carlos. MIRACLE at ImageCLEFphoto 2007: Evaluation of Merging Strategies for Multilingual and Multimedia Information Retrieval. Working Notes of the 2007 CLEF Workshop, Budapest, Hungary, September 2007.

UAIC: Participation in VideoCLEF Task

UAIC: Participation in VideoCLEF Task UAIC: Participation in VideoCLEF Task Tudor-Alexandru Dobrilă, Mihail-Ciprian Diaconaşu, Irina-Diana Lungu, Adrian Iftene UAIC: Faculty of Computer Science, Alexandru Ioan Cuza University, Romania {tudor.dobrila,

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Modules: Automatic tagging, and (2) Manual revision by expert linguists, controlled by devoted tool.

Modules: Automatic tagging, and (2) Manual revision by expert linguists, controlled by devoted tool. Morphosyntactic Annotation in Spanish Service (PoS and lemmatization) Authors: Antonio Moreno-Sandoval, Head of the Laboratorio de Lingüística Informática (Computational Linguistics Lab) and José María

More information

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,

More information

2009 Springer. This document is published in: Adaptive Multimedia Retrieval. LNCS 6535 (2009) pp. 12 23 DOI: 10.1007/978-3-642-18449-9_2

2009 Springer. This document is published in: Adaptive Multimedia Retrieval. LNCS 6535 (2009) pp. 12 23 DOI: 10.1007/978-3-642-18449-9_2 Institutional Repository This document is published in: Adaptive Multimedia Retrieval. LNCS 6535 (2009) pp. 12 23 DOI: 10.1007/978-3-642-18449-9_2 2009 Springer Some Experiments in Evaluating ASR Systems

More information

SINAI at WEPS-3: Online Reputation Management

SINAI at WEPS-3: Online Reputation Management SINAI at WEPS-3: Online Reputation Management M.A. García-Cumbreras, M. García-Vega F. Martínez-Santiago and J.M. Peréa-Ortega University of Jaén. Departamento de Informática Grupo Sistemas Inteligentes

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

Movie Classification Using k-means and Hierarchical Clustering

Movie Classification Using k-means and Hierarchical Clustering Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India dharak_shah@daiict.ac.in Saheb Motiani

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Overview of iclef 2008: search log analysis for Multilingual Image Retrieval

Overview of iclef 2008: search log analysis for Multilingual Image Retrieval Overview of iclef 2008: search log analysis for Multilingual Image Retrieval Julio Gonzalo Paul Clough Jussi Karlgren UNED U. Sheffield SICS Spain United Kingdom Sweden julio@lsi.uned.es p.d.clough@sheffield.ac.uk

More information

Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval

Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval Approaches of Using a Word-Image Ontology and an Annotated Image Corpus as Intermedia for Cross-Language Image Retrieval Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information

More information

LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task

LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task LABERINTO at ImageCLEF 2011 Medical Image Retrieval Task Jacinto Mata, Mariano Crespo, Manuel J. Maña Dpto. de Tecnologías de la Información. Universidad de Huelva Ctra. Huelva - Palos de la Frontera s/n.

More information

TweetAlert: Semantic Analytics in Social Networks for Citizen Opinion Mining in the City of the Future

TweetAlert: Semantic Analytics in Social Networks for Citizen Opinion Mining in the City of the Future TweetAlert: Semantic Analytics in Social Networks for Citizen Opinion Mining in the City of the Future Julio Villena-Román 1,2, Adrián Luna-Cobos 1,3, José Carlos González-Cristóbal 3,1 1 DAEDALUS - Data,

More information

Facilitating Business Process Discovery using Email Analysis

Facilitating Business Process Discovery using Email Analysis Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process

More information

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

More information

An Information Retrieval using weighted Index Terms in Natural Language document collections

An Information Retrieval using weighted Index Terms in Natural Language document collections Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia

More information

Automatic Web Page Classification

Automatic Web Page Classification Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

More information

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet Muhammad Atif Qureshi 1,2, Arjumand Younus 1,2, Colm O Riordan 1,

More information

Research on News Video Multi-topic Extraction and Summarization

Research on News Video Multi-topic Extraction and Summarization International Journal of New Technology and Research (IJNTR) ISSN:2454-4116, Volume-2, Issue-3, March 2016 Pages 37-39 Research on News Video Multi-topic Extraction and Summarization Di Li, Hua Huo Abstract

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

The University of Lisbon at CLEF 2006 Ad-Hoc Task

The University of Lisbon at CLEF 2006 Ad-Hoc Task The University of Lisbon at CLEF 2006 Ad-Hoc Task Nuno Cardoso, Mário J. Silva and Bruno Martins Faculty of Sciences, University of Lisbon {ncardoso,mjs,bmartins}@xldb.di.fc.ul.pt Abstract This paper reports

More information

MIRACLE Question Answering System for Spanish at CLEF 2007

MIRACLE Question Answering System for Spanish at CLEF 2007 MIRACLE Question Answering System for Spanish at CLEF 2007 César de Pablo-Sánchez, José Luis Martínez, Ana García-Ledesma, Doaa Samy, Paloma Martínez, Antonio Moreno-Sandoval, Harith Al-Jumaily Universidad

More information

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

Towards a Visually Enhanced Medical Search Engine

Towards a Visually Enhanced Medical Search Engine Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The

More information

2014, IJARCSSE All Rights Reserved Page 629

2014, IJARCSSE All Rights Reserved Page 629 Volume 4, Issue 6, June 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Improving Web Image

More information

Architectural Components of Digital Library: A Practical Example Using DSpace

Architectural Components of Digital Library: A Practical Example Using DSpace Architectural Components of Digital Library: A Practical Example Using DSpace Sandip Das MS-LIS Student, DRTC ISI Bangalore, sandip@drtc.isibang.ac.in M. Krishnamurty DRTC, ISI Bangalore-560059 mkrishna_murthy@hotmail.com

More information

Name of Module: Big Data ECTS: 6 Module-ID: Person Responsible for Module (Name, Mail address): Angel Rodríguez, arodri@fi.upm.es

Name of Module: Big Data ECTS: 6 Module-ID: Person Responsible for Module (Name, Mail address): Angel Rodríguez, arodri@fi.upm.es Name of Module: Big Data ECTS: 6 Module-ID: Person Responsible for Module (Name, Mail address): Angel Rodríguez, arodri@fi.upm.es University: UPM Departments: DATSI, DLSIIS 1. Prerequisites for Participation

More information

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:

More information

Improving Non-English Web Searching (inews07)

Improving Non-English Web Searching (inews07) SIGIR 2007 WORKSHOP REPORT Improving Non-English Web Searching (inews07) Fotis Lazarinis Technological Educational Institute Mesolonghi, Greece lazarinf@teimes.gr Jesus Vilares Ferro University of A Coruña

More information

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database

ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database ToxiCat: Hybrid Named Entity Recognition services to support curation of the Comparative Toxicogenomic Database Dina Vishnyakova 1,2, 4, *, Julien Gobeill 1,3,4, Emilie Pasche 1,2,3,4 and Patrick Ruch

More information

Search Engine Architecture I

Search Engine Architecture I Search Engine Architecture I Software Architecture The high level structure of a software system Software components The interfaces provided by those components The relationships between those components

More information

Search Result Optimization using Annotators

Search Result Optimization using Annotators Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,

More information

A COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS

A COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS A COMBINED TEXT MINING METHOD TO IMPROVE DOCUMENT MANAGEMENT IN CONSTRUCTION PROJECTS Caldas, Carlos H. 1 and Soibelman, L. 2 ABSTRACT Information is an important element of project delivery processes.

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Information Retrieval Systems in XML Based Database A review

Information Retrieval Systems in XML Based Database A review Information Retrieval Systems in XML Based Database A review Preeti Pandey 1, L.S.Maurya 2 Research Scholar, IT Department, SRMSCET, Bareilly, India 1 Associate Professor, IT Department, SRMSCET, Bareilly,

More information

Overview. Clustering. Clustering vs. Classification. Supervised vs. Unsupervised Learning. Connectionist and Statistical Language Processing

Overview. Clustering. Clustering vs. Classification. Supervised vs. Unsupervised Learning. Connectionist and Statistical Language Processing Overview Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes clustering vs. classification supervised vs. unsupervised

More information

LogCLEF 2011 Multilingual Log File Analysis: Language identification, query classification, and success of a query

LogCLEF 2011 Multilingual Log File Analysis: Language identification, query classification, and success of a query LogCLEF 2011 Multilingual Log File Analysis: Language identification, query classification, and success of a query Giorgio Maria Di Nunzio 1, Johannes Leveling 2, and Thomas Mandl 3 1 Department of Information

More information

Exploring Practical Data Mining Techniques at Undergraduate Level

Exploring Practical Data Mining Techniques at Undergraduate Level Exploring Practical Data Mining Techniques at Undergraduate Level ERIC P. JIANG University of San Diego 5998 Alcala Park, San Diego, CA 92110 UNITED STATES OF AMERICA jiang@sandiego.edu Abstract: Data

More information

Semantic Search in Portals using Ontologies

Semantic Search in Portals using Ontologies Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br

More information

M3039 MPEG 97/ January 1998

M3039 MPEG 97/ January 1998 INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND ASSOCIATED AUDIO INFORMATION ISO/IEC JTC1/SC29/WG11 M3039

More information

Automated Collaborative Filtering Applications for Online Recruitment Services

Automated Collaborative Filtering Applications for Online Recruitment Services Automated Collaborative Filtering Applications for Online Recruitment Services Rachael Rafter, Keith Bradley, Barry Smyth Smart Media Institute, Department of Computer Science, University College Dublin,

More information

SEMANTIC VIDEO ANNOTATION IN E-LEARNING FRAMEWORK

SEMANTIC VIDEO ANNOTATION IN E-LEARNING FRAMEWORK SEMANTIC VIDEO ANNOTATION IN E-LEARNING FRAMEWORK Antonella Carbonaro, Rodolfo Ferrini Department of Computer Science University of Bologna Mura Anteo Zamboni 7, I-40127 Bologna, Italy Tel.: +39 0547 338830

More information

An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials

An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials ehealth Beyond the Horizon Get IT There S.K. Andersen et al. (Eds.) IOS Press, 2008 2008 Organizing Committee of MIE 2008. All rights reserved. 3 An Ontology Based Method to Solve Query Identifier Heterogeneity

More information

Projektgruppe. Categorization of text documents via classification

Projektgruppe. Categorization of text documents via classification Projektgruppe Steffen Beringer Categorization of text documents via classification 4. Juni 2010 Content Motivation Text categorization Classification in the machine learning Document indexing Construction

More information

WEB PAGE CATEGORISATION BASED ON NEURONS

WEB PAGE CATEGORISATION BASED ON NEURONS WEB PAGE CATEGORISATION BASED ON NEURONS Shikha Batra Abstract: Contemporary web is comprised of trillions of pages and everyday tremendous amount of requests are made to put more web pages on the WWW.

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING MEDIA MONITORING AND ANALYSIS GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING Searchers Reporting Delivery (Player Selection) DATA PROCESSING AND CONTENT REPOSITORY ADMINISTRATION AND MANAGEMENT

More information

Er is door mij gebruik gemaakt van dia s uit presentaties van o.a. Anastasios Kesidis, CIL, Athene Griekenland, en Asaf Tzadok, IBM Haifa Research Lab

Er is door mij gebruik gemaakt van dia s uit presentaties van o.a. Anastasios Kesidis, CIL, Athene Griekenland, en Asaf Tzadok, IBM Haifa Research Lab IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Er is door mij gebruik gemaakt van dia s uit presentaties

More information

ELPUB Digital Library v2.0. Application of semantic web technologies

ELPUB Digital Library v2.0. Application of semantic web technologies ELPUB Digital Library v2.0 Application of semantic web technologies Anand BHATT a, and Bob MARTENS b a ABA-NET/Architexturez Imprints, New Delhi, India b Vienna University of Technology, Vienna, Austria

More information

Stakeholder Communication in Software Project Management. Modelling of Communication Features

Stakeholder Communication in Software Project Management. Modelling of Communication Features Stakeholder Communication in Software Project Management. Modelling of Communication Features IOAN POP * and ALEXANDRA-MIHAELA POP ** * Department of Mathematics and Informatics ** Department of Industrial

More information

C19 Machine Learning

C19 Machine Learning C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

More information

Language technologies for Education: recent results by the MLLP group

Language technologies for Education: recent results by the MLLP group Language technologies for Education: recent results by the MLLP group Alfons Juan 2nd Internet of Education Conference 2015 18 September 2015, Sarajevo Contents The MLLP research group 2 translectures

More information

AN APPROACH FOR SUPERVISED SEMANTIC ANNOTATION

AN APPROACH FOR SUPERVISED SEMANTIC ANNOTATION AN APPROACH FOR SUPERVISED SEMANTIC ANNOTATION A. DORADO AND E. IZQUIERDO Queen Mary, University of London Electronic Engineering Department Mile End Road, London E1 4NS, U.K. E-mail: {andres.dorado, ebroul.izquierdo}@elec.qmul.ac.uk

More information

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM. DATA MINING TECHNOLOGY Georgiana Marin 1 Abstract In terms of data processing, classical statistical models are restrictive; it requires hypotheses, the knowledge and experience of specialists, equations,

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web

More information

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

More information

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Optimization of Image Search from Photo Sharing Websites Using Personal Data

Optimization of Image Search from Photo Sharing Websites Using Personal Data Optimization of Image Search from Photo Sharing Websites Using Personal Data Mr. Naeem Naik Walchand Institute of Technology, Solapur, India Abstract The present research aims at optimizing the image search

More information

MLeXAI Workshop. Ingrid Russell, Zdravko Markov. Sanibel Island, FL, May 18, 2009

MLeXAI Workshop. Ingrid Russell, Zdravko Markov. Sanibel Island, FL, May 18, 2009 MLeXAI Workshop Ingrid Russell, Zdravko Markov Sanibel Island, FL, May 18, 2009 Web Document Classification Originally developed in summer 2005 by Ingrid Russell, Zdravko Markov, and Todd Neller Revised

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Dr. Antony Selvadoss Thanamani, Head & Associate Professor, Department of Computer Science, NGM College, Pollachi, India.

Dr. Antony Selvadoss Thanamani, Head & Associate Professor, Department of Computer Science, NGM College, Pollachi, India. Enhanced Approach on Web Page Classification Using Machine Learning Technique S.Gowri Shanthi Research Scholar, Department of Computer Science, NGM College, Pollachi, India. Dr. Antony Selvadoss Thanamani,

More information

Personalized Expedia Hotel Searches

Personalized Expedia Hotel Searches Personalized Expedia Hotel Searches Xinxing Jiang, Yao Xiao, and Shunji Li Stanford University December 13, 2013 Abstract In this paper, we propose machine learning algorithms with search data of Expedia

More information

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance

CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance CAS-ICT at TREC 2005 SPAM Track: Using Non-Textual Information to Improve Spam Filtering Performance Shen Wang, Bin Wang and Hao Lang, Xueqi Cheng Institute of Computing Technology, Chinese Academy of

More information

Web Content Summarization Using Social Bookmarking Service

Web Content Summarization Using Social Bookmarking Service ISSN 1346-5597 NII Technical Report Web Content Summarization Using Social Bookmarking Service Jaehui Park, Tomohiro Fukuhara, Ikki Ohmukai, and Hideaki Takeda NII-2008-006E Apr. 2008 Jaehui Park School

More information

User research for information architecture projects

User research for information architecture projects Donna Maurer Maadmob Interaction Design http://maadmob.com.au/ Unpublished article User research provides a vital input to information architecture projects. It helps us to understand what information

More information

Best Practices for Structural Metadata Version 1 Yale University Library June 1, 2008

Best Practices for Structural Metadata Version 1 Yale University Library June 1, 2008 Best Practices for Structural Metadata Version 1 Yale University Library June 1, 2008 Background The Digital Production and Integration Program (DPIP) is sponsoring the development of documentation outlining

More information

Robust Question Answering for Speech Transcripts: UPC Experience in QAst 2009

Robust Question Answering for Speech Transcripts: UPC Experience in QAst 2009 Robust Question Answering for Speech Transcripts: UPC Experience in QAst 2009 Pere R. Comas and Jordi Turmo TALP Research Center Technical University of Catalonia (UPC) {pcomas,turmo}@lsi.upc.edu Abstract

More information

Study and Analysis of Data Mining Concepts

Study and Analysis of Data Mining Concepts Study and Analysis of Data Mining Concepts M.Parvathi Head/Department of Computer Applications Senthamarai college of Arts and Science,Madurai,TamilNadu,India/ Dr. S.Thabasu Kannan Principal Pannai College

More information

Neelamadhav Gantayat. Prof. Sridhar Iyer. June 28, 2011

Neelamadhav Gantayat. Prof. Sridhar Iyer. June 28, 2011 Automated Construction Of Domain Ontologies From Lecture Notes Neelamadhav Gantayat under the guidance of Prof. Sridhar Iyer Department of Computer Science and Engineering, Indian Institute of Technology,

More information

FHWA Office of Operations (R&D) RDE Release 3.0 Potential Enhancements. 26 March 2014

FHWA Office of Operations (R&D) RDE Release 3.0 Potential Enhancements. 26 March 2014 FHWA Office of Operations (R&D) RDE Release 3.0 Potential Enhancements 26 March 2014 Overview of this Session Present categories for potential enhancements to the RDE over the next year (Release 3.0) or

More information

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools Paper by W. F. Cody J. T. Kreulen V. Krishna W. S. Spangler Presentation by Dylan Chi Discussion by Debojit Dhar THE INTEGRATION OF BUSINESS INTELLIGENCE AND KNOWLEDGE MANAGEMENT BUSINESS INTELLIGENCE

More information

International Journal of Electronics and Computer Science Engineering 1449

International Journal of Electronics and Computer Science Engineering 1449 International Journal of Electronics and Computer Science Engineering 1449 Available Online at www.ijecse.org ISSN- 2277-1956 Neural Networks in Data Mining Priyanka Gaur Department of Information and

More information

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak 9.6.2015

Computer-Based Text- and Data Analysis Technologies and Applications. Mark Cieliebak 9.6.2015 Computer-Based Text- and Data Analysis Technologies and Applications Mark Cieliebak 9.6.2015 Data Scientist analyze Data Library use 2 About Me Mark Cieliebak + Software Engineer & Data Scientist + PhD

More information

Application of ontologies for the integration of network monitoring platforms

Application of ontologies for the integration of network monitoring platforms Application of ontologies for the integration of network monitoring platforms Jorge E. López de Vergara, Javier Aracil, Jesús Martínez, Alfredo Salvador, José Alberto Hernández Networking Research Group,

More information

The University of Amsterdam s Question Answering System at QA@CLEF 2007

The University of Amsterdam s Question Answering System at QA@CLEF 2007 The University of Amsterdam s Question Answering System at QA@CLEF 2007 Valentin Jijkoun, Katja Hofmann, David Ahn, Mahboob Alam Khalid, Joris van Rantwijk, Maarten de Rijke, and Erik Tjong Kim Sang ISLA,

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

Flattening Enterprise Knowledge

Flattening Enterprise Knowledge Flattening Enterprise Knowledge Do you Control Your Content or Does Your Content Control You? 1 Executive Summary: Enterprise Content Management (ECM) is a common buzz term and every IT manager knows it

More information

XML-BASED INTEGRATION: A CASE STUDY

XML-BASED INTEGRATION: A CASE STUDY XML-BASED INTEGRATION: A CASE STUDY Chakib Chraibi, Barry University, cchraibi@mail.barry.edu José Ramirez, Barry University, jramirez@mail.barry.edu Andrew Seaga, Barry University, aseaga@mail.barry.edu

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED

Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED Doctoral Consortium 2013 Dept. Lenguajes y Sistemas Informáticos UNED 17 19 June 2013 Monday 17 June Salón de Actos, Facultad de Psicología, UNED 15.00-16.30: Invited talk Eneko Agirre (Euskal Herriko

More information

Information Retrieval Elasticsearch

Information Retrieval Elasticsearch Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches

More information

Ontology-Based Multilingual Information Retrieval

Ontology-Based Multilingual Information Retrieval Ontology-Based Multilingual Information Retrieval Jacques Guyot * Saïd Radhouani *,** Gilles Falquet * * Centre universitaire d informatique 24, rue Général-Dufour, CH-1211 Genève 4, Switzerland ** Laboratoire

More information

Automated Content Analysis of Discussion Transcripts

Automated Content Analysis of Discussion Transcripts Automated Content Analysis of Discussion Transcripts Vitomir Kovanović v.kovanovic@ed.ac.uk Dragan Gašević dgasevic@acm.org School of Informatics, University of Edinburgh Edinburgh, United Kingdom v.kovanovic@ed.ac.uk

More information

Open Source IR Tools and Libraries

Open Source IR Tools and Libraries Open Source IR Tools and Libraries Giorgos Vasiliadis, gvasil@csd.uoc.gr CS-463 Information Retrieval Models Computer Science Department University of Crete 1 Outline Google Search API Lucene Terrier Lemur

More information

Inner Classification of Clusters for Online News

Inner Classification of Clusters for Online News Inner Classification of Clusters for Online News Harmandeep Kaur 1, Sheenam Malhotra 2 1 (Computer Science and Engineering Department, Shri Guru Granth Sahib World University Fatehgarh Sahib) 2 (Assistant

More information

Open Source Techniques push Enterprise Search & Search Driven Applications and especially foster the application of Text Analytics

Open Source Techniques push Enterprise Search & Search Driven Applications and especially foster the application of Text Analytics Open Source Techniques push Enterprise Search & Search Driven Applications and especially foster the application of Text Analytics Exploring the Future of Enterprise Search, IPTS, Seville, October 2011

More information

Sentiment Analysis on Big Data

Sentiment Analysis on Big Data SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social

More information

Evolutionary Algorithms using Evolutionary Algorithms

Evolutionary Algorithms using Evolutionary Algorithms Evolutionary Algorithms using Evolutionary Algorithms Neeraja Ganesan, Sowmya Ravi & Vandita Raju S.I.E.S GST, Mumbai, India E-mail: neer.ganesan@gmail.com, Abstract A subset of AI is, evolutionary algorithm

More information

Mining an Online Auctions Data Warehouse

Mining an Online Auctions Data Warehouse Proceedings of MASPLAS'02 The Mid-Atlantic Student Workshop on Programming Languages and Systems Pace University, April 19, 2002 Mining an Online Auctions Data Warehouse David Ulmer Under the guidance

More information

Segmentation and Classification of Online Chats

Segmentation and Classification of Online Chats Segmentation and Classification of Online Chats Justin Weisz Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 jweisz@cs.cmu.edu Abstract One method for analyzing textual chat

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Web Content Mining. Dr. Ahmed Rafea

Web Content Mining. Dr. Ahmed Rafea Web Content Mining Dr. Ahmed Rafea Outline Introduction The Web: Opportunities & Challenges Techniques Applications Introduction The Web is perhaps the single largest data source in the world. Web mining

More information