Exam in course TDT4215 Web Intelligence - Solutions and guidelines -




English Student no: ...                                         Page 1 of 12

Contact during the exam: Geir Solskinnsbakk, phone: 94218

Exam in course TDT4215 Web Intelligence - Solutions and guidelines -
Friday May 21, 2010, Time: 0900-1300

Allowed means of assistance: D. No printed or handwritten material allowed. Simple calculator allowed.

The weighting of the questions is indicated by percentages. All answers are to be entered directly in the designated boxes on the question sheet, and no additional sheets of paper are to be handed in. For the multiple choice questions, you should set a cross in the box for the correct answer. There is only one correct answer for each multiple choice question. A list of formulas from the book and the papers/chapters is included at the end of the exam set.

Question 1. Vector Space Model (25%)

Assume a document collection with the following four documents:

Document #1: Trondheim is the third largest city in Norway and is situated on the Nidelva river. (Length = 0.89)
Document #2: Trondheim is a university city with many students.
Document #3: The Nidaros cathedral is a popular tourist attraction. (Length = 1.13)
Document #4: Nidaros was the original name of Trondheim. Trondheim is the modern name. (Length = 0.89)

Small letters and capital letters are treated as the same. Do not perform any stemming, lemmatization or stop word removal.

a) Explain the idea behind the cosine similarity in vector space information retrieval (2.5%).
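As an illustration of the idea (not part of the official solution), cosine similarity can be sketched in a few lines of Python; the toy term-weight vectors below are made up:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts):
    the dot product divided by the product of the vector lengths."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    len_a = math.sqrt(sum(w * w for w in vec_a.values()))
    len_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if len_a == 0.0 or len_b == 0.0:
        return 0.0
    return dot / (len_a * len_b)

# Length normalization makes the measure depend on direction, not magnitude:
print(cosine_similarity({"trondheim": 1.0}, {"trondheim": 5.0}))  # 1.0
print(cosine_similarity({"trondheim": 1.0}, {"nidaros": 1.0}))    # 0.0
```

Because both vectors are length-normalized, a long document is not favoured over a short one with the same term distribution.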

See section 2.5.3 in Baeza-Yates & Ribeiro-Neto: Modern Information Retrieval.

b) Explain the motivation behind the tf-idf formula and each of its components (2.5%).

See section 2.5.3 in Baeza-Yates & Ribeiro-Neto: Modern Information Retrieval.

c) Calculate the length of document #2. Show the calculation (5%).

Vocabulary:

Term         Doc 1  Doc 2  Doc 3  Doc 4    n   log(N/n)   di = tf-idf   di^2
a                   1      1               2   0.30103    0.30103       0.090619
city         1      1                      2   0.30103    0.30103       0.090619
is           2      1      1      1        4   0          0             0

many                1                      1   0.60206    0.60206       0.362476
students            1                      1   0.60206    0.60206       0.362476
trondheim    1      1             2        3   0.124939   0.124939      0.015610
university          1                      1   0.60206    0.60206       0.362476
with                1                      1   0.60206    0.60206       0.362476

Maxfreq = 1

Sum(di^2) = 1.646753
sqrt(Sum(di^2)) = 1.28

Length of document #2 is 1.28.

d) Calculate the tf-idf score for the terms trondheim and nidaros for each of the documents (5%).

Nidaros:
idf(nidaros) = log(4/2)
D1, D2: tf-idf = 0 (do not contain nidaros)
D3: tf-idf(nidaros) = 1/1 * log(4/2) = 0.301 (maxfreq = 1)
D4: tf-idf(nidaros) = 1/2 * log(4/2) = 0.151 (maxfreq = 2)

Trondheim:
idf(trondheim) = log(4/3)
D1: tf-idf(trondheim) = 1/2 * log(4/3) = 0.062 (maxfreq = 2)
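The arithmetic in c) can be reproduced with a short, illustrative Python sketch; the term counts for document #2 are the ones from the vocabulary table:

```python
import math

def tfidf(tf, maxfreq, N, n):
    """The exam's weighting: normalized term frequency times idf = log10(N/n)."""
    return (tf / maxfreq) * math.log10(N / n)

def vector_length(weights):
    """Document length = square root of the sum of squared term weights."""
    return math.sqrt(sum(w * w for w in weights))

# Term weights for document #2 (maxfreq = 1, collection size N = 4):
# four terms occur in 1 document each, two in 2 documents, trondheim in 3,
# and 'is' in all 4 (idf = log10(4/4) = 0, so it contributes nothing).
doc2 = ([tfidf(1, 1, 4, 1)] * 4      # many, students, university, with
        + [tfidf(1, 1, 4, 2)] * 2    # a, city
        + [tfidf(1, 1, 4, 3)]        # trondheim
        + [tfidf(1, 1, 4, 4)])       # is
print(round(vector_length(doc2), 2))  # 1.28
```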

D2: tf-idf(trondheim) = 1/1 * log(4/3) = 0.125 (maxfreq = 1)
D3: tf-idf(trondheim) = 0 (does not contain trondheim)
D4: tf-idf(trondheim) = 2/2 * log(4/3) = 0.125

e) Calculate the cosine similarity for each of the documents with respect to the query (10%).

Query: trondheim nidaros

Show the list of ranked documents.

Query weights: W_i,q = (0.5 + 0.5 * tf_i,q / maxf_q) * log(N/n_i)
W_nidaros,q = (0.5 + 0.5 * 1/1) * log(4/2) = 0.301
W_trondheim,q = (0.5 + 0.5 * 1/1) * log(4/3) = 0.125

Length of query: L_q = sqrt(0.301^2 + 0.125^2) = 0.326

D1: tf-idf(nidaros) = 0, tf-idf(trondheim) = 0.062, Length = 0.89
D2: tf-idf(nidaros) = 0,

tf-idf(trondheim) = 0.125, Length = 1.28
D3: tf-idf(nidaros) = 0.301, tf-idf(trondheim) = 0, Length = 1.13
D4: tf-idf(nidaros) = 0.151, tf-idf(trondheim) = 0.125, Length = 0.89

Sim(d1,q) = (0 * 0.301 + 0.062 * 0.125) / (0.89 * 0.326) = 0.0267
Sim(d2,q) = (0 * 0.301 + 0.125 * 0.125) / (1.28 * 0.326) = 0.0374
Sim(d3,q) = (0.301 * 0.301 + 0 * 0.125) / (1.13 * 0.326) = 0.2459
Sim(d4,q) = (0.151 * 0.301 + 0.125 * 0.125) / (0.89 * 0.326) = 0.2105

Ranking: D3 > D4 > D2 > D1

Question 2. Retrieval Evaluation (10%)

a) Given a set of relevant documents Rq for a given query q, and the returned (ranked) result from an experimental information retrieval engine for query q, Aq, calculate the interpolated precision at the 11 standard recall levels. You are not required to graph the result. (5%)

Rq = {D4, D12, D23, D26, D30, D33, D60, D73, D83, D99}
Aq = 1. D4, 2. D12, 3. D2, 4. D33, 5. D31, 6. D83, 7. D41, 8. D62, 9. D23, 10. D51, 11. D9, 12. D43, 13. D73, 14. D55, 15. D17

Recall   Precision
0 %      100 %
10 %     100 %
20 %     100 %

30 %     75 %
40 %     67 %
50 %     56 %
60 %     46 %
70 %     0 %
80 %     0 %
90 %     0 %
100 %    0 %

b) What is one of the biggest problems of precision/recall in web information retrieval, and which measure out of the two is more important to a user in web search? (2.5%)

We do not have detailed knowledge of the document collection searched, and hence no exact definition of the set of relevant documents for a given query; the correct set of documents can only be estimated. Precision will most likely be the more important measure, since the user is interested in getting valuable hits high in the ranking.

c) Calculate the R-precision for the query and the ranked list given in a). (2.5%)

Number of relevant documents = 10. Among the first 10 documents retrieved, there are 5 relevant documents; R-precision is 5/10 = 0.5 (50%).

Question 3. Text Categorization (20%)

a) What is the basic hypothesis in using the vector space model for classification and what is the content of the hypothesis? (5%)

The basic hypothesis in using the vector space model for classification is the contiguity hypothesis. Contiguity hypothesis: documents in the same class form a contiguous region, and regions of

different classes do not overlap.

b) Explain briefly the rationale of k Nearest Neighbor (kNN) text classification. (5%)

The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test document d to have the same label as the training documents located in the local region surrounding d.

c) The table below shows a set consisting of 10 documents, where each document is annotated with its corresponding class (A, B, C). We wish to use this set to find which class document X belongs to. The distance between document X and each of the documents is given in the table. Use majority voting kNN with k=5 to calculate which class document X belongs to. (5%)

Document         1     2     3     4     5     6     7     8     9     10
Class            A     A     B     A     C     C     B     C     B     B
Distance from X  5.31  2.89  0.63  1.12  3.52  4.31  2.33  1.35  0.89  2.15

The five closest documents are: 3 (B), 4 (A), 8 (C), 9 (B), 10 (B). B accounts for three out of the 5 closest documents, while A and C each account for only 1. Document X is annotated with class B.

d) With the table given in question c), use weighted sum voting kNN with k=5 to calculate which class document X belongs to. (5%)

The five closest documents are: 3 (B), 4 (A), 8 (C), 9 (B), 10 (B). The maximum distance 2.15 is used for normalization and calculation of weights.

A = 1 - 1.12/2.15 = 0.479
B = (1 - 0.63/2.15) + (1 - 0.89/2.15) + (1 - 2.15/2.15) = 1.293
C = 1 - 1.35/2.15 = 0.37
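Both voting schemes from c) and d) can be sketched in Python (an illustration only, using the distances from the table; the weight 1 - d/d_max follows the normalization in the solution):

```python
def knn_vote(neighbors, k=5, weighted=False):
    """neighbors: list of (class_label, distance_from_X) pairs.
    Majority voting gives each of the k nearest documents one vote;
    weighted-sum voting gives each a weight of 1 - d/d_max, where
    d_max is the largest distance among the k nearest."""
    nearest = sorted(neighbors, key=lambda pair: pair[1])[:k]
    d_max = max(d for _, d in nearest)
    votes = {}
    for label, d in nearest:
        votes[label] = votes.get(label, 0.0) + ((1 - d / d_max) if weighted else 1.0)
    return max(votes, key=votes.get)

distances = [5.31, 2.89, 0.63, 1.12, 3.52, 4.31, 2.33, 1.35, 0.89, 2.15]
data = list(zip("AABACCBCBB", distances))
print(knn_vote(data))                 # B (majority voting)
print(knn_vote(data, weighted=True))  # B (weighted sum voting)
```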

Document X is annotated with class B.

Question 4. Clustering (15%)

a) Explain the main steps of the k-means clustering algorithm (5%).

See chapter 4 in Chakrabarti: Mining the Web - Discovering Knowledge from Hypertext Data.

b) Assume the four documents below. Use the suffix tree clustering (STC) method explained in "Contextualized Clustering in Exploratory Web Search" and build a suffix tree for the documents (5%).

Doc 1: {gustave eiffel}
Doc 2: {eiffel, paris sightseeing}
Doc 3: {eiffel tower, paris attractions}
Doc 4: {alexandre gustave eiffel, paris}

alexandre gustave eiffel [4]
gustave eiffel [1,4]
paris [4]
    sightseeing [2]
    attractions [3]
eiffel [1,2,4]
    tower [3]
sightseeing [2]
attractions [3]
tower [3]
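Since part a) only points to the textbook, here is a minimal reminder-sketch of the usual k-means steps (choose k seed centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, repeat until stable); the 2-D points are made up for illustration:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on 2-D points: pick k seed centroids, then repeat
    (1) assign each point to its nearest centroid (squared Euclidean) and
    (2) move each centroid to the mean of its assigned points,
    until the centroids stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(kmeans(points, 2)))  # [(0.0, 0.5), (10.0, 10.5)]
```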

c) Identify the three base clusters and calculate their scores (5%).

We use the formula s(B) = |B| * f(|P|).

Base cluster     Documents   Score
eiffel           1, 2, 4     3
paris            2, 3, 4     3
gustave eiffel   1, 4        4

Question 5. Latent Semantic Analysis (5%)

Explain briefly the purpose of Singular Value Decomposition in latent semantic analysis (5%).

See the paper on latent semantic analysis.

Question 6. Semantic Web (15%)

a) What role does XML play in the Semantic Web? (3%)

[ ] Definition of OWL
[X] Serialization of OWL ontology models
[ ] Unified format for ontology mapping
[ ] Representation of ontology layout
[ ] Semantic markup language
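Question 5's answer refers to the LSA paper; as a brief numpy illustration (the toy term-document matrix is made up), SVD factors the matrix, and truncating to the k largest singular values gives the low-rank "latent concept" representation that LSA retrieves in:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# Terms 1-2 co-occur in docs 1-2, terms 3-4 in docs 3-4 (two "topics").
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# SVD factors A = U * diag(s) * Vt. Keeping only the k largest singular
# values projects terms and documents into a k-dimensional latent space,
# which is the dimensionality-reduction step LSA relies on.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of A

print(np.allclose(A, A_k))  # True: two latent concepts explain this matrix
```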

b) Which is the most expressive language that retains computational completeness and decidability? (3%)

[ ] XML
[ ] RDF
[ ] RDFS
[X] OWL DL
[ ] OWL Full

c) Below is an OWL definition described using the abstract syntax from Antoniou et al.

Class(student complete intersectionOf(human smart))
ObjectProperty(hasStudNo InverseFunctional)
Class(student partial
    restriction(read someValuesFrom(book))
    restriction(read allValuesFrom(scientificBook)))
Individual(john type(student))
Individual(john value(hasStudNo 12345))
Individual(mary value(hasStudNo 12345))

Phrase these definitions in natural language. (3%)

All students are smart humans. Things have unique student numbers. Students must read at least one book, and they read only scientific books. John is a student. Both John and Mary have student number 12345.

d) What cannot be inferred from the statements in c)? (3%)

[ ] Mary is a student
[ ] John and Mary are the same individual
[ ] Mary reads only scientific books
[ ] John is smart
[X] Each student has exactly one student number

100%: Correctly answered, otherwise 0%
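One of the inferences behind d) is that, because hasStudNo is inverse functional, John and Mary must denote the same individual. This can be illustrated with a toy Python check (the data model is made up for this sketch, not a real OWL reasoner):

```python
# Toy illustration of owl:InverseFunctionalProperty semantics: if a property
# is inverse functional, a given value identifies at most one individual,
# so two names sharing a value must denote the same individual.
facts = [("john", "hasStudNo", "12345"),
         ("mary", "hasStudNo", "12345")]
inverse_functional = {"hasStudNo"}

def inferred_same_as(facts, inverse_functional):
    """Return pairs of names that must denote the same individual."""
    by_value = {}
    same = set()
    for subj, prop, value in facts:
        if prop in inverse_functional:
            other = by_value.setdefault((prop, value), subj)
            if other != subj:
                same.add(tuple(sorted((other, subj))))
    return same

print(inferred_same_as(facts, inverse_functional))  # {('john', 'mary')}
```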

Question 7. Semantic Search (10%)

a) Recall the article "Indexing a Web Site with a Terminology Oriented Ontology" by Desmontils et al.; they mention the "cumulative similarity measure" by Desmontils & Jacquin. Why did they stem instead of lemmatize the words before they were annotated with part of speech tags? (2.5%)

[ ] Stemming is the process of finding the common root form of words and hence an ideal basis for a part of speech tagger;
[ ] Stemming is the process of finding the canonical form of words and hence an ideal basis for a part of speech tagger;
[ ] Lemmatization is the process of finding the canonical form of words and hence not suitable as a good basis for a part of speech tagger;
[ ] Lemmatization is the process of finding the common root form of words and hence not suitable as a good basis for a part of speech tagger;
[X] They did not use stemming, but lemmatization.

100%: Correctly answered, otherwise 0%

b) Recall the article "Construction of Ontology based Semantic Linguistic Feature Vectors for Searching: the Process and Effect" by Tomassen & Strasunskas. The development of the approach is inspired by a linguistic method for describing the meaning of objects, where a FV "connects" something. What does a feature vector "connect"? (2.5%)

[ ] An ontology property with associated textual documents;
[ ] An ontology property with associated domain terminology;

[ ] An ontology entity with associated textual documents;
[X] An ontology entity with associated domain terminology;
[ ] An ontology with associated textual documents;
[ ] An ontology with associated domain terminology.

100%: Correctly answered, otherwise 0%

c) Recall the article "Measuring intrinsic quality of semantic search based on Feature Vectors" by Tomassen & Strasunskas. A set of evaluation measures was proposed. What is the purpose of the Average FV Similarity measure? (2.5%)

[X] Provides an indication of the uniqueness of the FVs;
[ ] Provides an indication of the degree of semantic relatedness between neighbouring entities;
[ ] Provides an indication of the overall quality of the FVs;
[ ] Provides an indication of the FV quality relative to the ontology quality;
[ ] Provides an indication of the semantic distance between the FV terms.

100%: Correctly answered, otherwise 0%

d) What are ontologies normally not used for in semantic search applications? (2.5%)

[ ] Disambiguation of queries
[ ] Query expansion with synonyms
[ ] Semantic indexing with ontology concepts
[X] Latent semantic indexing
[ ] Semantic annotation of documents

100%: Correctly answered, otherwise 0%

Appendix A. Formulas