Methods for accessing permanent information and their evaluation

1 Methods for accessing permanent information and their evaluation
Information Management Systems Research Group, Department of Information Engineering, University of Padua
Workshop: Postdoctoral Research in Informatics, 08 July 2015, Centro congressi A. Luciani, Padova

2 Main Aspects of Interest
- Data Model
- Data Access and Sharing
- Evaluation
[Figure: data model graph. Classes: Namespace, Identifiable Resource, Resource, Evaluation Activity, Track, Run, Concept, User, Quality Parameter, Measurement, Measure, Descriptive Statistic, Statistic. Properties: ims:ispartof, ims:consistsof, ims:iscomposedby, ims:submittedto, ims:evaluates, ims:expressassessment, ims:isevaluatedby, ims:isassociatedto, ims:isassignedto, ims:ismeasuredby, ims:describes.]

3 Structure, Access, Query, Evaluation
- Data: Unstructured, Semi-Structured, Structured (relational database)
- Search/Access Paradigm: Best-Match, Hybrid, Exact-Match
- Query: Natural Language, Keywords, XPath/XQuery, SPARQL
- Evaluation: Effectiveness (Accuracy, Precision, Recall), Efficiency (Time, Space)

4 The use of semi-structured data (XML)
- The use of XML is widespread in many sectors of everyday life
- Cultural heritage data: libraries, archives and museums
- Health data: protein sequences, pharmaceutical research
- Geographical data
- Linguistics data: Treebank, part-of-speech tagging and annotation
- Heterogeneous scientific datasets

5 NESTOR: Set-Based Approach to Access Hierarchical Data
XML example:
  <a1> text
    <a2>
      <a4> text </a4>
      <a5> text </a5>
      <a6> text </a6>
    </a2>
    <a3> text </a3>
  </a1>
[Figure: the same hierarchy shown as a Tree (nodes a1 to a6), in the Nested Sets Model, and in the Inverse Nested Sets Model (sets A1 to A6 over elements a to g).]
- M. Agosti, N. Ferro, and G. Silvello (2011). Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries. Studies in Computational Intelligence vol. 375, Springer.
- N. Ferro and G. Silvello (2013). NESTOR: A Formal Model for Digital Archives. Information Processing & Management (IP&M), 49(6), Elsevier.
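The two set-based models can be illustrated with a small sketch. It assumes, for simplicity, that each node contributes exactly one element (its own name), which is not the element assignment shown in the slide figure; the function names and the CHILDREN dictionary are hypothetical. In the Nested Sets Model a node's set contains the sets of its descendants, while in the Inverse Nested Sets Model a node's set contains the sets of its ancestors.

```python
from typing import Dict, FrozenSet, List

# Tree from the XML example: a1 has children a2 and a3; a2 has children a4, a5, a6.
# Assumption: one element per node, named after the node itself.
CHILDREN: Dict[str, List[str]] = {
    "a1": ["a2", "a3"],
    "a2": ["a4", "a5", "a6"],
    "a3": [], "a4": [], "a5": [], "a6": [],
}

def nested_sets(children: Dict[str, List[str]], root: str) -> Dict[str, FrozenSet[str]]:
    """NSM sketch: a node's set holds its own element plus those of all descendants."""
    family: Dict[str, FrozenSet[str]] = {}

    def visit(node: str) -> FrozenSet[str]:
        members = {node}
        for child in children[node]:
            members |= visit(child)
        family[node] = frozenset(members)
        return family[node]

    visit(root)
    return family

def inverse_nested_sets(children: Dict[str, List[str]], root: str) -> Dict[str, FrozenSet[str]]:
    """INSM sketch: a node's set holds its own element plus those of all ancestors."""
    family: Dict[str, FrozenSet[str]] = {}

    def visit(node: str, inherited: FrozenSet[str]) -> None:
        family[node] = inherited | {node}
        for child in children[node]:
            visit(child, family[node])

    visit(root, frozenset())
    return family

nsm = nested_sets(CHILDREN, "a1")
insm = inverse_nested_sets(CHILDREN, "a1")
```

With this encoding the root has the largest set in the NSM and the smallest set in the INSM, so hierarchy questions become subset tests in either direction.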

6 Efficient Implementations of NESTOR
- XPath Axes: Descendants, Ancestors, Children, Parent
- Set-Wise and Element-Wise Query Primitives
- Data Structures: Direct Data Structure (DDS), Inverse Data Structure (IDS), Hybrid Data Structure (HDS)
- NESTOR Model: Nested Sets Model (NSM), Inverse Nested Sets Model (INSM)

Structure     DDS      IDS      HDS
Descendants   O(m)     O(1)     O(m)
Ancestors     O(1)     O(m)     O(m)
Parent        O(1)     O(m)     O(1)
Children      O(1)     O(1)     O(1)

Content       DDS      IDS      HDS
Descendants   O(1)     O(m+n)   O(m+n)
Ancestors     O(m+n)   O(1)     O(m+n)
Parent        O(n)     O(m+n)   O(n)
Children      O(n)     O(n)     O(1)
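As an illustration of how the four XPath axes reduce to set-inclusion tests in the Nested Sets Model, here is a naive sketch. It is not the DDS, IDS or HDS implementation and does not achieve the costs in the tables above: every function simply scans the whole family of sets. The NSM dictionary and the function names are hypothetical, and one element per node is assumed.

```python
from typing import Dict, FrozenSet, Optional, Set

# Hypothetical NSM family for the six-node example tree (one element per node).
NSM: Dict[str, FrozenSet[str]] = {
    "A1": frozenset({"a1", "a2", "a3", "a4", "a5", "a6"}),
    "A2": frozenset({"a2", "a4", "a5", "a6"}),
    "A3": frozenset({"a3"}),
    "A4": frozenset({"a4"}),
    "A5": frozenset({"a5"}),
    "A6": frozenset({"a6"}),
}

def descendants(family: Dict[str, FrozenSet[str]], k: str) -> Set[str]:
    """Descendants of k: all sets that are proper subsets of k's set."""
    return {j for j, s in family.items() if s < family[k]}

def ancestors(family: Dict[str, FrozenSet[str]], k: str) -> Set[str]:
    """Ancestors of k: all sets that are proper supersets of k's set."""
    return {j for j, s in family.items() if family[k] < s}

def parent(family: Dict[str, FrozenSet[str]], k: str) -> Optional[str]:
    """Parent of k: the smallest proper superset, or None for the root."""
    supersets = ancestors(family, k)
    return min(supersets, key=lambda j: len(family[j])) if supersets else None

def children(family: Dict[str, FrozenSet[str]], k: str) -> Set[str]:
    """Children of k: descendants with no other descendant of k strictly between."""
    desc = descendants(family, k)
    return {j for j in desc if not any(family[j] < family[i] for i in desc)}
```

For example, `children(NSM, "A1")` keeps A2 and A3 but drops A4, A5 and A6 because each of them is strictly contained in A2.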

7 Efficiency Evaluation: Space and Time
Execution times of XPath-based queries over the Wikipedia collection from INEX (à la TPC benchmark)
[Figure: Descendants, element-wise primitive evaluation; time in msec, log scale; DDS, IDS and HDS compared with Xalan, Jaxen, JXPath and BaseX.]
Structure query templates: Descendants, Union Descendants, Intersection Descendants

8 Efficiency Evaluation: Space and Time
[Figure: Average Index Building Time (Wikipedia XML files)]
[Figure: Average Occupied Memory (Wikipedia XML file)]

9 The Other Side of Evaluation: Effectiveness
- User-oriented evaluation
- From structured queries to information needs
  Structured query: SELECT name FROM hotel WHERE city = 'Padua'
  Information need: "I need a comfortable accommodation in Padua, Italy"
- From set of results to ranked lists ordered by relevance
  Example ranked list: Methis Hotel Toscanelli Best Western Premier Sheraton Hotel
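The contrast between an exact-match result set and a best-match ranked list can be sketched with toy data. Everything here is hypothetical: the hotel records are invented for illustration, and plain term overlap stands in for a real retrieval model.

```python
from typing import Dict, List, Set

# Hypothetical records; names, cities and descriptions are made up for illustration.
HOTELS: List[Dict[str, str]] = [
    {"name": "Hotel A", "city": "Padua", "description": "comfortable rooms near the city centre"},
    {"name": "Hotel B", "city": "Padua", "description": "budget hostel with shared rooms"},
    {"name": "Hotel C", "city": "Venice", "description": "comfortable accommodation close to Padua"},
]

def exact_match(records: List[Dict[str, str]], city: str) -> Set[str]:
    """Structured query: an unordered set of records that satisfy the predicate."""
    return {r["name"] for r in records if r["city"] == city}

def best_match(records: List[Dict[str, str]], need: str) -> List[str]:
    """Keyword query: records scored by term overlap and returned as a ranked list."""
    terms = set(need.lower().split())

    def score(record: Dict[str, str]) -> int:
        text = f'{record["name"]} {record["city"]} {record["description"]}'
        return len(terms & set(text.lower().split()))

    ranked = sorted(records, key=score, reverse=True)  # stable sort keeps ties in input order
    return [r["name"] for r in ranked if score(r) > 0]

result_set = exact_match(HOTELS, "Padua")
ranking = best_match(HOTELS, "comfortable accommodation in Padua")
```

The exact-match query returns only records whose city field equals "Padua", with no order among them. The keyword query can rank a record outside that set first when it overlaps more with the stated need, which is why effectiveness has to be judged against relevance rather than against a predicate.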

10 Effectiveness-Oriented Evaluation
- Evaluation is a demanding activity, carried out in international evaluation campaigns to share the effort and compare experiments
- We designed a visual analytics tool to ease evaluation work and reduce the effort required to carry out performance, failure and what-if analyses
- M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2014). A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis, Journal of Visual Languages and Computing, 25(4), Elsevier.

11 Effectiveness-Oriented Evaluation
- The most common effectiveness measures evaluate the utility achieved by the user
- We propose the Twist measure to evaluate the effort required of the user
[Figure: Effort vs Gain, nDCG plotted against Twist; TREC 10, 2001, Web. Panels: (a) Huge effort, (b) High effort, (c) Medium effort, (d) Low effort. Labelled runs include jscbtawtl4 (topics 539 and 544), hum01t (topic 544) and pir1wt2 (topic 504).]
- N. Ferro, G. Silvello, H. Keskustalo, A. Pirkola and K. Järvelin (2015). The Twist Measure for IR Evaluation: Taking User's Effort into Account, Journal of the Association for Information Science and Technology, John Wiley & Sons, Inc. (in print)
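For reference, the gain axis of the figure (nDCG) can be computed as in the sketch below. It assumes the common variant with a log2(rank + 1) discount and graded relevance gains; the slide does not state which nDCG configuration was used. The Twist measure itself is not sketched because its definition is not given here. Function and parameter names are illustrative.

```python
import math
from typing import Optional, Sequence

def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulated gain with a log2(rank + 1) discount (ranks start at 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains: Sequence[float], judged_gains: Optional[Sequence[float]] = None) -> float:
    """DCG of the run divided by the DCG of the ideal ranking at the same cut-off.

    judged_gains: gains of all judged documents for the topic (assumption: when
    omitted, the ideal ranking is built from the run's own gains).
    """
    pool = judged_gains if judged_gains is not None else gains
    ideal = sorted(pool, reverse=True)[: len(gains)]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0
```

A run that places its most relevant documents first scores 1.0; the same documents in reverse order score lower because high gains are discounted more heavily at deeper ranks.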

12 Experimental Evaluation and Reproducibility
- Experimental evaluation enables repeatability, reproducibility and generalization of the experiments
- Experiments and findings are connected to scientific papers describing them: actionable papers
- N. Ferro and G. Silvello (2015). Rank-Biased Precision Reloaded: Reproducibility and Generalization. Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, Springer.

13 Actionable Papers
[Example markup embedded in a paper, with URLs truncated in the transcription:]
<img src= 017c333a-4b7c d-f15fe3554efd/ 177bcef2-00a0-4f59-b781-f285610f1c6f
<a href= 017c333a-4b7c d-f15fe3554efd >

14 Data Citation

Hierarchical Data (XML) Citation

  <iuphar>
    <name>iuphar-db </name>
    <citation>rule0</citation>
    [...]
    <gpcr>
      <name>g protein-coupled receptors</name>
      <citation>rule1</citation>
      [...]
      <family>
        <id>29</id>
        <name>glucagon receptor family</name>
        <citation>rule2</citation>
        <receptor>
          <id>247</id>
          <name>ghrh</name>
          [...]
          <agonists> <ligand> [...] </ligand> </agonists>
          [...]
        </receptor>
        [...]
      </family>
      [...]
    </gpcr>
    <ionchannels> [...] </ionchannels>
  </iuphar>

Rules:
- iuphar[name=$.d,url=$.u, version=$.v] (the third rule interpreted by the system)
- iuphar[]/gpcr[name=$.n] (the second rule interpreted by the system)
- iuphar[]/gpcr[]/family[name=$.f,id=$.i] /contributors[]/contributor[name=$?c] (the first rule interpreted by the system)

The rules are recursively processed by the system and then transformed into a conjunction of XPaths. The interpretation of the XPaths generates the citation.

Instantiation of the variables: {database=$d, version=$v, contributors=$c, db-family=$n, family=$f, idfamily=$i}

The citation that gets generated (example):
  { database=iuphar-db: the IUPHAR database
    url=
    version=15
    dbfamily=g protein-coupled receptors
    family=glucagon receptor family
    idfamily=29
    contributor= {Laurence J. Miller;;Daniel J. Drucker;;[...];;Rebecca Hills;;}}

Graph Data (RDF) Citation

[Figure: citation graph cit-sysa-CLEF2009 with named graphs n1 to n5 over the nodes ex:systema, ex:expa, ex:measurea and CLEF 2009, linked by ex:produce, ex:measure, ex:submitted-to, ex:name, ex:value and schema:is-related-to; the citation node carries dc:creator, dc:date and dc:title.]

Human-readable reference:
  John Doe and Marco Rossi, "SystemA performances at CLEF 2009", 08 July 2014, <[...]>

Machine-readable reference:
  Subject                Property     Object          Name
  ex:cit-sysa-clef2009   dc:creator   "John Doe"      [...]
  ex:cit-sysa-clef2009   dc:creator   "Marco Rossi"   [...]
  ex:cit-sysa-clef2009   dc:date      [...]           [...]
  ex:cit-sysa-clef2009   dc:title     "SystemA..."    [...]

Machine-readable citation meta-graph:
  Subject   Property               Object   Name
  ex:n1     schema:is-related-to   ex:n2    ex:cit-sysa-clef2009
  ex:n1     schema:is-related-to   ex:n3    ex:cit-sysa-clef2009
  ex:n2     schema:is-related-to   ex:n4    ex:cit-sysa-clef2009
  ex:n2     schema:is-related-to   ex:n5    ex:cit-sysa-clef2009

Original cited LOD subset:
  Subject       Property          Object        Name
  ex:systema    ex:produce        ex:expa       ex:n1
  ex:expa       ex:measure        ex:measurea   ex:n2
  ex:expa       ex:submitted-to   ex:clef2009   ex:n3
  ex:measurea   ex:name           "precision"   ex:n4
  ex:measurea   ex:value          "0.7"         ex:n5

- P. Buneman and G. Silvello (2010). A Rule-Based Citation System for Structured and Evolving Datasets. Bulletin of the Technical Committee on Data Engineering, 3(3):33-41.
- G. Silvello (2015). A Methodology for Citing Linked Open Data Subsets, D-Lib Magazine, 21(1/2).
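The idea of turning citation rules into path expressions that fill citation variables can be sketched as follows. This is a simplified illustration, not the rule language or the system described on the slide: the rules are written directly as ElementTree path expressions, the XML is a reduced well-formed fragment of the example, and the contributor, url and version variables are left out because their values are elided in the transcription. The RULES mapping and build_citation are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Reduced, well-formed fragment of the slide's IUPHAR example.
XML = """
<iuphar>
  <name>iuphar-db</name>
  <gpcr>
    <name>g protein-coupled receptors</name>
    <family>
      <id>29</id>
      <name>glucagon receptor family</name>
    </family>
  </gpcr>
</iuphar>
"""

# Hypothetical stand-in for the rules: citation variable -> path from the root.
RULES: Dict[str, str] = {
    "database": "./name",
    "db-family": "./gpcr/name",
    "family": "./gpcr/family/name",
    "idfamily": "./gpcr/family/id",
}

def build_citation(xml_text: str, rules: Dict[str, str]) -> Dict[str, str]:
    """Evaluate each path on the document and collect the matched text values."""
    root = ET.fromstring(xml_text)
    citation: Dict[str, str] = {}
    for variable, path in rules.items():
        node = root.find(path)  # first match only; enough for this fragment
        if node is not None and node.text:
            citation[variable] = node.text.strip()
    return citation

citation = build_citation(XML, RULES)
```

Because each variable is bound by a path that starts at the root, citing a deeper node (a family rather than the whole database) simply adds more bound variables to the same citation.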

15 Future Directions
- Data modeling, sharing and enriching via the Linked (Open) Data paradigm
- Data citation methods for evolving datasets
- Effectiveness-oriented evaluation of keyword-based systems over structured data

16 Selected Publications
- N. Ferro, G. Silvello, H. Keskustalo, A. Pirkola and K. Järvelin (2015). The Twist Measure for IR Evaluation: Taking User's Effort into Account, Journal of the Association for Information Science and Technology, in print.
- G. Silvello (2015). A Methodology for Citing Linked Open Data Subsets, D-Lib Magazine, 21(1/2). DOI: /january2015-silvello
- M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2014). A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis, Journal of Visual Languages and Computing, 25(4).
- E. Di Buccio, G. Di Nunzio, and G. Silvello (2014). A Linked Open Data Approach for Geolinguistics Applications, International Journal of Metadata, Semantics and Ontologies, 9(1).
- N. Ferro and G. Silvello (2013). NESTOR: A Formal Model for Digital Archives. Information Processing & Management (IP&M), 49(6).
- E. Di Buccio, G. Di Nunzio and G. Silvello (2013). A Curated and Evolving Linguistic Linked Dataset. Semantic Web, 4(3), P. Hitzler and K. Janowicz eds.
- P. Buneman and G. Silvello (2010). A Rule-Based Citation System for Structured and Evolving Datasets. IEEE Bulletin of the Technical Committee on Data Engineering, 3(3):33-41.

17 Selected Publications
- N. Ferro and G. Silvello (2015). Rank-Biased Precision Reloaded: Reproducibility and Generalization. In Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, Springer.
- M. Angelini, N. Ferro, G. Santucci and G. Silvello (2015). Tutorial: Visual Analytics for Information Retrieval Evaluation (VAIRË 2015). In Proc. of the 37th European Conference on Information Retrieval (ECIR 2015), LNCS 9022, Springer.
- N. Ferro and G. Silvello (2014). CLEF 15th Birthday: What Can We Learn From Ad Hoc Retrieval? In Proc. of the Information Access Evaluation. Multilinguality, Multimodality, and Interaction - 5th International Conference of the Cross-Language Evaluation Forum (CLEF 2014), LNCS 8685, Springer.
- M. Angelini, N. Ferro, G. Santucci and G. Silvello (2014). A Visual Interactive Environment for Making Sense of Experimental Data. In Proc. of the 36th European Conference on Information Retrieval (ECIR 2014), LNCS 8416, Springer.
- E. Di Buccio, G. M. Di Nunzio and G. Silvello (2013). A Geolinguistic Web Application Based on Linked Open Data. In Proc. of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '13), ACM, New York, NY, USA.
- N. Ferro and G. Silvello (2013). Formal Models for Digital Archives: NESTOR and the 5S. In Proc. of the Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries (TPDL 2013), LNCS 8092, Springer.

18 Selected Publications
- M. Angelini, N. Ferro, G. Granato, G. Santucci and G. Silvello (2012). Information Retrieval Failure Analysis: Visual Analytics as a Support for Interactive What-If Investigation. In Proc. of the IEEE Conference on Visual Analytics Science and Technology (VAST 2012), IEEE Computer Society, USA.
- M. Angelini, N. Ferro, G. Santucci, and G. Silvello (2012). Visual Interactive Failure Analysis. In Proc. of the Fourth Information Interaction in Context Symposium (IIiX 2012), ACM Press, New York, USA.
- M. Agosti, N. Ferro, and G. Silvello (2011). Handling Hierarchically Structured Resources Addressing Interoperability Issues in Digital Libraries. In Learning Structure and Schemas from Documents. Studies in Computational Intelligence vol. 375.
- N. Ferro and G. Silvello (2009). The NESTOR Framework: How to Handle Hierarchical Data Structures. In Proc. of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009), LNCS 5741, Springer-Verlag.
- M. Agosti, N. Ferro and G. Silvello (2009). Access and Exchange of Hierarchically Structured Resources on the Web with the NESTOR Framework. In Proc. of the IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2009, IEEE Computer Society.