Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends




Internet-Scale Data Management

Querying Heterogeneous Datasets on the Linked Data Web: Challenges, Approaches, and Trends

The growing number of datasets published on the Web as linked data brings both opportunities for high data availability and challenges inherent to querying data in a semantically heterogeneous and distributed environment. Approaches used for querying siloed databases fail at Web-scale because users don't have an a priori understanding of all the available datasets. This article investigates the main challenges in constructing a query and search solution for linked data and analyzes existing approaches and trends.

André Freitas, Edward Curry, João Gabriel Oliveira, and Seán O'Riain
Digital Enterprise Research Institute

Published by the IEEE Computer Society. 1089-7801/12/$31.00 © 2012 IEEE. IEEE Internet Computing, January/February 2012.

The unprecedented availability of data promised by linked data [1] on the Web represents a major paradigm shift over the existing Web's structure. By building on Web infrastructure (URIs and HTTP), Semantic Web standards (such as the Resource Description Framework and RDF Schema [RDFS]), and vocabularies, linked data can effectively reduce barriers to data publication, consumption, and reuse, adding a rich layer of fine-grained, structured data to the Web. At its core, linked data exposes previously siloed databases as data graphs, which can be interlinked and integrated with other datasets, creating a global-scale interlinked dataspace.

However, linked data poses challenges inherent to querying highly heterogeneous and distributed data. To query linked data on the Web today, users must first be aware of which exposed datasets potentially contain the data they want and what data model describes these datasets, before using this information to create structured queries. This query paradigm is deeply attached to the traditional perspective of structured queries over databases and doesn't suit the linked data Web's heterogeneity, distributiveness, or scale. It's impractical to expect Web data consumers to have a previous understanding of available linked datasets' structure and location. Letting users expressively query relationships in the data while abstracting them from the underlying data model is a fundamental problem for Web-scale data consumption, which, if not addressed, will ultimately limit linked data's utility for consumers.

Figure 1. Querying data over the Web. We can see (a) a natural language query over two search engines; (b) the corresponding SPARQL representation; and (c) the semantic gap between the user's information needs and the data representation.
User query: From which university did the wife of Barack Obama graduate?
SPARQL query: PREFIX dbpedia: <http://dbpedia.org/resource/> PREFIX dbonto: <http://dbpedia.org/ontology/> SELECT ?university WHERE { dbpedia:Barack_Obama dbonto:spouse ?spouse . ?spouse dbonto:almamater ?university . }
Linked data shown: dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama; dbpedia:Michelle_Obama dbpedia-owl:almamater dbpedia:Princeton_University and dbpedia:Harvard_Law_School; rdfs:type links to dbpedia-owl:Educational_Institution and dbpedia-owl:University.

In addition to data model awareness, users querying linked data must master the syntax of structured query languages such as SPARQL. Most Web users aren't comfortable with structured queries, thus creating a usability barrier for the linked data Web. From a user perspective, natural language queries emerge as a simple and intuitive alternative. Previous investigations have empirically confirmed natural language queries' suitability for search and query tasks. [2]

This article provides a survey of existing approaches for searching and querying linked data on the Web, concentrating on how these approaches address the core challenges that emerge when heterogeneous datasets become exposed at Web-scale. Based on these challenges and approaches, this article also analyzes existing trends in the query space.

Linked data shares many of the objectives and challenges of dataspaces, [3] a concept that expresses the recurring demand for dealing with heterogeneous, loosely connected, and distributed data sources. Dataspaces, however, don't assume the support of linked data standards.
Despite this difference, linked dataspaces and generic dataspaces share more commonalities than differences, and the analysis provided in this article can be transferred to generic dataspaces.

Living in a Linked Dataspace

Linked data provides a data layer on the Web that represents objects and relations. The availability of Web-scale information in a structured and fine-grained representation could generate a paradigmatic shift in how applications and users consume data. Consider a journalist compiling a list of facts regarding public personalities and those personalities' previous academic affiliations. The journalist can express his or her information needs as natural language queries, such as "From which university did the wife of Barack Obama graduate?" Document search engines can't currently provide a level of semantic interpretation that could point directly to the final answer. With a traditional search engine, the journalist must navigate through the links and read the content of each candidate page the search engine returns. Modern search engines such as Wolfram Alpha, which relies on manually curated structured knowledge sources, don't provide a sufficiently comprehensive solution to answer this query (see Figure 1a).

The information that can answer this query is already available on the Web as linked data. However, to access it, users must know the datasets' location and structure, and the syntax of the SPARQL query language (see Figure 1b). Figure 1c shows the semantic gap between the user's information needs expressed in a generic natural language query and the data representation in the target dataset. The query's terms and structure differ from the data representation in the dataset.

The linked data Web already contains valuable data in diverse areas, such as e-government, e-commerce, and the biosciences. Additionally, the number of available datasets has grown solidly since its inception. [1] The provision of intuitive and flexible query mechanisms that can bring users closer to an unconstrained amount of data represents a fundamental challenge, which, if not addressed, could affect the linked data Web's growth and adoption.

Figure 2. The expressivity-usability trade-off for querying over structured data. The blue dots indicate that an ideal query mechanism for linked data must provide both high expressivity and high usability. (This figure was adapted from previous work. [2]) Axes: expressivity and usability. Approaches plotted: keyword search, entity-centric search, structure search, and SPARQL.

Challenges for Querying and Searching Linked Data

Search engines on today's Web are based on variations of the vector space model (VSM). This model's scalability and simplicity of use, based on keyword queries, defined its success as the de facto solution for search engines for the Web of documents. The VSM represents the contents of a collection of documents in a vector space built from terms present in the collection. Traditional VSM solutions lack the representation of structure information needed for data queries. This is reflected in their alternative name, bag-of-words approaches. In the (semi-)structured data world, the relationships between entities in a dataset are fundamental to the semantic model the dataset represents. Today, structured query languages are the standard way to query structured data. Structured queries are essentially built from two components: the query language's syntax and the elements (entities and relationships) of the data model behind the dataset.
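To make the bag-of-words limitation concrete, the following minimal sketch builds TF-IDF vectors and compares them with cosine similarity. It is an illustration only, not the ranking function of any particular search engine; the tokenizer, the smoothed IDF formula, and the three toy documents are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it on whitespace (toy tokenizer)."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "the wife of barack obama",
    "barack obama the wife of",      # same words, different order
    "princeton university admissions",
]
vectors = tf_idf_vectors(docs)
same_bag = cosine(vectors[0], vectors[1])       # word order is invisible to the model
different_bag = cosine(vectors[0], vectors[2])  # no shared terms
```

The first two documents receive identical vectors although their word order differs, which is the loss of structure information that the term "bag-of-words" refers to.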
The structured query approach fails on the linked data Web, however, because the Web's scale makes it infeasible for users to become aware of the structure of datasets to query them. Consequently, the linked data Web demands query approaches that can combine VSMs' usability and scalability with the expressivity required to query (semi-)structured data, bridging the semantic gap between users and the linked data Web. Figure 2 depicts the trade-off between expressivity and usability; existing approaches are positioned along an expressivity-usability spectrum. This trade-off is a consequence of the semantic gap for linked data queries. Ideally, a query mechanism for linked data must provide both high expressivity and high usability (the blue dots in the figure). It should also employ a level of semantic interpretation and matching not present in standard search and query approaches.

Previous works have proposed various solutions to address these challenges. To understand their strengths and limitations, we present five core challenge dimensions. Query expressivity is the ability to query datasets by referencing elements in the data model structure, as well as to operate over the data (aggregate results, express conditional statements, and so on). Usability allows for an easy-to-operate, intuitive, and task-efficient query interface. Vocabulary-level semantic matching is the ability to semantically match user query terms to dataset vocabulary-level terms. Entity reconciliation matches entities expressed in the query to semantically equivalent dataset entities. Semantic tractability mechanisms improve on the ability to answer queries not supported by explicit dataset statements (for example, "Is Natalie Portman an Actress?" can be supported by the statement "Natalie Portman starred Star Wars", instead of an explicit statement "Natalie Portman occupation Actress", which might not be present in the dataset). These challenges concentrate on the core usability and semantic aspects necessary to address the usability-expressivity trade-off.
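The query expressivity dimension can be illustrated with a small sketch of how a two-step structured query is evaluated over triples. This is a toy nested-loop matcher written for illustration, not a SPARQL engine; the triples are modelled on the article's running example, and the function names (`is_var`, `match_pattern`, `evaluate`) are hypothetical.

```python
# Toy triples modelled on the running example; the prefixes are plain strings here.
TRIPLES = {
    ("dbpedia:Barack_Obama", "dbonto:spouse", "dbpedia:Michelle_Obama"),
    ("dbpedia:Michelle_Obama", "dbonto:almamater", "dbpedia:Princeton_University"),
    ("dbpedia:Michelle_Obama", "dbonto:almamater", "dbpedia:Harvard_Law_School"),
}


def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; return None on failure."""
    new_binding = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if new_binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new_binding


def evaluate(patterns, triples):
    """Evaluate a conjunction of triple patterns with nested-loop joins."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


query = [
    ("dbpedia:Barack_Obama", "dbonto:spouse", "?spouse"),
    ("?spouse", "dbonto:almamater", "?university"),
]
universities = sorted(b["?university"] for b in evaluate(query, TRIPLES))
```

The query only succeeds because it uses exactly the identifiers found in the data (`dbonto:spouse`, not "wife"), which is the vocabulary awareness that the usability dimension tries to remove.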

Existing Approaches

Three high-level categories of approaches for querying linked data exist: approaches employing strategies inherited from the information retrieval (IR) space in which keyword search is mixed with elements from structured queries; approaches focusing on natural language queries; and structured SPARQL queries over distributed datasets. Here, we focus on the usability and semantic matching problems, thus analyzing approaches from the first two categories.

Information Retrieval Approaches

We can categorize IR approaches according to index type, which includes entity-centric search approaches and structure search approaches. Although both types provide hybrid search interfaces that merge keyword search with dataset structure elements, only structure search targets indexing strategies focusing on addressing the expressivity-usability trade-off at the index construction level.

Entity-centric search. Entity-centric approaches let users search for entities (instances and classes) in datasets, employing VSM variations to index those entities. Existing approaches range from less expressive queries, based on keyword search over textual information associated with the dataset entities, to star-shaped queries and hybrid queries (that is, queries mixing keyword search and structured queries centered on an entity). The Semantic Web Search Engine (SWSE) is a search and query service that implements an architecture with components for crawling, integrating, indexing, querying, and navigating over multiple data sources. [4] The system architecture's main components include query processing, ranking, an index manager, and an internal data store (YARS2), which focuses on scalability issues to enable federated queries over linked data. SWSE uses an approach called ReConRank to rank entities; [4] this approach adapts the PageRank algorithm to work over RDF datasets, propagating dataset-level scores computed from interlinking patterns to data-level entities.
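As background for link-based entity ranking, here is a minimal sketch of plain PageRank computed by power iteration over a small link graph. It is not ReConRank: it does not model datasets or propagate dataset-level scores, and the graph, damping factor, and iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration over a dict {node: [linked nodes]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its score evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical entity link graph: three entities link to ex:C, which links to ex:A.
links = {
    "ex:A": ["ex:C"],
    "ex:B": ["ex:C"],
    "ex:C": ["ex:A"],
    "ex:D": ["ex:C"],
}
ranks = pagerank(links)
top_entity = max(ranks, key=ranks.get)
```

Entities that receive many links accumulate score, and an entity linked from a high-scoring entity inherits part of that score, which is the intuition that link-analysis rankers over RDF graphs build on.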
The Scalable Authoritative OWL Reasoner (SAOR) provides an RDFS and partial Web Ontology Language (OWL) reasoning engine to address scalability issues. [4] SAOR applies reasoning only on dataset fragments supported by an authoritative ontological definition.

Sindice is a search and query service for the linked data Web that ranks entities according to the incidence of keywords associated with them. [5] It uses a node-labeled tree model to represent the relationship between datasets, entities, attributes, and values. Similarly to SWSE, Sindice provides a comprehensive entity-centric search and indexing approach. Figure 3a depicts Sindice's architecture.

Entity-centric search approaches have developed comprehensive data management strategies for linked data on the Web, providing the infrastructure for managing the complete crawl-index-search cycle. These approaches also developed services complementary to the entity-centric search process that let users either visually explore (via Visinav [4] and Sigma [5]) or execute full structured SPARQL queries over the crawled data. Entity-centric approaches avoid major changes in standard indexing strategies, inheriting index and search optimization mechanisms present in existing VSM frameworks. These approaches have avoided tackling the expressivity-usability trade-off by aggregating multiple query interfaces; in practice, to execute expressive queries, users must be aware of the vocabularies behind the datasets. In addition, most entity-centric approaches have only limited evaluation in terms of search result quality.

Structure search. Structure search engines improve keyword queries' expressivity, extending existing inverted list indexes to represent structure information present in datasets. The main difference between entity-centric search and structure search is that the latter improves expressivity with support from the extended index.
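The idea of extending an inverted index with structure information can be sketched as follows. This is a simplified illustration under stated assumptions, not the index layout of any system discussed in this article: it keeps one inverted list from name tokens to entities and a second one from (subject, predicate) pairs to objects, and the names `build_indexes` and `hybrid_search` are hypothetical.

```python
from collections import defaultdict

# Toy triples with shortened identifiers, used only for illustration.
TRIPLES = [
    ("Barack_Obama", "spouse", "Michelle_Obama"),
    ("Michelle_Obama", "almamater", "Princeton_University"),
    ("Michelle_Obama", "almamater", "Harvard_Law_School"),
    ("Barack_Obama", "almamater", "Columbia_University"),
]


def build_indexes(triples):
    """Build a keyword index and a relation index over a list of triples."""
    keyword_index = defaultdict(set)   # token -> entities whose name contains it
    relation_index = defaultdict(set)  # (subject, predicate) -> objects
    for s, p, o in triples:
        for entity in (s, o):
            for token in entity.lower().split("_"):
                keyword_index[token].add(entity)
        relation_index[(s, p)].add(o)
    return keyword_index, relation_index


def hybrid_search(keywords, path, keyword_index, relation_index):
    """Resolve keywords to entities, then follow a path of predicates."""
    candidate_sets = [keyword_index.get(k.lower(), set()) for k in keywords]
    frontier = set.intersection(*candidate_sets) if candidate_sets else set()
    for predicate in path:
        frontier = {
            o for s in frontier for o in relation_index.get((s, predicate), set())
        }
    return frontier


keyword_index, relation_index = build_indexes(TRIPLES)
# Hybrid query: keywords "barack obama", then the path spouse -> almamater.
result = hybrid_search(["barack", "obama"], ["spouse", "almamater"],
                       keyword_index, relation_index)
```

A keyword-only index could return the entity for "barack obama" but could not follow the `spouse` and `almamater` relations; the second inverted list is what adds the extra expressivity.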
The search engine Semplore uses a hybrid query formalism that combines keyword search with structured queries (that is, a subset of SPARQL). [6] Semplore uses position-based indexing to index relations and join triples. It relies on three types of inverted indexes: keyword, concept, and relation. Semplore also explores user feedback strategies for improving search, providing a faceted and navigational query interface. Figure 3b depicts Semplore's high-level architecture.

Figure 3. Examples of linked data search/query systems. We can see the high-level architecture components for (a) Sindice (entity-centric search), (b) Semplore (structure search), (c) FREyA (question answering), and (d) Treo (best-effort natural language query).
(a) Sindice. Queries: keyword ("Barack Obama"), star-shaped ("Barack Obama spouse ?x"), and SPARQL. Components: SPARQL interface, SPARQL engine, entity search, index, dataset indexer, reasoner, dataset updater, crawler. Output: ranked entities, SPARQL answer set.
(b) Semplore. Query: hybrid ("Barack Obama spouse almamater ?x ?y"). Components: query interface, DFS-based retrieval, relation-based ranking, facets construction, keyword index, concept index, relation index, indexer/updater. Output: ranked results.
(c) FREyA. Query: "From which university did the wife of Barack Obama graduate?" Components: query interface, disambiguation dialog, query parsing, ontology lookup, consolidation, answer type identification, SPARQL generation, SPARQL execution, mapping dialog. Output: SPARQL answer set.
(d) Treo. Query: "From which university did the wife of Barack Obama graduate?" Components: query interface, entity recognition, entity search, query parsing, spreading activation search, graph merge, entity indexing, semantic relatedness service, Wikipedia indexing. Output: ranked triple paths.

Xin Dong and Alon Halevy propose an approach for indexing triples to enable queries that combine keywords and dataset structure elements. [7] To provide a more flexible semantic matching, the authors propose four structured index types based on the introduction of additional structure information and semantic enrichment in the inverted lists. Taxonomies associated with the dataset vocabularies are used as a semantic enrichment strategy.

Structure search approaches target the expressivity-usability trade-off by modifying and extending traditional inverted index structures. They introduce a limited level of semantic matching by taking into account the terminology-level information present in datasets or by enriching the index with related terms using WordNet. No comprehensive evaluation of the search results' quality exists, making it unclear how these approaches perform in addressing the expressivity-usability trade-off.

Natural Language Approaches

Approaches in the literature based on natural language queries target query mechanisms with high usability and expressivity. Although some approaches focus on the question-answering (QA) problem, in which, similarly to databases, precise answers are expected as the output, others focus on a best-effort scenario that returns a ranked list of results.

Question answering. The investigation of QA systems focuses on the problem of allowing users to query data using natural language queries. As opposed to IR techniques' best-effort nature, QA systems target crisp answers, as with structured queries over databases. Work on QA approaches investigates the interpretation of users' information needs expressed as natural language queries, applying natural language processing (NLP) techniques to parse queries and match them with dataset structures. Substantial research efforts have focused on this problem. We look at two recent works on open domain linked data.

PowerAqua is a QA system that uses PowerMap, a hybrid matching algorithm comprising terminology-level and structural schema-matching techniques with the assistance of large-scale ontological or lexical resources. [8] In addition to the ontology structure, PowerMap uses WordNet-based similarity approaches as a semantic approximation strategy.

Exploring user interaction techniques, FREyA is a QA system that employs feedback and clarification dialogs to resolve ambiguities and improve the domain lexicon with users' help. [9] Compared to PowerAqua, FREyA delegates a large part of the semantic matching and disambiguation process to users. User feedback enriches the semantic matching process by allowing manual entries of vocabulary mappings. Figure 3c depicts FREyA's high-level architecture.
Compared to IR-based approaches, QA approaches aim toward more sophisticated semantic matching techniques because they target queries with high expressivity and don't assume users are aware of the dataset representations (high usability). In contrast to entity-centric and structure search approaches, QA systems have a strong tradition of evaluating results' quality, having concentrated less on performance and scalability issues. Traditionally, QA approaches have focused on limited semantic matching (WordNet-based) strategies, making them unable to cope with the Web environment's heterogeneity. Most QA approaches apply limited semantic matching techniques (for example, synonymic, taxonomic similarity) for matching query terms to dataset terms. In addition, they depend on resources that are manually created (WordNet) and difficult to expand across different domains.

Best-effort natural language interfaces. Some recent approaches aim to merge natural language queries' expressivity and usability with IR models' scalability and best-effort nature, targeting a best-effort natural language search mechanism. As in QA systems, users can still enter full natural language queries; however, instead of targeting crisp answers, these approaches return an approximate ranked list of results.

Treo is a natural language query mechanism for linked data that uses semantic relatedness measures derived from Wikipedia to match query terms to dataset terms. [10] The use of semantic relatedness measures allows the quantification of the semantic proximity between two terms, using information that is embedded in large textual resources available on the Web, such as Wikipedia. Wikipedia-based semantic relatedness measures address previous limitations of WordNet-based semantic matching. Treo's query approach combines entity search, spreading activation search, and semantic relatedness to navigate over the linked data Web graph, semantically matching the parsed user query to the data representation in the datasets. Figure 3d depicts Treo's components.
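The combination of a pivot entity, graph navigation, and relatedness-based matching can be sketched as below. This is a simplified illustration of the general idea, not Treo's actual algorithm: the graph is a toy, the relatedness scores are invented constants standing in for a corpus-derived measure, and the threshold and the function name `spread` are assumptions.

```python
# Toy data graph: entity -> list of (predicate, target entity).
GRAPH = {
    "Barack_Obama": [
        ("spouse", "Michelle_Obama"),
        ("birthPlace", "Honolulu"),
        ("party", "Democratic_Party"),
    ],
    "Michelle_Obama": [
        ("almamater", "Princeton_University"),
        ("almamater", "Harvard_Law_School"),
        ("birthPlace", "Chicago"),
    ],
}

# Hypothetical relatedness scores in [0, 1]; a real system would compute
# them from a large corpus instead of a hand-written table.
RELATEDNESS = {
    ("wife", "spouse"): 0.9,
    ("wife", "birthPlace"): 0.1,
    ("wife", "party"): 0.2,
    ("university", "almamater"): 0.8,
    ("university", "birthPlace"): 0.1,
}


def relatedness(term, predicate):
    """Look up the toy relatedness score between a query term and a predicate."""
    return RELATEDNESS.get((term, predicate), 0.0)


def spread(pivot, query_terms, threshold=0.5):
    """For each query term in turn, follow edges whose predicate is related enough."""
    paths = [(1.0, [pivot])]
    for term in query_terms:
        next_paths = []
        for score, path in paths:
            for predicate, target in GRAPH.get(path[-1], []):
                r = relatedness(term, predicate)
                if r >= threshold:
                    next_paths.append((score * r, path + [predicate, target]))
        paths = next_paths
    # Best-effort output: a ranked list of triple paths, not a single crisp answer.
    return sorted(paths, key=lambda item: item[0], reverse=True)


# "From which university did the wife of Barack Obama graduate?"
# Pivot entity: Barack_Obama; remaining query terms: wife, university.
ranked_paths = spread("Barack_Obama", ["wife", "university"])
```

Each returned path carries a score and the predicates that were followed, so a user can inspect why a result was proposed.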
In prior work, we generalized the principles of the Treo approach by constructing a distributional semantic space (T-Space) for linked datasets. [11] We built this space using a distributional semantic model based on statistical information derived from Wikipedia. This model enables flexible semantic matching in the search process (we discuss distributional models in more detail later). The definition of the T-Space provides a principled representation of datasets focused on addressing the expressivity-usability trade-off.

Best-effort natural language search approaches provide a more robust semantic matching approach. However, they relax expectations in terms of results, delegating the results' final assessment to end users. Similarly to QA systems, these approaches have concentrated on evaluating search results' quality.

Table 1 lists how each category addresses key usability and semantic matching challenges. It also summarizes existing approaches' strengths and limitations, depicting their complementary aspects. Finally, it analyzes how key features in existing systems can align to provide a comprehensive linked data query solution.

Table 1. Strategies employed by each approach to address existing linked data querying challenges.*

Approaches compared:
  Entity-centric (information retrieval approaches): SWSE/Visinav [4], Sindice/Sigma [5]
  Structure indexes (information retrieval approaches): Semplore [6], Dong and Halevy [7]
  Question-answering systems: PowerAqua [8], FREyA [9]
  Best-effort natural language search: Treo [10], Treo T-Space [11]
  SPARQL (structured queries)

Usability
  Entity-centric: High
  Structure indexes: Medium
  Question-answering systems: High
  Best-effort natural language search: High
  SPARQL: Low

Query expressivity
  Entity-centric: Keywords, star-shaped, [5] SPARQL
  Structure indexes: Keywords, conjunctive/path queries
  Question-answering systems: Natural language queries
  Best-effort natural language search: Natural language queries (no operators)
  SPARQL: High

Vocabulary-level semantic matching
  Entity-centric: No
  Structure indexes: Taxonomy indexing, descriptions, and associations enrichment; [6] WordNet synonym [7]
  Question-answering systems: WordNet, ontology structure, [8] user enrichment [9]
  Best-effort natural language search: Wikipedia-based semantic relatedness, Wikipedia Link Measure (WLM), [10] Explicit Semantic Analysis (ESA) [11]
  SPARQL: No

Entity reconciliation
  Entity-centric: owl:sameAs, OWL inverse functional properties
  Structure indexes: No
  Question-answering systems: Dataset look-up, user feedback
  Best-effort natural language search: TF/IDF (instances) and ESA (classes) [11]
  SPARQL: No

Improvement of semantic tractability
  Entity-centric: Contextual [5] and best-effort authoritative reasoning [4] (RDFS and OWL subset)
  Structure indexes: No
  Question-answering systems: WordNet-based similarity
  Best-effort natural language search: Wikipedia-based semantic relatedness, WLM, [10] ESA [11]
  SPARQL: No

*In the original table, cell shading reflects the level at which the proposed strategies address the challenges (light shading represents less coverage; dark shading represents greater coverage).
Taming Data Heterogeneity

Our analysis of the existing approaches and how they address the challenges of querying linked data over the Web defines a landscape for the key features likely present in search and query mechanisms over linked dataspaces. Seven key search and query features emerge from this analysis as clear trends. Table 2 summarizes each feature's impact in the various challenge dimensions. We grouped the features by three main architectural elements: user interaction and query interface, query processing and search, and index.

Table 2. Key search and query features and their impact on the set of challenges.
Architectural elements: user interaction and query interface; query processing and search; index.
Key features: complementary search and query services; user interaction and feedback mechanisms; best-effort query model; use of natural language processing techniques; distributional semantic model; use of external knowledge sources for semantic enrichment; integrated entity reconciliation techniques.
Challenges: usability; query expressivity; vocabulary-level semantic matching; entity reconciliation; improvement of semantic tractability.
Impact ratings: High High Medium Medium Medium High High Medium Medium High High Medium High High High Medium High Medium High High High High

Complementary Search and Query Services

Entity-centric search, keyword-based search, natural language queries, and structured SPARQL queries represent complementary search and query services that might suit users in different tasks and purposes. Search and query platforms should explore this complementary aspect with regard to heterogeneous data to enable users to switch among different search and query strategies. SWSE and Sindice are exploring this trend; however, the availability of natural language queries is a key feature not present in these systems. As part of the search and query features, users should be able to explore, understand, and refine search results by relying on navigational, browsing, and filtering capabilities integrated into the query process (this functionality is present in SWSE, Sindice, and Semplore).

User Interaction and Feedback Mechanisms

The presence of ambiguity and incomplete information is intrinsic to the search and query process. As already explored in systems such as FREyA and Semplore, user feedback can help resolve ambiguities, enrich an application's semantic model, and filter and post-process results.
Best-Effort Query Model

In "If You Have Too Much Data, then Good Enough Is Good Enough," [12] Pat Helland summarizes the mindset shift that must occur in heterogeneous and distributed data environments, where many still expect the accurate and crisp results common for siloed databases. The challenge of building query solutions with high usability and expressivity is coping with the data's heterogeneity at Web-scale; this demands relaxing our expectations of the results into a best-effort solution. Ranked lists of results in which users can assess those results' suitability are widely used in document search engines; Web users have been extensively exposed to this approach and are thus familiar with best-effort search models. However, although document search engines can potentially return a long list of candidate documents, best-effort query mechanisms for linked data should leverage the structure and types present in the data to target more concise answer sets. Note also the need to provide a supporting context around the answers that can help users assess the data's correctness. In the Treo approach, the path in the dataset generated during the querying process provides contextual information for users. A best-effort approach can coexist with database operations, such as aggregations, via data filtering mechanisms that let users remove incorrect entries from the results (for example, using the associated type information).

Natural Language Processing Techniques

For many years, the difficulties associated with the hard constraints of the QA problem have overshadowed the potential for applying NLP techniques for queries. NLP has developed a large set of techniques and tools for parsing and analyzing users' information needs expressed as natural language queries. Different flavors of syntactic parsers, morphological analyzers, and named entity recognition techniques are widely and effectively employed in different problems and in QA systems and natural language search interfaces (for example, PowerAqua, FREyA, Treo, and Treo T-Space). Recently, NLP techniques' efficacy was demonstrated in the IBM Watson system, [13] which outperformed its human contestants in a Jeopardy! challenge. Watson heavily leverages standard NLP techniques to build a complex information extraction and search pipeline. Search and query mechanisms can explore NLP techniques to provide expressive and intuitive query interfaces.

Distributional Semantic Model

The difficulty in effectively providing a robust semantic matching solution has been associated with a level of semantic interpretation that depends on fundamental and hard problems in artificial intelligence, such as commonsense knowledge representation and reasoning.
Recently, however, distributional approaches are emerging as grassroots solutions to provide robust matching by leveraging the information embedded in large Web corpora. Distributional models assume that the context surrounding a given word in a text provides important information about its meaning.[14] Distributional semantics focuses on constructing a representation of a word based on the statistical distribution of word co-occurrence in texts. The availability of high-volume and comprehensive Web corpora has made distributional models a promising approach for building and representing meaning. However, the simplification inherent in distributional models implies some constraints on their use as a representation. Distributional models are suitable for computing semantic relatedness, which can act as a best-effort solution for providing robust matching for linked data queries (present in the Treo T-Space system).

External Knowledge Sources for Semantic Enrichment
The availability of large amounts of unstructured text and structured data on the Web can help to bootstrap a level of interpretation based on available open and domain-specific knowledge. The volume of unstructured text needed to build distributional models can be obtained from comprehensive knowledge sources available on the Web, such as Wikipedia (present in the Treo and Treo T-Space systems). In addition, it is possible to use the semantically rich entity structure of data sources such as DBPedia (http://dbpedia.org), YAGO (www.mpi-inf.mpg.de/yago-naga/yago/), and Freebase (www.freebase.com) as a general-purpose entity and entity typing system that can easily be integrated into the target datasets to provide a minimum level of structured commonsense knowledge, which can later be used to improve interpretation and tractability. RDF's standardized graph-based format facilitates the reuse and integration of existing data sources into target datasets.
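
As a rough illustration of the distributional idea discussed under the Distributional Semantic Model trend, the following Python sketch builds word vectors from co-occurrence counts and uses cosine similarity as a relatedness score to rank candidate vocabulary terms against a query term. It is a minimal sketch under stated assumptions, not the Treo or Treo T-Space implementation: the four-sentence corpus, the window size, and the function names (`build_vectors`, `cosine`, `rank_candidates`) are all hypothetical, and a realistic system would draw on a corpus such as Wikipedia and a more refined weighting scheme.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for a large Web corpus (assumption).
CORPUS = [
    "he married his wife in paris",
    "he married his spouse in rome",
    "the city has a large population today",
    "the river crosses the city center",
]


def build_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            context = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query_term, candidates, vectors):
    """Return (candidate, relatedness) pairs, most related first (best-effort ranking)."""
    empty = Counter()
    query_vec = vectors.get(query_term, empty)
    scored = [(c, cosine(query_vec, vectors.get(c, empty))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    vectors = build_vectors(CORPUS)
    # Candidate terms play the role of dataset vocabulary (hypothetical names).
    for term, score in rank_candidates("wife", ["river", "spouse", "population"], vectors):
        print(f"{term}: {score:.2f}")
```

In this toy setting the query term "wife" is matched to the dataset term "spouse" even though the two strings share no characters, because they occur in similar contexts. The output is a ranked list rather than a single crisp answer, which is the best-effort behavior described earlier in this section.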
Integrated Reconciliation Techniques
Existing search and query approaches haven't fully integrated current solutions (for example, similarity-based ones) for entity reconciliation (ER) into the index construction process, leaving a functional gap that future query mechanisms must address by applying more principled ER solutions.

The emergence of heterogeneous and distributed Web-scale data environments, in contrast to small, controlled schema databases, fundamentally shifts how users query data. Our analysis of the state of the art shows that existing approaches based on IR and natural language interfaces have complementary features, which, if combined, can provide solutions to existing usability and matching challenges. Some of these features suggest important trends that will become key functionalities in future search and query mechanisms. The challenges involved in constructing effective query mechanisms for Web-scale data offer an opportunity to converge three very active research areas, bringing together databases, IR, and natural language processing. The results emerging from this convergence will profoundly affect how humans interact with information.

Acknowledgments
This work has been funded by Science Foundation Ireland under grant number SFI/08/CE/I1380 (Lion-2). We thank the reviewers and editors for their careful and valuable feedback.

References
1. T. Berners-Lee, "Linked Data Design Issues," 2009; www.w3.org/designissues/linkeddata.html.
2. E. Kaufmann and A. Bernstein, "Evaluating the Usability of Natural Language Query Languages and Interfaces to Semantic Web Knowledge Bases," J. Web Semantics: Science, Services, and Agents on the World Wide Web, vol. 8, 2010, pp. 377-393.
3. M. Franklin, A. Halevy, and D. Maier, "From Databases to Dataspaces: A New Abstraction for Information Management," SIGMOD Record, vol. 34, no. 4, 2005, pp. 27-33.
4. A. Hogan et al., "Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine," J. Web Semantics, to appear, 2011.
5. R. Delbru, S. Campinas, and G. Tummarello, "Searching Web Data: An Entity Retrieval and High-Performance Indexing Model," J. Web Semantics, to appear, 2011.
6. H. Wang et al., "Semplore: A Scalable IR Approach to Search the Web of Data," J. Web Semantics: Science, Services, and Agents on the World Wide Web, vol. 7, no. 3, 2009, pp. 177-188.
7. X. Dong and A. Halevy, "Indexing Dataspaces," Proc. 2007 ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2007, pp. 43-54.
8. V. Lopez, E. Motta, and V. Uren, "PowerAqua: Fishing the Semantic Web," Proc. 3rd European Semantic Web Conf. (ESWC 04), vol. 4011, Springer, 2004, pp. 393-410.
9. D. Damljanovic, M. Agatonovic, and H. Cunningham, "FREyA: An Interactive Way of Querying Linked Data Using Natural Language," Proc. 1st Workshop on Question Answering over Linked Data (QALD-1), Collocated with the 8th Extended Semantic Web Conf. (ESWC 11), 2011.
10. A. Freitas et al., "Querying Linked Data Using Semantic Relatedness: A Vocabulary Independent Approach," Proc. 16th Int'l Conf. Applications of Natural Language to Information Systems (NLDB 11), Springer, 2011, pp. 40-51.
11. A. Freitas et al., "A Multidimensional Semantic Space for Data Model Independent Queries over RDF Data," Proc. 5th IEEE Int'l Conf. Semantic Computing (ICSC 11), IEEE Press, 2011, pp. 344-351.
12. P. Helland, "If You Have Too Much Data, then Good Enough Is Good Enough," Comm. ACM, vol. 54, no. 6, 2011.
13. D. Ferrucci et al., "Building Watson: An Overview of the DeepQA Project," AI Magazine, vol. 31, no. 3, 2010, pp. 59-79.
14. P.D. Turney and P. Pantel, "From Frequency to Meaning: Vector Space Models of Semantics," J. Artificial Intelligence Research, vol. 37, 2010, pp. 141-188.

Andre Freitas is a PhD student at the Digital Enterprise Research Institute (DERI), National University of Ireland, Galway. His main research interests include search, linked data queries, and provenance. Freitas has a BSc in computer science from the Federal University of Rio de Janeiro (UFRJ). Contact him at andre.freitas@deri.org.

Edward Curry is a research leader at the Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, and an adjunct lecturer at NUI, Galway. His projects include studies of enterprise linked data, energy informatics, information management, and community-based data curation. Curry has a PhD from NUI, Galway. Contact him at ed.curry@deri.org.

João Gabriel Oliveira is a research intern at the Digital Enterprise Research Institute (DERI), National University of Ireland, Galway. His main research interests include natural language processing and search. Oliveira is finishing a BSc in computer science at the Federal University of Rio de Janeiro (UFRJ). Contact him at joao.deoliveira@deri.org.

Sean O'Riain leads the e-business domain at the Digital Enterprise Research Institute (DERI), National University of Ireland, Galway. His research interests include the application of natural language processing and Semantic Web technologies and standards in business information systems. O'Riain has an MSc in distributed information retrieval from the National University of Ireland, Galway. Contact him at sean.oriain@deri.org.

IEEE Internet Computing, January/February 2012 (www.computer.org/internet/)