ART Lab infrastructure for semantic Big Data processing

Manuel Fiorelli, Maria Teresa Pazienza, Armando Stellato, Andrea Turbati
ART Research Group, Dept. of Enterprise Engineering (DII), University of Rome Tor Vergata, Via del Politecnico 1, 00133 Rome, Italy
{fiorelli, pazienza, stellato, turbati}@info.uniroma2.it

Abstract. In this paper we briefly describe the ART Lab infrastructure for semantic Big Data processing. Our most relevant contribution is the definition of an architecture supporting ontology development driven by knowledge acquired from heterogeneous resources, such as documents and web pages. The overall perspective is to propose a gluing architecture driving and supporting the entire flow of information, from data acquisition from external heterogeneous resources to their exploitation for RDF triplification. In such an architecture, the unstructured content analysis capabilities of frameworks such as UIMA are integrated into a coordinated environment supporting the processing, transformation and projection of the produced metadata into RDF semantic repositories, which are managed by Semantic Turkey, our platform for Knowledge Acquisition and Management. Further contributions relate to the possibility of easily managing large resources (e.g., thesauri, vocabularies, etc.) and of supporting end users in sharing the logic underlying the reasoning processes.

Keywords: big data; semantic processing; architecture

I. INTRODUCTION

We are exposed daily to a huge volume of information, which is fired from a plethora of channels, encoded in different formats and made available/accessible in more or less opaque ways. Efficient Information Management and, before that, Information Gathering are thus becoming fundamental aspects of every organization exposed to a considerable flow of data. The Linked Data [1, 2] paradigm allows information to be published in the form of interlinked datasets, although much of the content on the Web still does not conform to these uniform publication standards. Modeling principles for the Web of Data have reached a high maturity; non-trivial aspects associated with knowledge elicitation, such as the identity of resources and its resolution, have been widely discussed in the literature, and they would deserve a coherent handling instead of ad hoc solutions.

Unstructured, structured and partially structured information accessible on the net (and produced by devices of all types) is filling up our storage systems more and more, while organizations are strongly interested in managing such a big amount of information in a relatively short processing time. Data warehouse appliances are considered a solution; nonetheless, a few questions arise: what about unstructured data (the so-called user-generated content, mainly texts in natural language) that carries so much information? What about relying on suitable and efficient architectures and new technologies (tools and processes) that an organization must be familiar with in order to handle such a big amount of data? What about support for business professionals involved in integrating structured big data with context-aware big data? It is a matter of fact that currently only a few professionals have mature enough competences to do so: the required skills should be identified and appropriately included in dedicated curricula.

Attention must not be limited to the management of large repositories, statistical analysis and data warehousing; among other things, logic is required to orchestrate the sharing of different kinds of information among different business applications. Semantics plays a relevant role in information management to support and identify trends: in fact, the interest lies more in capturing hints about the future than in assembling information about the past. Special attention has been devoted to this aspect in setting up the ART Lab infrastructure.

II. STATE OF THE ART

Integrated frameworks for developing innovative applications are becoming ever more interesting for private and public, industrial and scientific, market and governmental organizations; semantic Big Data processing currently stands as one of the most promising research frontiers on the Web. In the field of Knowledge Acquisition, and with particular reference to the development of Semantic Web datasets, emerging approaches aim at supporting users in acquiring and collecting relevant unstructured information from various media channels, and at synthesizing it into structured RDF data.

OntoLearn [3] provided a methodology, algorithms and a system for performing different ontology learning tasks. The Protégé plugin OntoLT [4] was the first approach exploiting the interactive development capabilities of an ontology editor for supporting the knowledge acquisition process, viewing acquisition, management and validation of knowledge as indissoluble user experiences. Text2Onto [5] instead embodied a first attempt to realize an open architecture for the management of the (semi-)automatic ontology evolution process. All of the previous systems are concerned with the task of Ontology Learning, which usually deals with the evolution of an OWL vocabulary. More recently, the extraction of ground data about facts, which can be modeled according to different existing vocabularies, has become a leading interest for the development of Linked Open Data. LODifier [6] is a proof-of-concept implementation that converts open-domain unstructured natural language into Linked Data, using NLP techniques such as Named Entity Recognition (NER), Word Sense Disambiguation (WSD) and deep semantic analysis, and embedding the RDF output in the Linked Open Data (LOD) cloud. Entity Extraction [7] is an end-to-end system that extracts entity relations from plain text and attempts to map the extracted triples onto the DBpedia namespace.

Ontology Learning has very clear and well-defined objectives, such as creating new classes, establishing hierarchical relationships, defining axioms, etc., which allowed wider experimentation in creating tools, such as the previously cited OntoLT, delivering a user-centric experience in acquiring and modeling new ontologies. The acquisition of ground facts, on the contrary, requires a rich vertical exploration of a given domain, with systems being able to recognize pertinent information and to relate the identified entities according to the most appropriate relations defined over the domain. Defining a common umbrella for these vertical operations and, consequently, designing interaction modalities for these systems are not easy tasks. Such systems should thus probably follow a divide-et-impera approach, delegating to other systems the capabilities for content extraction, triplification, identity resolution and alignment, while simultaneously being able to interact with humans in checking, validating and possibly improving the acquired content.

A few attempts at empowering information extraction architectures with data projection mechanisms are represented by the RDF UIMA CAS Consumer (http://uima.apache.org/downloads/sandbox/rdf_cc/rdfcasconsumeruserguide.html), the Clerezza-UIMA integration (http://incubator.apache.org/clerezza/clerezza-uima/), Apache Stanbol (http://stanbol.apache.org/) and CODA (Computer-aided Ontology Development Architecture) [8]. The scope of the first three approaches is limited to performing a mere vocabulary-agnostic RDF serialization of the CAS (Common Analysis Structure), not including any machinery for projecting information towards specific vocabularies. Stanbol has been addressing this aspect through a specific component called the Refactor Engine (http://stanbol.apache.org/docs/trunk/components/rules/refactor.html), while CODA has instead been natively supporting the whole process through a transformation language, PEARL (ProjEction of Annotations Rule Language) [9], which can be extended with functions invoking dedicated components for all the tasks related to knowledge acquisition, evolution, management and reasoning over the data.

A deep analysis and description of the CODA framework can be found in [8]; we refer to that publication for a complete description. In line with the objective of supporting quick and easy development of new applications, the proposed platform is oriented towards a wide range of beneficiaries, from semantic application developers to final users who can easily plug CODA components into compliant desktop tools.

III. BIG DATA SEMANTIC PROCESSING INFRASTRUCTURE

The overall perspective of the ART Lab research effort has been the definition of a gluing architecture driving and supporting the entire flow of information, from data acquisition from heterogeneous resources to their exploitation for RDF triplification, combined with effective visualizations. In such a context, specific Big Data matters (such as variability, volume, velocity, veracity and visualization) are dealt with. In our architecture, depicted in Fig. 1, the unstructured content analysis capabilities of frameworks such as UIMA (Unstructured Information Management Architecture) [10] are integrated into a coordinated environment, which supports the processing, transformation and projection of the produced metadata into RDF semantic repositories, managed by Semantic Turkey (http://semanticturkey.uniroma2.it/) [11], our platform for Knowledge Acquisition and Management.

Fig. 1. ART Lab infrastructure for semantic Big Data processing: extraction (UIMA) produces annotated content; triplification (CODA) produces annotated RDF triples; validation and enhancement (CODA-Vis) combine user selection of RDF resources with ontology-based suggestions; reasoning and visualization (HORUS) operate over the reference RDF semantic repository; and big resources are managed through collaborative and distributed editing by a community of editors in VocBench.

Among other things, we had the non-trivial objective of realizing a platform for the management of big RDF datasets, starting from an underspecified knowledge model, while triple store technologies were still in their infancy. In fact, this is a typical context in which Big Data applications have to be developed. Dedicated support is required for distributed and collaborative editing as well as for change validation within a formalized editorial workflow. Special attention to user roles and editing rights is also required. A few ontology-editing platforms are already available, but usually they support neither collaboration nor formalized workflows with user roles and editing rights.

The primary purpose of CODA, as its acronym suggests, is to provide machine-powered support to users developing ontologies. Ontologies are intrinsically software objects, and thus mentioning the support provided by machines could sound rather redundant. However, the spirit is the same as that of CASE (Computer-Aided Software Engineering) tools, that is, to provide not only a tool for editing ontologies, but also all the functionalities and machinery necessary to automate the process, facilitating the task and reducing the effort of the human user. Ontologies, in fact, are a special kind of object, and their editing requires different expertise and knowledge, difficult to find in a single person. Therefore, we developed an interactive system supporting semi-automatic knowledge acquisition for RDF datasets, CODA-Vis [12]. It focuses on end users, typically domain experts and/or knowledge engineers, by supporting them in analyzing and validating the RDF triples generated by CODA. However, human users possess a resource that should not be underexploited: a wider background knowledge than the automatic processes that prompt them with suggestions. Their role should thus not be limited to analysis and validation; on the contrary, they should be active actors in improving the results produced by the machine, helped in this task by the machine itself. CODA-Vis naturally completes CODA, towards the realization of the envisioned computer-aided ontology development process. CODA has already pursued this objective by embracing a coordinated workflow in which different aspects of the knowledge acquisition, evolution and management processes can be properly addressed. Thanks to its layered architecture, CODA enables a clear separation of developer roles: NLP experts work on the extraction of information, while RDF specialists focus on several separable tasks, such as normalizing the acquired values, defining the projection according to the target vocabularies, and managing the resolution of identities with respect to entities found in different sources. The key aspect in getting the best from Human-Computer Interaction is to put at the disposal of human users all the computational power and artificial intelligence that can ease and boost their advanced decisions. The kind of support provided to the human is thus expressed in terms of system architecture.

IV. SEMI-SUPERVISED KNOWLEDGE ACQUISITION

As previously mentioned, we are also concerned with business professionals interested in integrating structured big data with context-aware big data. The specific ontology structure and content require further tools for visualizing the extracted information in different modalities: triple-based for knowledge experts, relation-based for end users, and graphics-based for domain experts.

Human participation in the knowledge acquisition process can dramatically improve the quality of the results. CODA-Vis enables this human involvement by providing end users with:
- an effective presentation of the extracted metadata,
- metadata projection according to the domain of interest and its associated ontology vocabulary,
- dedicated interaction modalities supporting the validation and enrichment activities.

While humans can handle small data in a relatively informal manner, the specific support provided by CODA-Vis becomes relevant in assuring the scalability of the human involvement as the data volume grows.
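
To make the accept/reject/enhance interaction just described concrete, the following minimal Java sketch models a machine-suggested triple awaiting human validation; the class and its methods are hypothetical illustrations, not part of the actual CODA-Vis codebase.

// Hypothetical sketch (not the actual CODA-Vis API): a machine-suggested RDF
// triple awaiting human validation, with the accept/reject/enhance actions
// offered to the end user.
public class CandidateTriple {

    public enum Status { PENDING, ACCEPTED, REJECTED, ENHANCED }

    private final String subject;
    private final String predicate;
    private String object;                 // may be refined by the user
    private Status status = Status.PENDING;

    public CandidateTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    public void accept() { status = Status.ACCEPTED; }
    public void reject() { status = Status.REJECTED; }

    // The user replaces the suggested object with a more specific resource,
    // guided by suggestions computed from the underlying ontology.
    public void enhance(String moreSpecificObject) {
        this.object = moreSpecificObject;
        this.status = Status.ENHANCED;
    }

    @Override
    public String toString() {
        return subject + " " + predicate + " " + object + " [" + status + "]";
    }

    public static void main(String[] args) {
        CandidateTriple suggestion =
            new CandidateTriple("ex:entity1", "rdf:type", "ex:thing");
        suggestion.enhance("ex:moreSpecificThing");
        System.out.println(suggestion);   // ex:entity1 rdf:type ex:moreSpecificThing [ENHANCED]
    }
}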

The typical CODA scenario presents the acquisition of information from an external source, such as a document corpus, and its projection over a given target vocabulary. This is not limited to ontology population cases. A system based on CODA can be instantiated and deployed by assembling a UIMA extraction pipeline and other specific CODA components into a concrete system, thanks to the CODA workflow mechanisms (see the following section). The output of a CODA-based system consists of a set of RDF triples that can be fed to the target dataset. Before committing them, the produced triples can be inspected by the user, who can thus alter the results through dedicated tools. CODA-Vis supports this validation and enrichment phase, while using Semantic Turkey as a scalable manager for the semantic content. The validation aspect is conceptually simple, as actions are limited to accepting/rejecting individual triples. Additional features, enabling better interaction and review of the source content, are:
- exploration of the source data which originated the suggestion, and of the semantic context of the elements in the triple;
- alternative ways of representing the acquired content (e.g., list of triples, focus on common subjects, or groups of subject/predicate, etc.).

However, the most interesting aspect for which the human user actively enters the knowledge acquisition process consists in the possibility of enhancing the automatically acquired knowledge, in order to augment its preciseness and its binding to the specific domain. Let us assume that a UIMA Named Entity Recognizer (NER) has identified the mention of the person Armando in the document. Then CODA, driven by a PEARL rule, produces the following triple:

art:armando rdf:type ex:person

However, a further evolution of the ontology for the academic domain contains specializations of the generic class ex:person, such as ex:researcher, ex:professor, and so on. In order to enable triple modification and improvement, CODA-Vis has to present and organize the relations between the extracted information and the live underlying RDF dataset, thus supporting the user in making more effective decisions. The importance of the visualization and comparison with the RDF dataset stands out exactly in the process of modification (enhancement) of an extracted RDF triple: the visualization of a class tree can lead the user to a specialization of the extracted information, improving the relevance and accuracy of the triple. With respect to the current example, the user, being aware of the textual context of the triple, might select the more specific class ex:researcher, thus producing the more specific triple:

art:armando rdf:type ex:researcher

V. TECHNICAL INSIGHTS

For dealing with all the previous aspects of semantic Big Data processing, our infrastructure integrates various technologies:
1. UIMA for analytics management,
2. CODA for metadata triplification,
3. Semantic Turkey as an extensible platform for knowledge-based applications,
4. CODA-Vis as an extension of Semantic Turkey for human-enabled knowledge acquisition,
5. HORUS [13] as a reasoner over semantic data, which is able to show the reasoning process to the user,
6. VocBench as a collaborative web-based thesaurus editor, based on Semantic Turkey.

Apart from UIMA, all of the above have been developed in the context of the ART Lab at Tor Vergata University. As previously stated, CODA generates RDF triples out of the metadata produced by the UIMA annotators over unstructured content accessible on the Web.
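
For concreteness, the following minimal Java sketch shows how the suggested triple above could be committed to a repository and later replaced by the user's more specific choice, using the Sesame 2 API [15]; the namespaces and the program structure are illustrative assumptions, not CODA code.

import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.vocabulary.RDF;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.memory.MemoryStore;

// Minimal sketch (not CODA code): storing the suggested triple with the
// Sesame 2 API and replacing it with the user's more specific choice.
public class EnhancementSketch {
    public static void main(String[] args) throws Exception {
        Repository repo = new SailRepository(new MemoryStore());
        repo.initialize();

        ValueFactory vf = repo.getValueFactory();
        String ART = "http://art.example.org/";   // invented namespaces
        String EX  = "http://example.org/onto#";
        URI armando    = vf.createURI(ART, "armando");
        URI person     = vf.createURI(EX, "person");
        URI researcher = vf.createURI(EX, "researcher");

        RepositoryConnection conn = repo.getConnection();
        try {
            // Triple suggested by the automatic pipeline: art:armando rdf:type ex:person
            conn.add(armando, RDF.TYPE, person);

            // The user, aware of the textual context, picks the subclass ex:researcher:
            conn.remove(armando, RDF.TYPE, person);
            conn.add(armando, RDF.TYPE, researcher);
        } finally {
            conn.close();
            repo.shutDown();
        }
    }
}
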
This process is driven by triplification rules expressed in the PEARL language [9], and it is influenced by the target RDF dataset. The adoption of UIMA allows the reuse of existing analytics, which results in greater quality and reduced development effort. At the same time, the execution of PEARL rules adapts the extracted information to match the vocabularies and the modeling patterns found in the target semantic repository, while assuring that the newly generated knowledge is properly integrated with the already assessed one.

The PEARL language has been enriched with a mechanism for annotating the output triples (or elements of them), in order to facilitate their interpretation by later processing components. This mechanism is general and domain-agnostic, as the annotation semantics are not hard-wired in PEARL, but defined by dedicated extensions. In fact, the enhancing mechanism of CODA-Vis uses one such annotation vocabulary. Continuing the example introduced in the previous section, the triple ?x rdf:type ex:person is associated with the annotation @Class, to indicate the fact that the resource ex:person is an owl:Class. CODA-Vis elaborates the annotated RDF triples and looks into a correspondence table between annotations and enhancers, which are different interaction modalities to assist the user. In the specific case we are discussing, the enhancer for @Class provides the user with the subclasses of the class ex:person. By providing the user with a focused slice of the semantic repository, this enhancer clearly helps the user to spot the right specialization of the concept. CODA-Vis supports the plugging of new enhancers, which declare their compatibility with the corresponding annotations.
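
As an illustration of how an enhancer registered for the @Class annotation might retrieve the candidate specializations, the following Java sketch queries a repository for the subclasses of the suggested class through the Sesame 2 API [15]; the helper class and the query are assumptions, not the actual CODA-Vis enhancer interface.

import java.util.ArrayList;
import java.util.List;

import org.openrdf.model.Value;
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;

// Illustrative sketch of an "@Class" enhancer: given the class suggested as a
// triple's object, list its subclasses so the user can pick a specialization.
public class ClassEnhancerSketch {

    public static List<Value> subclassesOf(RepositoryConnection conn, String classUri)
            throws Exception {
        String query =
            "SELECT ?sub WHERE { ?sub <http://www.w3.org/2000/01/rdf-schema#subClassOf> <"
            + classUri + "> }";
        TupleQuery tupleQuery = conn.prepareTupleQuery(QueryLanguage.SPARQL, query);
        TupleQueryResult result = tupleQuery.evaluate();
        List<Value> subclasses = new ArrayList<Value>();
        try {
            while (result.hasNext()) {
                BindingSet bindings = result.next();
                subclasses.add(bindings.getValue("sub"));
            }
        } finally {
            result.close();
        }
        return subclasses;
    }
}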

As a further facility, CODA-Vis has an internal reasoner, which annotates the triples produced by CODA when these annotations are applied in the PEARL rule. In fact, the aforementioned annotation is automatically generated by a simple reasoner, which recognizes the rdf:type property.

VI. REASONER USE AND RESULTS VISUALIZATION

There are several contexts in which users could be interested in following the reasoning process, for instance: understanding how it behaves, comparing results in different application domains, or comparing results with their own expectations related to specific conceptualizations. End users who want to know what can be inferred from an ontology, and to understand how the inferred knowledge was produced, are interested in a tool for visualizing the results of inference processes. In general terms, the inputs of a reasoner are: a vocabulary; the data stored in the ontology; a list of rules. Using those inputs, a reasoner produces new knowledge, hopefully in the same standard in which the ontology is written. Once a reasoner has been chosen, it can be used in the following ways: as a standalone tool to infer new knowledge, saving it in a particular serialization; or as a component inside a framework, to immediately observe the inferred knowledge. The evaluation of a reasoner is characterized by which inferences it is able to run (from a list of inference rules), by its scalability with respect to the size of the ontology it analyzes, and by the time it needs to process it. Currently, to the best of our knowledge, some reasoners (such as the ones found in Protégé 4.3) can provide an explanation of the inferred knowledge just as a list of the explicit triples used by the reasoner; no further information is provided regarding the inferred knowledge produced and subsequently used along the reasoning process. For complex reasoning, it can be difficult to follow the entire process.

To better support end users in becoming aware of the results of semantic processing, also in the context of Big Data, we decided to develop a reasoner (HORUS) characterized by the following features:
- it is open source and implemented as a Java library;
- new rules can easily be added using an intuitive language based on the RDF family of standards (the rules are written as RDF triples);
- it visualizes how inferred triples were generated from other RDF triples (both explicit and inferred ones).

HORUS can be used inside Semantic Turkey, thanks to an ad hoc extension developed for this purpose.

In this case, the user can decide which file containing the inference rules to load, which rules from that file to use, and how many iterations the reasoner should perform. Once the user has selected the rule file to load and the rules to use from that file, HORUS executes the reasoning process over the ontology currently managed by Semantic Turkey. At the end of such a process, the inferred RDF triples are added to the ontology in a separate graph (the user can decide to remove the graph, with all the inferred triples, at any time). An example of the resulting graph can be seen in Fig. 2.

Fig. 2. HORUS inference graph

In this example, we have executed HORUS on an ontology containing some information about politicians and their families. Some of the RDF triples contained in this ontology were about Barack Obama's family, in particular:

hasRelative type TransitiveProperty ;
hasRelative type SymmetricProperty ;
hasParent subPropertyOf hasRelative ;
hasSpouse subPropertyOf hasRelative ;
Barack hasSpouse Michelle ;
MaliaAnn hasParent Barack ;
Natasha hasParent Barack ;

HORUS is able to infer the triple Michelle hasRelative MaliaAnn. The justification of such an inference is clearly shown in the graph depicted in Fig. 2. We now follow that reasoning step by step. From:

Barack hasSpouse Michelle ;
hasSpouse subPropertyOf hasRelative ;

the triple Barack hasRelative Michelle is inferred. Then, in a similar way, from:

MaliaAnn hasParent Barack ;
hasParent subPropertyOf hasRelative ;

the triple MaliaAnn hasRelative Barack is obtained. Combining the two inferred triples with the explicit triple hasRelative type TransitiveProperty, HORUS infers MaliaAnn hasRelative Michelle. Finally, using:

hasRelative type SymmetricProperty ;
MaliaAnn hasRelative Michelle (previously inferred) ;

HORUS is able to infer the triple Michelle hasRelative MaliaAnn. Using the same input RDF triples, HORUS is able to infer the triple Michelle hasRelative Natasha by following the same reasoning, as the triple Natasha hasParent Barack plays the same role as the triple MaliaAnn hasParent Barack used in this example.
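
The derivation just described can be reproduced with a few lines of rule application code. The following self-contained Java sketch, which illustrates the rule semantics and is not HORUS code, applies the subPropertyOf, symmetric-property and transitive-property rules to a fixpoint over the example triples and checks that Michelle hasRelative MaliaAnn is obtained.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Self-contained sketch (not HORUS itself) of the forward-chaining reasoning
// described above: subPropertyOf, SymmetricProperty and TransitiveProperty
// rules applied to a fixpoint over the family example.
public class ForwardChainingSketch {

    public static void main(String[] args) {
        Set<List<String>> kb = new HashSet<List<String>>(Arrays.asList(
            t("hasRelative", "type", "TransitiveProperty"),
            t("hasRelative", "type", "SymmetricProperty"),
            t("hasParent", "subPropertyOf", "hasRelative"),
            t("hasSpouse", "subPropertyOf", "hasRelative"),
            t("Barack", "hasSpouse", "Michelle"),
            t("MaliaAnn", "hasParent", "Barack"),
            t("Natasha", "hasParent", "Barack")));

        boolean changed = true;
        while (changed) {                       // apply the rules until a fixpoint is reached
            changed = false;
            Set<List<String>> inferred = new HashSet<List<String>>();
            for (List<String> a : kb) {
                for (List<String> b : kb) {
                    // subPropertyOf: (p subPropertyOf q), (s p o)  =>  (s q o)
                    if (a.get(1).equals("subPropertyOf") && b.get(1).equals(a.get(0)))
                        inferred.add(t(b.get(0), a.get(2), b.get(2)));
                    // symmetric: (p type SymmetricProperty), (s p o)  =>  (o p s)
                    if (a.get(1).equals("type") && a.get(2).equals("SymmetricProperty")
                            && b.get(1).equals(a.get(0)))
                        inferred.add(t(b.get(2), b.get(1), b.get(0)));
                    // transitive: (p type TransitiveProperty), (s p o), (o p z)  =>  (s p z)
                    if (a.get(1).equals("type") && a.get(2).equals("TransitiveProperty")
                            && b.get(1).equals(a.get(0)))
                        for (List<String> c : kb)
                            if (c.get(1).equals(a.get(0)) && c.get(0).equals(b.get(2)))
                                inferred.add(t(b.get(0), b.get(1), c.get(2)));
                }
            }
            changed = kb.addAll(inferred);
        }

        // Prints true: the triple discussed in the text has been derived.
        System.out.println(kb.contains(t("Michelle", "hasRelative", "MaliaAnn")));
    }

    private static List<String> t(String s, String p, String o) {
        return Arrays.asList(s, p, o);
    }
}

Unlike this sketch, HORUS also records, for each inferred triple, the premises that produced it, which is what enables the justification graph of Fig. 2 and the textual log described below.
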
Another feature of HORUS (as shown on the left side of Fig. 2) is that the end user is able to switch between the graph representation and the textual one. The latter consists of a log file containing the information on how each inferred triple was generated and the explanations for all the inconsistencies contained in the analyzed ontology. In addition, the user can eliminate all the inferred triples from the ontology (by deleting the RDF graph in which HORUS stored them). In this way the user becomes aware of why each new triple has been inferred, either through a graphical representation or by reading a log file containing the motivations for each decision taken by the reasoner; when dealing with a big ontology, the possibility of following why a particular triple was inferred can be crucial.

VII. BIG RESOURCES MANAGEMENT

In the wider process of knowledge acquisition from the Web, a special role is played by resources: large, heterogeneous, public, private, sharable, etc. Special attention has been dedicated to making them easy to access and to using the stored knowledge in different applications. Hereunder, we shall concentrate on the case of AGROVOC [14], a controlled vocabulary covering all areas of interest to the Food and Agriculture Organization (FAO, http://www.fao.org/) of the United Nations, including food, nutrition, agriculture, fisheries, forestry, environment, etc. A worldwide community of experts with different and heterogeneous backgrounds continuously edits the thesaurus, supported by a collaborative web-based thesaurus editor, called VocBench (VB).

In fact, to support the editing of AGROVOC, FAO developed VocBench, an open-source, web-based editing tool for multilingual thesauri and RDF-SKOS resources. The ART Lab of the University of Roma Tor Vergata (in the framework of a wider collaboration with FAO) proposed to merge VB with an already existing RDF management system: Semantic Turkey [11]. The resulting environment was ideal: wide support from the community (FAO and its partners) for assessing the user interaction and collaboration capabilities of VB on the one side, and a strong technological backbone provided by Semantic Turkey (ST) for managing RDF and coordinating components and system extensions on the other. VocBench today is a fully-fledged Semantic Web collaborative platform for the development of SKOS-XL concept schemes, no longer bound to a specific resource.

In Fig. 4, the VocBench 2.0 architecture is shown: the presentation layer (together with the business logic related to multi-user management) is located in the web application. This layer is powered by the Google Web Toolkit (http://www.gwtproject.org/). The service and data layers almost coincide with the Semantic Turkey RDF framework (without its original UI, originally deployed as a Firefox extension). ST features an OSGi-based architecture, offering different extension points for adding new functionalities to the platform. The two main extension points cover new service extensions and pluggable connectors for different triple store technologies. Access to the RDF data is mediated by the OWLART API (http://art.uniroma2.it/owlart/), an RDF library developed by the ART Lab to provide an abstraction layer over different triple store technologies. VB reuses most of the ST core services for basic system organization and for the management of RDF data and of SKOS (SKOS-XL in particular). However, additional services have been developed specifically for VB, and deployed as a dedicated OSGi bundle.

Fig. 4. VocBench 2.0 extensible architecture

As for the pluggable triple stores, VB is distributed together with a connector for Sesame2 [15], supporting all of its storage/connection possibilities: in-memory, native, and remote connections, together with their configuration. Other connectors for different RDF middleware (such as Jena), or still based on Sesame2 with slight modifications for custom clients (such as for Allegrograph, http://www.franz.com/agraph/allegrograph/), are available as open source projects, though not regularly maintained. At FAO, AGROVOC and the other datasets are stored in an OWLIM [16] repository, accessed through a Sesame2 remote connection. In the near future, a dedicated OWLIM client allowing for create/delete operations at the repository level will be developed.

Fig. 3. VocBench 2.0 user interface

Fig. 3 offers a typical view of VB, with the concept tree on the left and the description of the selected concept on the right, centered on the term tab, which lists all the terms in the different languages available for the resource. Currently VB supports the editing of concept schemes modeled in SKOS-XL, an extension vocabulary for SKOS allowing for reified labels: labels are in fact represented as RDF resources, which have a skosxl:literalForm (the actual label value) and other additional properties, such as editorial metadata (date of creation/modification), further attributes, or specific relationships with other concepts or labels. Offline tools are also available for different purposes: lifting SKOS thesauri to SKOS-XL, flattening SKOS-XL labels to simple literals connected through SKOS core label properties, and performing integrity checks and model transformations.
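
As an illustration of such an offline tool and of the Sesame2 remote connection mentioned above, the following Java sketch flattens SKOS-XL preferred labels to SKOS core skos:prefLabel literals; the server URL and repository name are invented placeholders, not the actual FAO deployment.

import org.openrdf.model.Resource;
import org.openrdf.model.URI;
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.http.HTTPRepository;

// Illustrative offline tool: open a Sesame 2 remote connection (placeholder
// server URL and repository id) and copy each SKOS-XL preferred label's
// literal form onto the owning concept as skos:prefLabel.
public class SkosXlFlatteningSketch {
    public static void main(String[] args) throws Exception {
        HTTPRepository repo =
            new HTTPRepository("http://localhost:8080/openrdf-sesame", "thesaurus");
        repo.initialize();
        RepositoryConnection conn = repo.getConnection();
        try {
            URI skosPrefLabel = conn.getValueFactory()
                    .createURI("http://www.w3.org/2004/02/skos/core#", "prefLabel");
            String query =
                "PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#> " +
                "SELECT ?concept ?literal WHERE { " +
                "  ?concept skosxl:prefLabel ?xlabel . " +
                "  ?xlabel skosxl:literalForm ?literal }";
            TupleQueryResult result =
                conn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            try {
                while (result.hasNext()) {
                    BindingSet bs = result.next();
                    // Attach the reified label's literal form directly to the concept.
                    conn.add((Resource) bs.getValue("concept"), skosPrefLabel,
                             bs.getValue("literal"));
                }
            } finally {
                result.close();
            }
        } finally {
            conn.close();
            repo.shutDown();
        }
    }
}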

These tools support the data manipulation operations required to import pre-existing data or to meet specific requirements for publishing data as Linked Open Data.

VIII. CONCLUSIONS

Developing new applications in the wider Web context calls for users with deep technical competences to orchestrate the sharing of different types of information among different business applications. The very quick evolution of technology, together with the limited knowledge of the underlying models of the heterogeneous data accessible on the Web, makes the work of application developers very difficult. In this paper we have introduced, with a bird's eye view, what is currently being done inside the ART Lab at the University of Roma Tor Vergata to make the deployment of new and effective applications based on Big Data both possible and easy.

REFERENCES

[1] T. Heath and C. Bizer, "Linked data: Evolving the web into a global data space," Synthesis Lectures on the Semantic Web: Theory and Technology, vol. 1, no. 1, pp. 1-136, 2011.
[2] T. Berners-Lee, "Linked Data," 27 July 2006. [Online]. Available: http://www.w3.org/DesignIssues/LinkedData.html. [Accessed 6 October 2013].
[3] P. Velardi, R. Navigli, A. Cucchiarelli and F. Neri, "Evaluation of OntoLearn, a methodology for automatic population of domain ontologies," in Ontology Learning from Text: Methods, Applications and Evaluation, IOS Press, 2005.
[4] P. Buitelaar, D. Olejnik and M. Sintek, "A Protégé Plug-In for Ontology Extraction from Text Based on Linguistic Analysis," in Proceedings of the 1st European Semantic Web Symposium (ESWS), Heraklion, Greece, 2004.
[5] P. Cimiano and J. Völker, "Text2Onto - A Framework for Ontology Learning and Data-driven Change Discovery," in Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems, Alicante, 2005.
[6] I. Augenstein, S. Padó and S. Rudolph, "LODifier: Generating Linked Data from Unstructured Text," in The Semantic Web: Research and Applications, Berlin, 2012, pp. 210-224.
[7] P. Exner and P. Nugues, "Entity extraction: From unstructured text to DBpedia RDF triples," in The Web of Linked Entities Workshop (WoLE 2012), Boston, 2012.
[8] M. Fiorelli, M. T. Pazienza, A. Stellato and A. Turbati, "CODA: Computer-aided Ontology Development Architecture," IBM Journal of Research and Development, vol. 58, no. 2/3, pp. 14:1-14:13, 2014.
[9] M. T. Pazienza, A. Stellato and A. Turbati, "PEARL: ProjEction of Annotations Rule Language, a Language for Projecting (UIMA) Annotations over RDF Knowledge Bases," in LREC, Istanbul, 2012.
[10] D. Ferrucci and A. Lally, "UIMA: an architectural approach to unstructured information processing in the corporate research environment," Nat. Lang. Eng., vol. 10, no. 3-4, pp. 327-348, 2004.
[11] M. T. Pazienza, N. Scarpato, A. Stellato and A. Turbati, "Semantic Turkey: A Browser-Integrated Environment for Knowledge Acquisition and Management," Semantic Web Journal, vol. 3, no. 3, pp. 279-292, 2012.
[12] M. Fiorelli, R. Gambella, M. T. Pazienza, A. Stellato and A. Turbati, "Semi-Automatic Knowledge Acquisition Through CODA," in 27th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (IEA-AIE 2014), Kaohsiung, Taiwan, 2014.
[13] G. L. Napoleoni, M. T. Pazienza and A. Turbati, "HORUS: a Configurable Reasoner for Dynamic Ontology Management," in Sixth International Conference on Advanced Cognitive Technologies and Applications (COGNITIVE 2014), Venice, Italy, 2014.
[14] C. Caracciolo, A. Stellato, A. Morshed, G. Johannsen, S. Rajbhandari, Y. Jaques and J. Keizer, "The AGROVOC Linked Dataset," Semantic Web Journal, vol. 4, no. 3, pp. 341-348, 2013.
[15] J. Broekstra, A. Kampman and F. van Harmelen, "Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema," in The Semantic Web - ISWC 2002: First International Semantic Web Conference, Sardinia, Italy, 2002.
[16] A. Kiryakov, D. Ognyanov and D. Manov, "OWLIM - a Pragmatic Semantic Repository for OWL," in Int. Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2005), WISE 2005, New York City, USA, 2005.