Proceedings of the International Conference on Information Technologies (InfoTech-2011)
15-16 September 2011, Bulgaria

THE SEMANTIC WEB AND ITS APPLICATIONS

Dimitar Vuldzhev
National High School of Mathematics and Science, Sofia, Bulgaria
e-mail: vouldjeff@gmail.com

Abstract: This paper gives a brief introduction to the concept of the Semantic Web and to the idea of using ontologies. The main objective of the project is to implement such an application, relying on the studies conducted. Some of the problems which occurred in the course of realizing these aims are also discussed.

Key words: semantic web, intelligent systems, ontologies, collaborative database

1. INTRODUCTION

Have you ever done research and had to collect and process an enormous amount of data? At some point you need help from a computer. The Semantic Web (Daconta et al., 2003) is a concept which enables machines to understand the meaning of data that already exists on the web. In essence it is nothing but a unified method for storing information using metadata (data about the data), or what we could also call a machine-readable representation. The aim of this project is to present an application of this kind, with data in Bulgarian, which collects information from various sources and helps us find just the right data, filter it, and so on. In the course of realizing the objectives the following main problems were solved:
- Detailed study of the existing standards and similar applications;
- Choosing technologies;
- Consideration of automated methods for data extraction;
- Implementing a priority queue for delayed jobs;
- Adding a reliable system for tracking data changes;
- Optimization of the database schema and backend.
There are a few Semantic Web applications; most of them target a particular segment of the market, while several others aim to be the "semantic Wikipedia" (Auer et al., 2007). Unfortunately, besides the fact that their information is available only in English, they do not offer instruments for working with the data (unless you are a programmer).

2. PROBLEM DEFINITION

For the Semantic Web to become reality, a huge amount of data in a standardized format must exist. Moreover, not only access to the data is needed, but also the relations between them, because in that way one resource leads us to another. All such interconnected collections of data are called Linked Data. Linked Data relies on two fundamental web technologies: URI and HTTP. Although a URI is commonly thought of as the web address of a document, its actual purpose is to give a unique identity to every resource. The creator of the WWW, Tim Berners-Lee, defines Linked Data (Berners-Lee et al., 2009) by giving the following four rules:
- Use URIs to name things;
- Use HTTP URIs so that people can look those resources up;
- When somebody opens a certain URI, provide useful information using the standards;
- Include other URIs so that people can discover new things.

2.1. Resource Description Framework

The web space we are so used to consists of interconnected documents. In the Semantic Web we call things resources: Shakespeare and Stratford are both examples of resources. That is why the fundamental technology is called the Resource Description Framework. RDF is not a complex concept; it is just a way of serializing statements. Consider the following example: "The author of this project is Dimitar." Every RDF statement consists of three parts: a subject (this project), an object (Dimitar) and a predicate (author). Having this in mind, together with the rules of Tim Berners-Lee, we can build a graph.
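The statement above can be written down as a triple in code. The following is a minimal sketch in Ruby; the example.org URIs are hypothetical identifiers chosen for illustration, not ones used by the actual application:

```ruby
# An RDF statement modeled as a (subject, predicate, object) triple.
Triple = Struct.new(:subject, :predicate, :object)

statement = Triple.new(
  "http://example.org/resource/this-project", # subject: this project
  "http://example.org/ontology/author",       # predicate: author
  "http://example.org/resource/Dimitar"       # object: Dimitar
)

# A collection of such triples forms the RDF graph.
graph = [statement]
```

Following the Linked Data rules, each of the three parts is an HTTP URI, so a reader (human or machine) who dereferences the object can discover further statements about Dimitar.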
Fig. 1: A simple graph which visualizes the aforementioned statement.

Every statement is called an RDF triple, and the structure of the predicates an ontology. It may look too simple, but exactly because of that RDF is such an important building block. All the RDF statements together form a graph, and people in the field of computer science can tell us a lot about the efficiency of graphs.

2.2. Ontologies

All the predicates which describe a certain kind of subject from the real world are called an ontology. Examples of ontologies are Person, Animal, Place, etc. The reason ontologies are so important is that they define a standard for the predicates' names: if each of us named a predicate however he liked, the whole concept of the Semantic Web would lose its purpose, since there would be no communication between different systems.

3. PROBLEM SOLUTION

3.1. Architecture

The architecture of the proposed application is three-tier: User, Business Logic, Database. For a database management system we use the so-called document-oriented database MongoDB. Unlike typical relational databases, which keep data in many tables with relations between them, a document-oriented database saves everything about a certain resource in a single document (JSON formatted). Some of the major advantages which influenced this choice are:
- document-oriented, without a strict schema;
- support for arrays, hashes and embedded documents;
- support for indexes;
- availability of so-called atomic updates;
- a simple but powerful query language plus MapReduce;
- built-in methods for easy scaling.

For developing the application itself we have chosen the programming language Ruby and the framework Rails, a very powerful, agile and popular combination. At its core lies the MVC pattern: division of the application into three parts, a model (the database layer), a view (the user interface) and a controller (the business logic).

3.2. Automatic Information Extraction

For the application to be useful there must be a large amount of data. Collecting it by hand is not within the reach of one person, at least not within a reasonable period of time, so a module for automatic information extraction was built, which was not an easy task. In order to extract information you provide the resource's name and say whether you want it translated. The process is as follows:
1. The resource is searched for in Freebase and the result is loaded.
2. If translation is on, the result is passed to Google Translate.
3. A check is made for an already existing resource with that key. If there is one, only the new data will be saved.
4. For every ontology in the result a check for existence is made. If it does not exist, a new one is constructed.
5. Every property from the ontology is processed and filled into the resource. If the ontology is newly created, the property is added to its schema.
6. After completion a flag is added to the new resource.
7. Extra information is extracted from Twitter, IMDb and other sources.
8. If an error occurs during any of the steps, an exception is thrown.
9. You are now able to view the newly extracted resource!

The module also offers rollback functionality: everything the extractor has made is reverted to its previous state.

3.3. Delayed Jobs

Operations such as automatic information extraction require more system resources, load the machines and take longer to execute.
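The extraction steps of section 3.2 can be sketched as follows. This is a hypothetical illustration only: the helper names and the stubbed lookups are assumptions, not the application's actual API, and the external Freebase and Google Translate calls are replaced by injectable stand-ins:

```ruby
# Hypothetical sketch of the extraction pipeline (section 3.2).
# External services are injected as callables so the sketch is self-contained.
class Extractor
  def initialize(search: ->(name) { {} }, translate: ->(data) { data })
    @search = search       # stand-in for the Freebase lookup (step 1)
    @translate = translate # stand-in for Google Translate (step 2)
  end

  def extract(name, translate: false)
    result = @search.call(name)                   # step 1: load the resource
    result = @translate.call(result) if translate # step 2: optional translation
    result # steps 3-7 (key check, ontology checks, properties, flag) follow
  rescue StandardError
    rollback # step 8: revert whatever was created, then re-raise
    raise
  end

  def rollback
    # revert everything the extractor has made (section 3.2, last paragraph)
  end
end
```

A caller would wire the real service clients into `search:` and `translate:`; injecting them also makes the pipeline testable without network access.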
Therefore executing them during a standard user request is very inefficient and degrades the operation of the system as a whole. Such operations will be called jobs. Jobs have certain parameters and are added to a priority queue. Separate system processes called workers take one task at a time from the queue and execute it. After success or failure the result is saved in a log.

3.4. Tracking Changes

In a system where everybody has the right to edit information, abuse is possible, and it would be a pity for laboriously collected data about a certain resource to disappear just like that. That is the reason for implementing a change-tracking module, which allows reverting to previous versions. There are several known approaches to this task: keep the whole resource after every edit, or keep only the edit itself. Unfortunately both options have drawbacks: the first takes a lot of storage space, and with the second you have to merge the changes in order to read a resource. The approach used in the application is somewhere in the middle: the last version is kept in the database, together with the old values of only the fields that were changed. In this way we do not have to perform merges while reading, and it does not take a lot of space. Example:

{title: "test", description: "description"}

After editing, the database holds the current document:

{title: "test", description: "description1", extra: 123}

and the change record:

{description: "description", added: ["extra"], version: 1}

4. CONCLUSION

In this project a brief introduction into the world of the Semantic Web has been made. The second part presents a semantic application; its architecture, database and platform are described. From the Eleventh Students' Conference in January 2011 to the present day the system has changed a lot: the module for automatic extraction works stably, a system for tracking changes has been implemented, and much more. As for future plans, the presented application could be further developed in some of the following areas:
- Creating a powerful module for ontology editing;
- Collecting data from other sources;
- Using different algorithms for manipulating the existing information.
The author hopes that, given the nature of the problem, the application will provoke interest and prove useful to the public.
REFERENCES

Auer S., Bizer C., Kobilarov G., Lehmann J., Cyganiak R., Ives Z. (2007). DBpedia: A Nucleus for a Web of Open Data. http://www.informatik.uni-leipzig.de/~auer/publication/dbpedia.pdf
Berners-Lee T., Bizer C., Heath T. (2009). Linked Data: The Story So Far. http://tomheath.com/papers/bizer-heath-berners-lee-ijswis-linked-data.pdf
Daconta M., Obrst L., Smith K. (2003). The Semantic Web: A Guide to the Future of XML, Web Services, and Knowledge Management. Wiley.