Big Data Management
Assessed Coursework Two
Big Data vs Semantic Web
F21BD

Boris Mocialov (H00180016)
MSc Software Engineering
Heriot-Watt University, Edinburgh
April 5, 2015

1 Introduction

The purpose of this essay is to give an overview of two scientific areas, namely Big Data Management and Semantic Web Technologies, covering the objectives and the technologies of each field. The overview of each area is followed by a short example that places the area in context. After these introductions, the essay argues that the two areas are related by presenting some recent applications that combine techniques from both fields, and that this relationship is exploited in the same way every time the two are combined in practice. An additional section at the end gives a more detailed overview of one particular application of both fields.

2 Introduction to Big Data Management

Big Data is a term that describes possibly inconsistent and uncertain data that comes in large volumes and different forms and is produced at high speed. Given this description, tools that operate upon and manage Big Data should capture, process, and analyse the data in a way that overcomes these difficulties. Big Data Management incorporates such tools and techniques. Data Science Series (2012) gives an extended list of possible benefits of turning to Big Data resources, for both businesses and customers. The list suggests that Big Data can be advantageous to any company, independent of the sector or niche it occupies, as new opportunities for data utilisation can be discovered and exploited.

2.1 Objectives

As stated above, Big Data Management is supposed to use appropriate tools and techniques to make it possible to capture, process, and analyse data that is fast, large, uncertain, and heterogeneous. Chen and Zhang (2014) present an exhaustive list of the challenges that Big Data poses for computing. The list includes storage problems, I/O speed, network throughput, data curation, and processing power as umbrella categories over more detailed challenges. Overcoming the listed challenges is the objective of Big Data Management, in service of the overall aim of being able to store, process, and analyse large amounts of varied and uncertain data.

2.2 Technologies

Given these objectives and current challenges, Chen and Zhang (2014) discuss possible approaches to improve the handling of Big Data. For instance, to deal with inconsistent, incomplete, and/or noisy data, cleaning, integration, and transformation can be considered. The challenge is to perform all these tasks live, as data becomes available. One solution for fast processing is, of course, parallel handling of the data.

The current solution for managing Big Data that possibly comes from distributed sources is NoSQL databases. NoSQL is more of a philosophy than a single technique or tool: it describes a set of approaches by which Big Data Management can be accomplished. For instance, some NoSQL databases may or may not use relations, some do not use SQL as the management language, and some employ schema-free, schemaless, or flexible-schema policies. In addition, different approaches to storing data are used: some systems use key-value storage, some a variation known as key-document storage, and others column families or even graphs. What all NoSQL databases have in common is their commitment to a dynamic schema as an underlying feature, which is an advantage when dealing with heterogeneous data. Another common factor is the separation between the storage and the management of the data.
While storage happens in one of the previously mentioned fashions, management is implemented in the application layer. This means that when dirty data is extracted from the system, it is handed to the application layer, which must decide what should be processed further and what is not needed for this particular extraction.

Some of the state-of-the-art approaches to Big Data Management that Chen and Zhang (2014) discuss include statistical analysis of the data at hand, data mining, and the use of neural networks together with machine learning algorithms to discover patterns in varied data and to cluster the discovered items into classes.

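The separation between schema-free storage and application-layer management described above can be illustrated with a minimal sketch. It assumes a toy in-memory key-document store (a plain Python dictionary, not the API of any real NoSQL product); the names `store`, `REQUIRED_FIELDS`, and `extract_clean` are hypothetical and chosen only for illustration. The store accepts any document, and the application decides at extraction time which records are usable.

```python
# Toy key-document store: a dict mapping keys to schema-free documents.
# This only stands in for a NoSQL database; no real product's API is used.
store = {
    "sensor:1": {"city": "Edinburgh", "temp_c": 11.5},
    "sensor:2": {"city": "Glasgow"},                    # incomplete record
    "sensor:3": {"city": "Aberdeen", "temp_c": "n/a"},  # noisy record
    "sensor:4": {"city": "Dundee", "temp_c": 9.0, "humidity": 0.8},
}

# Hypothetical fields needed by one particular extraction.
REQUIRED_FIELDS = ("city", "temp_c")


def extract_clean(documents, required=REQUIRED_FIELDS):
    """Application-layer cleaning: keep only documents usable for this query."""
    clean = {}
    for key, doc in documents.items():
        if not all(field in doc for field in required):
            continue  # incomplete record, skipped for this extraction
        if not isinstance(doc["temp_c"], (int, float)):
            continue  # noisy value, skipped for this extraction
        # Keep only the fields this extraction needs.
        clean[key] = {field: doc[field] for field in required}
    return clean


clean_docs = extract_clean(store)
print(sorted(clean_docs))
```

The store itself is never modified: the dirty records stay available for other extractions with different requirements, which is the point of pushing management into the application layer.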
2.3 Big Data Example

As is apparent from the description above, Big Data can provide additional revenue to any company that deals with data. Apart from monetary interest, Big Data can provide new knowledge to science, as there is potential value hidden inside any data. A simple but powerful example is the notion of smart cities. Data Science Series (2012) also gives this as an example of Big Data, but smart cities can additionally be viewed as an encapsulation of services, such as health services, public services, transport services, and more. In the case of health services, patients can have a personalised doctor on their wrist that sends data to an actual doctor, or even an AI that records data at every moment of the patient's life and gives the person direct advice on how to improve his/her life. Public services can, for example, monitor traffic developments, gatherings of people, forums, etc. and act upon this data for the good of the citizens. As for transportation, public transport operators can cooperate and provide services only to the places where they are needed.

3 Introduction to Semantic Web Technologies

The Semantic Web is the idea of adding meaning to the things that are found on the World Wide Web. The purpose of the added meaning is to allow machines to reason about these things.

3.1 Objectives

Shadbolt et al. (2006) write that e-science, the source of the need for the technology, is a major driver for the Semantic Web because of the need to integrate heterogeneous data sets that come from different scientific communities. Such integration can be achieved through the use of ontologies - standards for the formal naming and definition of the entities, properties, and relations within one particular domain.
The rationale behind integrating data from a wide range of fields is the movement towards interdisciplinary science - the fusion of different disciplines in pursuit of new shared knowledge. Therefore, certain standards should be enforced to allow distributed and heterogeneous data to merge into meaningful, unambiguous knowledge in any domain.

3.2 Technologies

The key technologies (or rather techniques) of the Semantic Web start with URIs, which identify resources. Given the URI of a resource, anyone can refer to it. URIs are the building blocks of RDF: each part of a subject-predicate-object triple can be identified by a URI (objects may also be literal values), and the triple, in turn, relates the subject to the object. When building an application, an RDF vocabulary can be used to specify the set of predicates used within that application. A vocabulary serves as an abstraction over individual RDF statements and provides a single entry point through which vocabularies can be linked. RDF Schema (RDFS) extends RDF with a vocabulary for describing groups of related resources (classes) and the relationships between them. Neither a shared vocabulary nor an RDF Schema is strictly required in order to publish RDF, but both make the data easier to integrate. Triple stores are databases that hold RDF triples and provide facilities for querying them. To provide standardised access to triple stores, the SPARQL language was developed to query the underlying RDF data. The OWL languages add further expressiveness on top of RDFS to make the represented knowledge richer. In addition, OWL languages support ontology consistency checking (Shadbolt et al., 2006). Turning to tools, it is worth mentioning Protege, an ontology editor and validator.

3.3 Semantic Web Application Example

A commonly cited example of Semantic Web applications is, perhaps, e-science. Because ontologies can be distributed and combined using technologies such as the OWL languages, e-science groups can work in a distributed fashion, synchronising their findings and building common knowledge while maintaining a common ontology that defines the domain and range of the research the collaborating parties are engaged in. As long as a common ontology is defined and obeyed during synchronisation, each party can change its underlying models, terms, and definitions as it wishes (local requirements or laws may enforce such differences).

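The triple model, pattern-based querying, and schema-level description outlined in Section 3.2 can be sketched in a few lines. This is a minimal illustration in plain Python, not an RDF library: triples are tuples of strings, `match` imitates a single SPARQL triple pattern, and `infer_types` applies only one RDFS rule (class membership propagates along `rdfs:subClassOf`). The `http://example.org/` namespace and the resources in it are hypothetical; the `rdf:type` and `rdfs:subClassOf` URIs are the standard W3C ones.

```python
# Triples as (subject, predicate, object) tuples; URIs are plain strings.
EX = "http://example.org/"  # hypothetical namespace for the example data
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

triples = {
    (EX + "aspirin", RDF_TYPE, EX + "Drug"),
    (EX + "Drug", RDFS_SUBCLASS, EX + "Chemical"),
    (EX + "aspirin", EX + "treats", EX + "headache"),
    (EX + "aspirin", EX + "label", "Aspirin"),  # the object is a literal
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts like a SPARQL variable."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def infer_types(graph):
    """Apply one RDFS rule until nothing new is found:
    (x type C) and (C subClassOf D) implies (x type D)."""
    inferred = set(graph)
    changed = True
    while changed:
        changed = False
        # match() returns lists, so adding to the set while looping is safe.
        for x, _, c in match(inferred, p=RDF_TYPE):
            for _, _, d in match(inferred, s=c, p=RDFS_SUBCLASS):
                new = (x, RDF_TYPE, d)
                if new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


closure = infer_types(triples)
types = sorted(o for _, _, o in match(closure, s=EX + "aspirin", p=RDF_TYPE))
print(types)
```

After inference, the query for the types of the hypothetical `aspirin` resource also returns `Chemical`, although that triple was never stated explicitly. This is the kind of machine reasoning the added meaning is intended to enable.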
4 Relationship between Big Data and Semantic Web

The application areas identified by Data Science Series (2012) were considered in order to identify the relationship between the two fields.

4.1 Semantic Link Network for Big Data in Multimedia

Liu et al. (2014) describe an approach to organising multimedia resources using texts and surrounding texts. The aim of the project is to give meaning to different multimedia resources, to allow users to search for related resources, and to let them gain a more comprehensive understanding of a particular resource from its relationships. The authors' main assumption is that manual annotations can be considered a reliable source of semantics. They also mention that ontologies can describe multimedia semantics. The aim of the paper thus becomes to bridge the gap between ontologies and manually provided annotations. The motivation is to provide reasoning that can derive implicit knowledge from information. The derivation of implicit knowledge is useful in many areas, such as surveillance, sports, or the Internet of Things. The Semantic Link Network method is employed to associate relationships between resources. Since everything in the Semantic Web is expressed as triples, as pointed out earlier, the mapping can be accomplished without considerable modification. During the presentation of the results, certain heuristics were applied to filter the underlying assumptions of the model even further. As a result, with the use of ontologies and tags along with textual descriptions, semantic relatedness between multimedia items was achieved accurately and robustly.

4.2 Personalised Medicine with Big Data and Semantic Web Technologies

Panahiazar et al. (2014) consider a patient who requires a personalised health care plan. To meet this requirement, a health care system has to implement a new infrastructure that allows live delivery of patient data directly into the hands of a professional. On the other side of the equation, such an infrastructure would allow health care systems to make better decisions about individual patients based on the data from all patients.
The paper discusses an approach towards personalised health care using Big Data and Semantic Web technologies.
The notion of smart data is introduced in the context of health care as a fusion of Big Data and the Semantic Web. The Big Data part of smart data deals with accessing and processing large volumes of homogeneous and heterogeneous data about every single patient. Since the data is unstructured most of the time, Semantic Web technologies come into play and are used to annotate various concepts.

4.3 Information and Data Sharing in Chemical Sciences

Bird and Frey (2013) provide a detailed rationale for the importance of data and knowledge sharing in the chemical sciences. e-research is a direct consequence of the expansion of the data available to researchers. The more manpower is required to process the available data, the greater the need for distributed collaborations, so that collaborating bodies can tackle the problem of Big Data in the sciences. In addition to needing more workforce, scientists depend upon each other's work more than ever. Single-entry database solutions cannot feasibly accommodate all the research centres and universities; therefore, a distributed approach must be taken. Although a distributed approach can feasibly serve as a common base for all the research happening in one field, additional infrastructure should be in place to allow discovery, browsing, documentation, etc. This would in turn allow the provenance of the data to be tracked, so that the initial baseline can be frozen and no longer changed after it has been shared. For such a system it would be important to use a controlled vocabulary, ensuring that all parties belonging to the system use that vocabulary when describing certain aspects of the research.

4.4 Linking Smart Cities Data

Yet another example comes from Celino et al. (2012), who report on the implementation of an application that engages users to provide information about a city in order to fix inconsistencies in automatic inferences made by reasoning software regarding a specific ontology.
Similar applications have shown that users are willing to provide information if the application follows the GWAP (games with a purpose) paradigm. In other words, the crowd can foster the connection between Big Data and Semantic Web Technologies, given appropriate infrastructure. The authors also note that similar work has been done covering the whole Semantic Web life-cycle rather than only the fine-tuning part.

5 Conclusion

Although Big Data and Semantic Web Technologies can be seen as two different areas of research, both are applied together to certain real-life problems, as described in Section 4. The applications converge on a similar aim, namely to process and give meaning to the Big Data generated by means of embedded technology. In addition, the main focus of the applications of both technologies is knowledge, be it for profit or for the discovery of more knowledge. Therefore, it is fair to say that both areas should progress further by giving meaning to the unstructured, fast, and uncertain data around us.

6 Semantic Web technologies for the big data in life sciences

Wu and Yamaguchi (2014) present a survey of big data in the life sciences with a semantic extension. The paper's aim is to enable investigation of the effects of chemicals on biological systems. Additional data sets are required to accomplish that. The problem emerges when data sources contain different or previously unseen data types and different formats of underlying data. To be usable, such data sources must be integrated, thus eliminating inconsistencies. Data integration requires considerable knowledge about the data in order to find what can be integrated and what cannot or should not be. The authors point out that the main problems in this context, as was also pointed out in Section 2, are the volume and the rate of the generated data. The paper thus discusses how Semantic Web Technologies can solve the general problems of Big Data Management that were outlined in Section 2.

The paper then describes the Semantic Web technologies that were listed previously in Section 3.2, with examples to better illustrate each technology. In addition to those, the paper presents some further ones, for instance linked data, triple stores, and triple stores in the cloud.

Linked data aims to treat all data on the World Wide Web as a single database and to make all the data semantically related in some way. Linked data uses the same basic technologies that were described previously for the Semantic Web. The basic idea is to allow a connectionist approach to the world. A simple example would be giving relevant, related recommendations to users who are viewing a certain part of the web or searching for some particular information.

A triple store is simply a database for triples. To be highly operational, a triple store must allow fast query execution, be scalable, and have a low load cost.

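To make the requirement of fast query execution more concrete, the following is a minimal sketch of an in-memory triple store that keeps one index per triple position, so that a query with a bound subject, predicate, or object does not have to scan every triple. It is an illustration under stated assumptions, not how any particular triple store surveyed in the paper is implemented: the class `TinyTripleStore`, the helper `related`, and the `page:`/`topic:` identifiers are all hypothetical. The helper also illustrates the recommendation example given for linked data.

```python
from collections import defaultdict


class TinyTripleStore:
    """In-memory triple store with per-position indexes (illustrative only)."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)
        self._by_predicate = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, s, p, o):
        """Store a triple once and register it in all three indexes."""
        triple = (s, p, o)
        if triple in self._triples:
            return
        self._triples.add(triple)
        self._by_subject[s].add(triple)
        self._by_predicate[p].add(triple)
        self._by_object[o].add(triple)

    def query(self, s=None, p=None, o=None):
        """Intersect index entries for the bound positions; None is a wildcard."""
        candidates = None
        for value, index in ((s, self._by_subject),
                             (p, self._by_predicate),
                             (o, self._by_object)):
            if value is None:
                continue
            hits = index.get(value, set())
            candidates = hits if candidates is None else candidates & hits
        if candidates is None:  # nothing bound: return everything
            candidates = self._triples
        return sorted(candidates)


def related(store, page):
    """Recommend other pages that share at least one topic with `page`."""
    topics = {o for _, _, o in store.query(s=page, p="about")}
    pages = {s for t in topics for s, _, _ in store.query(p="about", o=t)}
    return sorted(pages - {page})


store = TinyTripleStore()
store.add("page:rdf", "about", "topic:semantic-web")
store.add("page:sparql", "about", "topic:semantic-web")
store.add("page:hadoop", "about", "topic:big-data")

print(related(store, "page:rdf"))
```

The design trades memory for speed: each triple is referenced from three indexes, which raises the load cost slightly but lets lookups start from a small candidate set. Scalability beyond a single machine is outside the scope of this sketch.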
A triple store in the cloud is yet another paradigm, allowing users to connect to a cloud and, from there, use the data or applications that are available on that cloud. Cloud computing can provide such services as Data as a Service, which gives access to current data; Software as a Service, which allows users to use a software instance from the cloud; Platform as a Service, which allows users to use an area on the cloud dedicated to them to execute and test their software; and Infrastructure as a Service, which allows users to utilise the execution power of the servers that host the cloud. The authors then present some examples of technologies that offer triple stores in the cloud for scientific purposes.

One of the major challenges for the field of the Semantic Web is that it was not designed to serve Big Data requirements. To turn things around, additional concepts were introduced into the Semantic Web, such as RDF with RDF Schema and/or OWL on top of RDF. The issue that still persists in the field of the Semantic Web is that its techniques cannot deal with fast and large data. Therefore, external solutions are sometimes employed to accommodate the unpredictability of Big Data.

The authors conclude that effective data processing platforms are needed to process and share the data, especially in research, as was pointed out in Section 4.3. In addition, shared data must remain secure at all times. To increase the performance of Big Data processing, parallel computing seems to provide some solutions, with tools such as Hadoop.

In conclusion, the paper extends the previously discussed relationship between the two fields and is therefore further evidence that the two fields do cooperate when dealing with real-life problems concerning Big Data processing and analysis.

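The parallel processing mentioned above in connection with Hadoop follows the map and reduce pattern, which can be sketched in-process as follows. This sketch does not use Hadoop itself: it assumes a hypothetical set of triples already split into partitions and uses Python's standard `concurrent.futures` thread pool purely to keep the example self-contained. The names `partitions`, `map_partition`, `merge_counts`, and `count_predicates` are illustrative.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical input: triples already split into partitions ("blocks").
partitions = [
    [("a", "treats", "b"), ("a", "type", "Drug")],
    [("c", "type", "Drug"), ("c", "treats", "d")],
    [("e", "type", "Gene")],
]


def map_partition(partition):
    """Map step: count predicates within a single partition."""
    return Counter(predicate for _, predicate, _ in partition)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def count_predicates(parts, workers=3):
    """Run the map step on each partition concurrently, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(map_partition, parts))
    return reduce(merge_counts, partial, Counter())


totals = count_predicates(partitions)
print(totals["type"], totals["treats"])
```

Because each partition is counted independently and the merge step is associative, the same two functions could in principle be run on separate machines. Threads are used here only for brevity and do not give the throughput of a real distributed deployment.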
References

Colin L Bird and Jeremy G Frey. Chemical information matters: an e-research perspective on information and data sharing in the chemical sciences. Chemical Society Reviews, 42(16):6754-76, Aug 2013. ISSN 0306-0012. doi: 10.1039/C3CS60050E.

I Celino, S Contessa, M Corubolo, and D Dell'Aglio. Urbanmatch-linking and improving smart cities data. LDOW, 2012. URL http://planet-data.eu/sites/default/files/publications/ldow2012-paper-10.pdf.

CLP Chen and CY Zhang. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 2014. doi: 10.1016/j.ins.2014.01.015. URL http://www.sciencedirect.com/science/article/pii/S0020025514000346.

Data Science Series. Ten practical big data benefits. 2012. URL http://datascienceseries.com/stories/ten-practical-big-data-benefits.

Y Liu, L Chen, X Luo, L Mei, C Hu, and Z Xu. Semantic link network based model for organizing multimedia big data. IEEE Transactions on..., 2014. URL http://www.computer.org/csdl/trans/ec/preprint/06786371.pdf.

Maryam Panahiazar, Vahid Taslimitehrani, Ashutosh Jadhav, and Jyotishman Pathak. Empowering personalized medicine with big data and semantic web technology: Promises, challenges, and use cases. Proceedings:... IEEE International Conference on Big Data, 2014:790-795, Oct 2014. doi: 10.1109/BigData.2014.7004307.

N Shadbolt, W Hall, and T Berners-Lee. The semantic web revisited. Intelligent Systems, 2006. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1637364.

Hongyan Wu and Atsuko Yamaguchi. Semantic web technologies for the big data in life sciences. BioScience Trends, 8(4):192-201, 2014. doi: 10.5582/bst.2014.01048.