Generating Visualizations From RDF Graphs


Zhuo Ma U
Supervisors: Tom Gedeon, Armin Haller
COMP8715 Computing Project
Australian National University
Semester 1, May,

ACKNOWLEDGEMENT

I would like to express my deepest gratitude to my supervisors, Tom Gedeon and Armin Haller, for their enthusiastic and patient guidance. With their help and support on this research topic, I have gained much new knowledge. I would also like to thank Dr Weifa Liang for teaching us technical writing skills. In addition, I greatly appreciate the help of PhD student Anila Sahar Butt and the support of my family and friends.

ABSTRACT

The RDF language is becoming increasingly significant in research on the Semantic Web. To help users understand this area, advanced methodologies and tools are required to visualize RDF data in a clear and intuitive way. In this project, we have designed a new method called Concept-Matching to visualise RDF graphs, in particular those that contain both schema and data. We processed data from the dbpedia database as an example implementation of this approach and designed an algorithm to retrieve the required data. Moreover, we designed experiments to test the algorithm's efficiency and worked on optimizing the algorithm. Based on the results and analysis, we conclude that the new approach can be implemented successfully with the core algorithm.

Keywords: RDF visualization, Semantic Web, Concept-Matching

CONTENTS

Acknowledgement
Abstract
Introduction
Background
Related Work
  Non-graph based RDF Visualization
  Graph based RDF Visualization
    WebVOWL: Web-based Visualization of Ontologies
    RDF Graphs with LodLive
    VisualRDF: Visual representation of RDF
  Discussion of Related Work
Methodology
  RDF data structure analysis
  Retrieve concept-match information
  Construct Mapping model
    Build Base Layer
    Build higher layers
    Large graph layout - Five point scale
  Algorithm and Time Complexity
Evaluation
  Experiment Environment and Steps
    Experiment
    Experiment
    Testing Environment
  Experiment Results and Analysis
  Optimization
Conclusion and Future Work
  Conclusion
  Future Work
Reference

CHAPTER 1

1 INTRODUCTION

In recent years, data has been gathered from daily life as a general way to represent existing information and knowledge, and it is frequently analysed to assist in making future decisions. The analysis of the web of data has attracted both data researchers and users. RDF, as a language for storing the web of data, can be used in studies of Semantic Web development. However, because of the complex data structure of RDF, experts, let alone casual users, often have difficulty understanding the details of RDF data and employing the information it provides. From a human perspective, the best and friendliest way to recognize and analyse information provided by the Semantic Web is through visualization and exploration. The purpose of visualization is to transform big data into a visual representation that humans can easily understand and interact with. RDF visualization therefore plays a key role in helping users to understand and interact with data.

Many previous works visualize RDF data as a tree-like graph of linked nodes and edges, which is discussed in more detail in the Related Work chapter. While those approaches are well suited to small datasets, for large datasets that may contain billions of triples they produce very complex graphs that are hard for users to manage and understand. To avoid this issue as more and more data is presented to users, some visualization techniques instead show an overall representation of the whole dataset. Although this has merits for users who want to explore the high-level structure of a vast amount of information, it causes problems when users want to discover detailed data. For instance, if users want to know the label and relations of a resource, an overall representation of the data is not suitable.

To overcome these problems and provide a more intuitive, effective and user-oriented visualization of RDF data, we have developed a new visualization approach called Concept-Matching. This approach is combined with a Compound-Fisheye view to visualize RDF data as bubbles of different sizes around a main resource. In this paper, we present related work, describe the methodology in detail and analyze the efficiency of the core algorithm of Concept-Matching. In addition, we designed two experiments to test the time complexity of our algorithm and compared the results with the time spent retrieving real data.

CHAPTER 2

2 BACKGROUND

RDF stands for Resource Description Framework, a metadata-based data model used to store and describe data resources on the web [1]. It stores linked data as triples and uses URIs (Uniform Resource Identifiers) to identify both the relationship and the two ends of each link; it has been widely employed for different purposes by users with a variety of skill levels [2]. RDF is widely used to empower linked data in the development of the Semantic Web, the evolution of the World Wide Web proposed in Tim Berners-Lee's 2001 article [3]. In the humanities, the word semantic refers to the distinctions and similarities between the meanings of words. The term Semantic Web, therefore, refers to a web of meanings. The Semantic Web can be considered a web of data, which provides a common framework for data integration and combination across various applications [4].

The reason for developing the Semantic Web is that different data is stored and controlled by different applications with little communication between them, which leaves the World Wide Web unable to provide precise information in response to users' semantic requests. For example, suppose a user searches the Internet or an application database with the semantic request "find another person who is called Obama". The information retrieved will be based on the individual keywords and will not be precise. This is because the search engine cannot understand the user's request, as the different domains lack integration and consistency. Thus, to deal with more complex search terms, data across different domains needs to be shared and understood, and for this reason the Semantic Web is critical to the Internet revolution.

Figure 1: Evolution on the Web [5]

Figure 1 shows that the Semantic Web, also referred to as Web 3.0, is the evolution and extension of Web 2.0. It aims to link separate data on the Web through URIs (Uniform Resource Identifiers) in order to achieve better search results regardless of language, by sharing and reusing data on the Web [6]. The World Wide Web Consortium (W3C) has developed the Semantic Web Architecture, illustrated in Figure 2, to assist developers in developing the technology.

Figure 2: Semantic Web Architecture in layers [7]

The first layer, Unicode and URI, is used to standardize the character set used on the web and to identify web resources; the higher layers build on the lower layers [8]. The core layer of the Semantic Web is the third one, RDF and RDFS (RDF Schema), which provides the standard format for representing metadata about web resources [8]. Across different websites, data can be merged when it exists in both and linked when it is related, and it eventually grows into a huge data structure, such as an ontology model that is syntactically based on RDF. Because of such complex data structures, the visualization of RDF still has problems, as mentioned in the Introduction. The next chapter reviews related work by others and discusses how our work differs.

CHAPTER 3

3 RELATED WORK

Visualizing linked data has become a popular research area over the last few years. Numerous works have sought better and more intuitive visualizations of ontologies for end users, whether Semantic Web experts or casual users interested in this area [9][10][11]. Meanwhile, many visualization tools have been developed to improve the visualization of RDF data; they are generally classified into two groups, non-graph based and graph based approaches [12].

3.1 NON-GRAPH BASED RDF VISUALIZATION

Non-graph based methods present data in a logical sequence containing facets, categories and subject descriptions, as in Haystack [13] and the mSpace platform [14]. In addition, the article The Pathetic Fallacy of RDF [15] by David and Schraefel argues that these non-graph based approaches perform better than some traditional graph based tools such as RDF-gravity [16] and IsaViz [17], since the graph becomes massive as the RDF data grows. For instance, Figure 3 shows the FOAF (Friend of a Friend) Vocabulary Specification [18] as a table of data.

Figure 3: FOAF Vocabulary specification data table on Protege

Although non-graph based representation may have merits in some respects, we still believe the future lies in graph-based visualization, which gives users a more intuitive feel for the data. Moreover, graph-based methods give a clear representation of overall structure, interrelationships, patterns and trends [19]. The next section therefore discusses the benefits and weaknesses of graph based visualization tools in more detail.

3.2 GRAPH BASED RDF VISUALIZATION

As this area has become a popular research topic in recent years, many tools have been developed for RDF visualization, such as WebVOWL [20], LodLive [21], and the JavaScript based Visual RDF [22]. The next few sections show how the FOAF ontology is displayed by those three tools and explore whether the resulting graph is the best representation.

3.2.1 WEBVOWL: WEB-BASED VISUALIZATION OF ONTOLOGIES

WebVOWL is a standalone application for the user-oriented visualization of ontologies, based on web technologies and the D3 visualization library [20][23]. It uses a force-directed algorithm to lay out the graph and implements the Visual Notation for OWL Ontologies (VOWL), which defines a visual language for ontology visualization [24]. VOWL consists of graphical primitives and a color scheme that form its basic building blocks, shown in Figure 4 (a) and (b) [24].

Figure 4: Graphical primitives and Color scheme for VOWL [25]

Graphical primitives: VOWL uses a set of symbols to depict ontology concepts. Circles represent classes, and labeled arrows represent property relations between resources. An ontology has two kinds of properties: datatype properties, whose values are usually literals, and object properties, whose values are resources identified by URIs. The target of an object property is therefore drawn as a circle, whereas a datatype is depicted as a rectangle.

Color scheme: Many studies show that the choice of color can make it easier for users to interact with different elements. For example, red is often used to attract attention and is therefore used for highlighted elements.

Based on the FOAF vocabulary specification, the graph drawn in VOWL notation with a force-directed layout is presented in Figure 5.

Figure 5: Friend of a Friend (FOAF) visualization in WebVOWL

Advantage
It clearly shows the overall structure of the FOAF ontology.
The basic details, including information about FOAF, metadata and graph statistics, can be found in the sidebar.
When any element in the graph is clicked, the corresponding information, such as its name, type, domain and range, is shown under Selection Details.

Disadvantage
The graph becomes very complex, and useful information becomes hard to find, as the data grows.

3.2.2 RDF GRAPHS WITH LODLIVE

LodLive can browse RDF resources that are stored in a SPARQL endpoint and generate user-oriented graphs using a suitable navigation model over the data [26]. The tool uses a JavaScript application layer, without any application server, to browse a SPARQL endpoint; the results from any configured endpoint are returned in JSON format so that they can be parsed in JavaScript and visualized in an HTML5 web page [26]. LodLive is comprised of five components [26]:
LodLive-core.js: jQuery plug-in
LodLive-profile.js: JSON configuration map
HTML5 page
A few image sprites
Some other public jQuery plug-ins

LodLive operations

First, choose a database and an endpoint, such as the FOAF class, to retrieve the URI and access the details of FOAF.

Figure 6: Single endpoint search panel

After the endpoint request, JSONP is called to generate a central circle representing the main class, surrounded by many small circles representing its object properties.

Figure 7: Central class with surrounding object properties

The object properties can be expanded by the user's interaction with the small circles to display more data, and each new resource is connected to the main class by an arrow representing the value of the given property.

Figure 8: Object Properties in FOAF expanded

Advantage
Uses a dynamic visual graph to traverse RDF data through user interaction
Discovers relations in the linked data step by step
Each resource has a corresponding description that shows its type and comments
Inverse relations between resources are shown with arrows going back and forth

Disadvantage
Does not visualize the whole FOAF ontology
Hard for casual users to understand, since URIs rather than labels appear in each circle
The graph becomes complex and hard to read as more object properties are expanded

3.2.3 VISUALRDF: VISUAL REPRESENTATION OF RDF

VisualRDF was developed by Alangrafu in 2014 [27]. It uses the D3 JavaScript library [28] for data visualization and ARC2 [29] for parsing RDF.

Operations

The tool is easy to operate: it only requires users to type a URI.

Figure 9: Single URI access panel

The overall graph of the FOAF data is then generated automatically.

Figure 10: Overall FOAF visualization graph

A function panel is also provided to help users interact with the graph.

Figure 11: Function panel

The details of each node are displayed when the mouse is moved over it.

Figure 12: Node details graph

Advantage:
Easy for users to operate.
Displays the basic structure of the linked data model automatically.

Disadvantage:
Many intersecting lines.
The relations between classes are vague.
The graph becomes disordered when dealing with large datasets.

3.3 DISCUSSION OF RELATED WORK

Our investigation of related work shows that most visualization tools focus on visualizing the whole ontology, and only a few provide a comprehensive and specific visualization model. All these tools try to present classes and properties in a clear way. However, some major deficiencies remain:
The graph becomes hard to read as the data grows.
Redundant properties are shown by arrows.
There is no clear visualization of individuals.

Most tools implement the Visual Information-Seeking Mantra: "overview first, zoom and filter, then details on demand" [30]. They provide users with an overview of the whole ontology and then allow users to explore each node in turn. To address the deficiencies above, we developed the Concept-Matching approach, which visualizes data as bubbles of different sizes centered around a main class, while the instances are shown as the user explores a bubble in depth. Moreover, to keep the visualization user-oriented, we also apply Compound-Fisheye Views [31] on the tree map to visualize large graphs when the RDF graph to be visualized contains a large number of triples. Another important point is that our approach mainly targets users who have little knowledge of ontologies and RDF, whereas most tools are developed for experts.

CHAPTER 4

4 METHODOLOGY

The purpose of our project is to overcome the shortcomings of the tools discussed in the previous chapter by finding a new approach to visualizing RDF data. To achieve this, we did extensive research, especially on RDF data structure analysis and data visualization approaches. We finally arrived at the idea of using bubbles to represent different types of data and using the Concept-Matching method to restrict the size and content of the bubbles. In this chapter, we use the resource Canberra from the dbpedia database as an example to explore our approach. The first subsection analyses the basic RDF data structure, and the following sections describe the process of building the visualization model. Figure 13 presents the high-level structure of the methodology in this project.

Figure 13: Structure of the Concept-Matching methodology

4.1 RDF DATA STRUCTURE ANALYSIS

This part explores the basic RDF statement and the RDF model, including objects as both literals and resources, and illustrates how a SPARQL query can be used to find the necessary information.

The RDF Statement

Triples

RDF/XML stores data as triples: Subject, Property and Object. For example, a simple sentence The author of is Jan Egil Refsnes has the following triple relation:

Subject (Resource):
Property (Predicate): Author
Object (either literal or resource): Jan Egil Refsnes

Table 1: Subject-Property-Object

We adopt another example generated from the resource Canberra in the dbpedia database, which is shown below as some simple RDF statements:

Figure 14: Simple RDF Statement for Canberra

Interpreting these RDF statements

Subject:
Property:
dbpedia-owl:country refers to
dbpedia-owl:populationtotal refers to
dbpedia-owl:wikipageid refers to
Object:
dbpedia-owl:date refers to September 2011

RDF Namespace URIs

Line 4, xmlns:rdf= , shows the standard W3C namespace, which indicates that the enclosing document is an RDF document, tagged by rdf:RDF. Moreover, the namespace xmlns:dbpedia-owl specifies the elements with the dbpedia-owl prefix.

RDF Model

The set of statements in an RDF document can be viewed as a directed labeled graph, since the data is stored as triples. Resources, both subjects and objects, are represented by nodes, and properties are represented by edges. Figure 15 illustrates the RDF above as such a graph.
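The directed labeled graph view can be made concrete with a minimal Python sketch. The triples, the prefixed names and the helper `build_graph` below are hypothetical and only illustrate the idea; they are not taken from the dbpedia data.

```python
from collections import defaultdict

# Hypothetical triples in (subject, property, object) form. The prefixed
# names stand in for full URIs and the values are illustrative only.
TRIPLES = [
    ("dbpedia:Canberra", "dbpedia-owl:country", "dbpedia:Australia"),
    ("dbpedia:Canberra", "rdfs:label", "Canberra"),
    ("dbpedia:Australian_National_University", "dbpedia-owl:city", "dbpedia:Canberra"),
]


def build_graph(triples):
    """Map each subject node to its outgoing (edge label, target node) pairs."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return graph


graph = build_graph(TRIPLES)
for prop, obj in graph["dbpedia:Canberra"]:
    # Each triple becomes one labeled edge from the subject to the object.
    print(f"dbpedia:Canberra --{prop}--> {obj}")
```

Subjects and objects become nodes and each property becomes the label of one directed edge, which is the structure that the later steps traverse.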

Figure 15: Simple Canberra RDF model

The graph becomes very difficult to read when it is labeled with fully qualified URIs, so we use namespace prefixes in the labels of each node and edge to keep the visualization simple and clear.

SPARQL query

SPARQL (SPARQL Protocol and RDF Query Language) is the W3C recommended query language for RDF [32]. SPARQL is similar to SQL: the SELECT clause chooses which variables are returned, and the WHERE clause finds matches in the queried dataset. For example, we can use the following query to return every person's name in the FOAF database.

Figure 16: Name Return Query on FOAF database
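To illustrate how SELECT and WHERE work together, the following minimal Python sketch evaluates two triple patterns (every resource typed foaf:Person that also has a foaf:name) over a small in-memory list of triples. The sample data and the helper `select_names` are hypothetical, and a real SPARQL engine does considerably more than this.

```python
# Hypothetical in-memory triples; "a" abbreviates rdf:type as in SPARQL.
TRIPLES = [
    ("ex:alice", "a", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", "a", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:acme", "a", "foaf:Organization"),
    ("ex:acme", "foaf:name", "Acme"),
]


def select_names(triples):
    """Return ?name for every ?person that matches both WHERE patterns."""
    # Pattern 1: ?person a foaf:Person
    persons = {s for s, p, o in triples if p == "a" and o == "foaf:Person"}
    # Pattern 2: ?person foaf:name ?name, joined on the shared ?person variable
    return [o for s, p, o in triples if p == "foaf:name" and s in persons]


print(select_names(TRIPLES))
```

The organization's name is not returned because it fails the type pattern, which shows that all patterns in WHERE must hold for the same binding of ?person.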

The query in Figure 16 searches all the triples in the FOAF database and returns each person's name. The SELECT ?name clause requests the values of the variable ?name from the matches found by the WHERE clause. The statements inside WHERE are also in triple format: for example, ?person foaf:name ?name matches every person who has a name, and in the statement ?person a foaf:Person, a is the type predicate.

4.2 RETRIEVE CONCEPT-MATCH INFORMATION

In our Concept-Matching visualization approach, we show only the important concepts related to the resource, as bubbles around the central class, and we use the number of instances of a concept to decide the size of its bubble. Thus, to retrieve the concepts most relevant to the resource, we need the instance count of each concept and the relations between each concept and its sub concepts. We chose the resource Canberra from the dbpedia database and retrieved its instance counts and concept relations, using the Virtuoso SPARQL Query Editor [33] to query the dbpedia database. To get the properties attached to each type and their counts, exported to the file InstanceCountPerType.csv, we wrote the SPARQL query shown in Figure 17:

Figure 17: SPARQL query for Instance Count Per Type of Canberra
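The kind of result stored in InstanceCountPerType.csv can be approximated in memory as follows. This is only a sketch under stated assumptions: the sample links, the type table and the helper `count_instances` are hypothetical and do not reproduce the actual query.

```python
from collections import Counter

# Hypothetical sample: (instance, property) pairs that point at the resource,
# and the concepts (types) of each instance. Values are illustrative only.
LINKS = [
    ("ex:writer1", "birthplace"),
    ("ex:writer2", "birthplace"),
    ("ex:writer1", "deathplace"),
    ("ex:anu", "city"),
]
TYPES = {
    "ex:writer1": ["Person", "Agent"],
    "ex:writer2": ["Person", "Agent"],
    "ex:anu": ["University", "Organization"],
}


def count_instances(links, types):
    """Count the instances attached to each (concept, property) pair."""
    counts = Counter()
    for instance, prop in set(links):  # set() keeps duplicates from counting twice
        for concept in types.get(instance, []):
            counts[(concept, prop)] += 1
    return counts


counts = count_instances(LINKS, TYPES)
for (concept, prop), n in sorted(counts.items()):
    print(concept, prop, n)
```

Each output row has the (concept, property, number of instances) shape that the later ranking steps consume.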

To retrieve the type and subtype relations among the concepts/types related to Canberra in dbpedia, we wrote the SPARQL query shown in Figure 18 and exported the result to the file Concept-Subconcept.csv:

Figure 18: SPARQL query for concept relations of Canberra

4.3 CONSTRUCT MAPPING MODEL

In this section, we build a program that scans both the InstanceCountPerType.csv dataset (a set of triples) and the Concept-Subconcept.csv dataset (a set of concept relations) to find the concepts most relevant to the resource Canberra, from which the different layers of the Canberra visualization are drawn. To clarify what the most relevant concept to the resource Canberra is, we use a simple example:

Figure 19: Canberra-ANU demo

As Figure 19 shows, Canberra has the school ANU, and ANU has the concept University, but University is a sub concept of Organization. Thus, the most relevant concept to Canberra is University. The following sections explain how the most relevant concepts are retrieved from the two datasets of Canberra. The data in the two datasets looks like graph 1 and graph 2 in Appendix A. We separate this process into two stages.

4.3.1 BUILD BASE LAYER

Constructing the base layer requires recursive iteration through the dataset. An overview of the process is shown in Figure 20:

Figure 20: Process model of building base layer

Process 1: Filter process

When we analyzed the set of triples, we found many concepts whose URIs come not only from the dbpedia database but also from other sources. We filtered out all the concepts whose URIs do not start with the dbpedia prefix, since we deal with dbpedia resources. After this process, we have a new dataset that contains only concepts whose URIs start with that prefix.

Process 2: Ranking process (on the InstanceCountPerType.csv dataset)

Situation 1: ranking by the number of instances

First, we rank the rows by number of instances from largest to smallest, as shown in Figure 21:

Figure 21: First 10 lines of data in InstanceCountPerType.csv dataset

If two concepts have the same number of instances and the same property, they are held back so that their relation can be checked. For instance, the concepts Agent and Person have the same number of instances, 186, and the same property, birthplace. The relation between these two concepts is therefore checked, and the program returns the concept that is the sub concept of the other.

Situation 2: ranking by property

If several concepts have the same number of instances but different properties, as shown in Figure 22, we rank them by property.

Figure 22: Concepts with instance count 92
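The filter process and the first two ranking situations can be sketched in Python as below. The namespace in `PREFIX` is a placeholder chosen for illustration, and the rows and the helpers `filter_and_rank` and `tied_groups` are hypothetical; this is not the program used in the project.

```python
# Placeholder namespace for illustration only; not the real dbpedia prefix.
PREFIX = "http://example.org/ontology/"

# (concept URI, property, number of instances); illustrative values only.
ROWS = [
    (PREFIX + "Agent", "birthplace", 186),
    ("http://other.example/Person", "birthplace", 186),
    (PREFIX + "Person", "birthplace", 186),
    (PREFIX + "Person", "deathplace", 124),
    (PREFIX + "Organization", "city", 145),
]


def filter_and_rank(rows, prefix):
    """Keep concepts under the prefix, then sort by count (descending) and property."""
    kept = [row for row in rows if row[0].startswith(prefix)]
    return sorted(kept, key=lambda row: (-row[2], row[1]))


def tied_groups(ranked):
    """Group concepts that share both count and property; these need a relation check."""
    groups = {}
    for concept, prop, count in ranked:
        groups.setdefault((count, prop), []).append(concept)
    return {key: names for key, names in groups.items() if len(names) > 1}


ranked = filter_and_rank(ROWS, PREFIX)
print(ranked)
print(tied_groups(ranked))
```

Groups returned by `tied_groups` correspond to the concepts that wait for their relations to be checked in the concept-relations dataset.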

In Figure 22, two pairs of concepts (Pair 1: PhysicalEntity and CausalAgent; Pair 2: Person and Agent) need to have their relations checked, since the concepts in each pair have the same number of instances and the same property.

Situation 3: more than two concepts with the same number of instances and property

If more than two concepts have the same number of instances and the same property, the relation of each pair of concepts needs to be checked.

Figure 23: Concepts with instance count 145

Figure 23 shows many concepts with the same instance count and property, so each concept needs to be compared with the others. Finally, the program returns any one concept that is a sub concept but not a super concept. For instance, given A->B, C->D, E->F and B->C, where -> stands for "is sub concept of", it returns either A or E.

Process 3: Check concept relations and return the most relevant concept

For each pair of concepts, we scan the concept-relations dataset to check their relation. For instance, in Situation 1 above, the concepts Person and Agent, with the same number of instances and the same property, are waiting for their relation to be checked. By searching the concept relations, the program finds which of the two is the sub concept of the other and returns that concept.

Base Layer demo

After applying the steps described above, we obtain a list of triples (concept, property, number of instances). We add together the numbers of instances of rows with the same concept and keep a record of their properties. For example, the property birthplace has 186 instances for the concept Person and the property deathplace has 124 instances for the concept Person. In this way, we can determine that the concept Person has the largest number of instances and is shown by the largest bubble, and we still keep the record of the relations. Figure 24 shows a demo of the base layer.
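Two of the steps above lend themselves to a short Python sketch: choosing the concepts that are a sub concept but not a super concept, and summing instance counts per concept. The helper names `most_specific` and `aggregate` are hypothetical, and the inputs reuse the A to F example and the Person counts quoted in the text.

```python
def most_specific(concepts, sub_of):
    """Return concepts that are a sub concept of something but a super concept of nothing.

    `sub_of` is a list of (sub concept, super concept) pairs.
    """
    subs = {child for child, _parent in sub_of}
    supers = {parent for _child, parent in sub_of}
    return [c for c in concepts if c in subs and c not in supers]


def aggregate(rows):
    """Sum instance counts per concept and remember the count of each property."""
    totals = {}
    for concept, prop, count in rows:
        entry = totals.setdefault(concept, {"total": 0, "properties": {}})
        entry["total"] += count
        entry["properties"][prop] = count
    return totals


relations = [("A", "B"), ("C", "D"), ("E", "F"), ("B", "C")]
print(most_specific(list("ABCDEF"), relations))  # candidates: A and E

rows = [("Person", "birthplace", 186), ("Person", "deathplace", 124)]
print(aggregate(rows)["Person"]["total"])  # 186 + 124
```

With the Person rows from the text, the aggregated total is 310, which is the value that determines the size of the Person bubble.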

Figure 24: Base layer demo

The concepts that remain after filtering are the most relevant and important to Canberra and are arranged by how many instances they have. The black dots between the concept labels Organization and Dom indicate that many bubbles are omitted in this demo. At the base layer, if a bubble can still be expanded, we show neither its property arrows nor its instances; these are shown only for bubbles that cannot be expanded any further. This is discussed further in the next section.

4.3.2 BUILD HIGHER LAYERS

As mentioned above, if a bubble around Canberra in the base layer can be expanded further, it goes through the process of building its higher layers. To illustrate this, we expand the concept Person. When a user clicks the Person bubble, its sub concepts are shown as bubbles of various sizes around it. We built a recursive method to expand the higher layers, as follows.

Step 1: find sub concepts

The program searches the concept relations to find all the concepts that are sub concepts of Person, which are shown in Table 2.

SubConcept Concept

Table 2: SubConcepts of Concept Person

Step 2: find the number of instances

We then go back to the dataset InstanceCountPerType.csv to look up how many instances those sub concepts have, in order to decide the size of the bubbles around Person; at the same time, the properties and relations are recorded for visualizing the instances. We then run the program to get the triple relations shown in Appendix B. The total number of instances for each concept is:

Concept/Type Total No of Instance

Table 3: Total number of instance for sub concepts of concept Person

Step 3: visualize Person

After retrieving this data, the higher layer for the concept Person is generated based on the numbers of instances, as shown in Figure 25.

Figure 25: Higher layer of Person demo

Recursive step

After reaching the second level below the resource Canberra, we check whether the sub concepts of Person can be expanded further. If a concept represented by a bubble can be expanded to the next level, the above process is repeated to determine which concepts are involved, using pointers to record the properties and instances. The instances and properties are shown once a concept has no more sub concepts.

Figure 26: Higher level of concept Artist
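The recursive expansion of higher layers can be sketched as a small Python function. The hierarchy and totals below are hypothetical sample data (only the Person, Artist and Writer names come from the text), the helper `expand` is illustrative, and the sketch assumes that the concept relations contain no cycles.

```python
# Hypothetical concept hierarchy and instance totals; illustrative only.
SUB_CONCEPTS = {
    "Person": ["Artist", "Athlete"],
    "Artist": ["Writer"],
}
TOTALS = {"Person": 310, "Artist": 40, "Athlete": 25, "Writer": 12}


def expand(concept, sub_concepts, totals):
    """Build the nested bubble layers below `concept` (assumes an acyclic hierarchy)."""
    children = sub_concepts.get(concept, [])
    return {
        "concept": concept,
        "size": totals.get(concept, 0),  # bubble size follows the instance count
        "leaf": not children,            # leaves show properties and instances
        "children": [expand(child, sub_concepts, totals) for child in children],
    }


tree = expand("Person", SUB_CONCEPTS, TOTALS)
print(tree["children"][0]["children"][0])
```

A bubble marked as a leaf is where the properties and instances would be drawn instead of further sub concept bubbles.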

For instance, Figure 26 illustrates the next level of the concept Artist, where we found that the concept Writer has no further sub concepts. Thus, when a user clicks on Writer, no more sub concepts are shown around the Writer bubble; instead, the properties and instances that have the concept Writer are shown by arrows and rectangles. The instances are retrieved for the resource Canberra in the dbpedia database. The graph looks like:

Figure 27: The instances with concept Writer

The Place of Death relation means that Bryce Courtenay, who was a Writer, died in Canberra. We use an asterisk to indicate that more than one instance is connected with Canberra, and show the instance directly if only one exists. This method is briefly explained in [12]. Since we use Compound-Fisheye Views [31], the other bubbles become far smaller than the one the user is focusing on.

4.3.3 LARGE GRAPH LAYOUT - FIVE POINT SCALE

When the data becomes large, for example with a vast number of concepts, we use a simple ranking method to restrict the graph layout.

1) According to the first letter of each concept's label, we separate the concepts into five bubbles, as in Figure 28:

Figure 28: Base layer by character

2) When a user clicks the bubble labeled A-E, the labels of the concepts that start with the letters A to E are shown.

Figure 29: A-E graph expanded

3) If there are still many concepts in the bubble A (such as 20), we compare the concepts' second letters and separate them into another five bubbles, as shown in Figure 30 for when a user clicks the bubble A:

Figure 30: A graph expanded
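The five point scale can be sketched as a simple bucketing function. Only the range A-E is named in the text, so the other four ranges in `RANGES` are an assumption, as are the helper names and the sample concept labels.

```python
import string

# Assumed ranges: only "A-E" appears in the text; the other four are a guess.
RANGES = ["A-E", "F-J", "K-O", "P-T", "U-Z"]


def bucket_label(name, position=0):
    """Return the range label for the letter of `name` at `position`."""
    letter = name[position].upper() if len(name) > position else "A"
    index = string.ascii_uppercase.find(letter)
    if index < 0:  # non-letters fall into the first bubble
        index = 0
    return RANGES[min(index // 5, 4)]


def five_point_scale(names, position=0):
    """Split concept labels into five bubbles by the letter at `position`."""
    groups = {label: [] for label in RANGES}
    for name in sorted(names):
        groups[bucket_label(name, position)].append(name)
    return groups


first_level = five_point_scale(["Agent", "Artist", "Person", "Writer", "Organization"])
second_level = five_point_scale(["Agent", "Artist", "Album"], position=1)
print(first_level)
print(second_level)
```

Calling the function again with `position=1` on the contents of a crowded bubble corresponds to step 3, where the second letter decides the next five bubbles.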

This graph layout method, combined with the Compound-Fisheye view technique, works well for visualizing large datasets.

4.4 ALGORITHM AND TIME COMPLEXITY

Since we are dealing with a huge RDF dataset, an efficient algorithm has to be designed to reduce the time needed to find the data relations. To implement the approach described above on the real RDF dataset for Canberra, we first considered a hash map with lists as values: the hash key recorded the number of instances and the hash value was a list of the concepts with that count, and we then traversed the Concept-Subconcept dataset for each list to find relations. However, on the real data this took a very long time to produce the result. Therefore, we designed a completely different algorithm, which is explained in the next part.

Algorithm

We tried different ways to reduce the time complexity. Finally, by comparing the efficiency of different algorithms, we designed an algorithm with the following steps:
Construct a directed graph from the Concept-Subconcept relations (the concept-relation graph).
Sort the instance data with quick sort, chosen with both time and space complexity in mind.
Use a breadth-first search (BFS) to search the graph for the required concept.

The pseudo-code is shown below:

Pseudo-code for ranking instance data

RANKING(Relation_DATA, Instance_DATA, p, r)
    if p < r
        then q ← PARTITION(Relation_DATA, Instance_DATA, p, r)
            RANKING(Relation_DATA, Instance_DATA, p, q-1)
            RANKING(Relation_DATA, Instance_DATA, q+1, r)

RANKING() modifies quick sort to rank the given set of triples. We modified the partition-exchange step of quick sort as follows:

Pseudo-code for partition-exchange instance data

PARTITION(Relation_DATA, Instance_DATA, p, r)
    x ← Instance_DATA[r]
    i ← p-1
    for j ← p to r
        do remove ← 0
            if ISGREATER(Relation_DATA, Instance_DATA[j], x, remove)
                then i ← i+1
                    exchange Instance_DATA[i] <-> Instance_DATA[j]
            if remove = 2
                then removelist ← removelist ∪ {Instance_DATA[j]}
            else if remove = 1
                then removelist ← removelist ∪ {x}
    remove ← 0
    ISGREATER(Relation_DATA, x, Instance_DATA[i+1], remove)
    exchange Instance_DATA[i+1] <-> Instance_DATA[r]
    if remove = 2
        then removelist ← removelist ∪ {Instance_DATA[i+1]}
    return i+1

The comparison proceeds in three steps:
First, compare the number of instances: return true if data1.no > data2.no, return false if it is less, and move to the next step if they are equal.
Second, compare the properties when the numbers of instances are the same: return false if the properties differ, and go to the next step if they are the same.
Last, compare the relation of the concepts when they have the same property and the same number of instances; BFS (breadth-first search) is called in this step to search the concept-relation graph.

Pseudo-code for instance and property comparison

ISGREATER(Relation_DATA, data1, data2, remove)
    if data1.noinstance > data2.noinstance                  ## compare No. of instance
        then return true
    else if data1.noinstance < data2.noinstance
        then return false
    else if data1.property != data2.property                ## compare property if No. of instance are same
        then return false
    else if BFS(Relation_DATA, data1.concp, data2.concp)    ## compare concept, if property are same
        then remove ← 2
            return true
    else return false

We modify the breadth-first search algorithm to search the concept graph for the required concept.

Pseudo-code for searching concept-relation graph

BFS(Relation_DATA, data1, data2)
    for each vertex u ∈ Relation_DATA[G] - {data1}
        do color[u] ← WHITE
    color[data1] ← GRAY
    Q ← ∅
    ENQUEUE(Q, data1)
    while Q ≠ ∅
        do u ← DEQUEUE(Q)
           for each v ∈ Adj[u]
               do if color[v] = WHITE
                      then color[v] ← GRAY
                           if v = data2
                               then return true
                           ENQUEUE(Q, v)
           color[u] ← BLACK
    return false

Complexity Analysis

The first step, building a directed graph from the Concept-Subconcept dataset, costs O(n), where n is the number of lines, since we read this dataset line by line. For each line, an edge is created between two nodes; a node is added before the edge if it does not yet exist.

To rank the items by number of instances and property in the InstanceCountPerType dataset, we chose Quicksort, considering both time complexity and space complexity. Quicksort runs in O(n log n) time on average and O(n^2) in the worst case. Merge sort and Heapsort guarantee O(n log n) time even in the worst case, but Merge sort needs an extra O(n) array in addition to its O(log n) recursion stack, whereas Quicksort sorts in place and needs only O(log n) stack space on average (and also in the worst case if the smaller partition is always processed first). Heapsort also sorts in place with O(1) auxiliary space, but it is typically slower than Quicksort in practice. Thus, compared with Merge sort, Quicksort saves a considerable amount of memory, especially on large datasets.

During Quicksort, every comparison that reaches the concept level also calls Breadth-first search to find the relation between the two concepts. BFS takes O(|V| + |E|) time in the worst case, where V is the set of vertices and E is the set of edges. In the Concept-Subconcept dataset, V is the set of concepts and E is the set of concept relations.

Therefore, the total time complexity of the algorithm that retrieves the required data is O((|V| + |E|) * n log n), where n is now the number of triples being sorted.
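The ranking steps described in this section can be sketched in Python as follows. This is a minimal illustration and not the project's Java implementation: the `Row` field names, the helper names, and the sample concepts are assumptions, Python's built-in sort stands in for the hand-written Quicksort, and the removelist bookkeeping is left out.

```python
from collections import deque
from functools import cmp_to_key
from typing import NamedTuple


class Row(NamedTuple):
    """One InstanceCountPerType entry (field names are illustrative)."""
    concept: str
    prop: str
    count: int


def build_graph(relations):
    """Directed concept-relation graph: subconcept -> set of direct concepts."""
    graph = {}
    for sub, sup in relations:
        graph.setdefault(sub, set()).add(sup)
        graph.setdefault(sup, set())
    return graph


def is_subconcept(graph, start, target):
    """BFS from start; True if target is reachable along subconcept -> concept edges."""
    seen = {start}  # plays the role of the GRAY/BLACK colouring
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                if v == target:
                    return True
                seen.add(v)
                queue.append(v)
    return False


def compare(graph, a, b):
    """Negative if a ranks before b, positive if after, 0 if undecided."""
    if a.count != b.count:  # 1. more instances rank first
        return -1 if a.count > b.count else 1
    if a.prop != b.prop:  # 2. different property: no order imposed
        return 0
    if is_subconcept(graph, a.concept, b.concept):  # 3. concept relation via BFS
        return -1
    if is_subconcept(graph, b.concept, a.concept):
        return 1
    return 0


def rank(graph, rows):
    """Sort rows with the three-level comparison (built-in sort, not Quicksort)."""
    return sorted(rows, key=cmp_to_key(lambda a, b: compare(graph, a, b)))


graph = build_graph([("City", "Place"), ("Place", "Thing")])
rows = [Row("Place", "type", 10), Row("City", "type", 10), Row("Person", "type", 50)]
print([r.concept for r in rank(graph, rows)])  # ['Person', 'City', 'Place']
```

Each comparison that falls through to the third level triggers a graph search, which is where the (|V| + |E|) factor in the overall bound comes from.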

CHAPTER 5

5 EVALUATION

When we used real data to test this visualization approach, we found that the efficiency of the algorithm was the most difficult problem to overcome when dealing with a large dataset. To test the usability of the algorithm described above, we designed two controlled experiments. The next few sections explain the details of the experiments, analyse their results and illustrate ways to optimize the algorithm.

5.1 EXPERIMENT ENVIRONMENT AND STEPS

We first take a brief look at what the graph of Concept-Subconcept relations looks like.

Example:

Subconcept   A  B  B  D  C  C  E  F
Concept      B  C  D  F  F  E  G  G

Figure 31: Directed graph

When the number of concepts and subconcepts becomes large, the relations become very complex. At the same time, the time needed for BFS to traverse the graph also grows. We designed two controlled experiments and used experimental datasets to measure the time consumption when increasing the number of concept relations in Concept-Subconcept and the number of triples in InstanceCountPerType respectively.
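A controlled run of this kind can be sketched as a small timing harness. The sketch below is only an illustration under stated assumptions: the concept names, the random generator, the seeds, and the sizes are hypothetical and are not the project's generated datasets, and only the BFS queries are timed rather than the full ranking.

```python
import random
import time
from collections import deque


def random_relations(concepts, n_relations, seed=0):
    """Draw n_relations distinct (subconcept, concept) pairs at random."""
    if n_relations > len(concepts) * (len(concepts) - 1):
        raise ValueError("more relations requested than distinct pairs exist")
    rng = random.Random(seed)
    relations = set()
    while len(relations) < n_relations:
        sub, sup = rng.sample(concepts, 2)  # two different concepts
        relations.add((sub, sup))
    return sorted(relations)


def reachable(graph, start, target):
    """BFS over subconcept -> concept edges; the seen set guards against cycles."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                if v == target:
                    return True
                seen.add(v)
                queue.append(v)
    return False


def time_queries(relations, queries):
    """Build the graph, run every query, and return (seconds, number of hits)."""
    graph = {}
    for sub, sup in relations:
        graph.setdefault(sub, []).append(sup)
    begin = time.perf_counter()
    hits = sum(reachable(graph, a, b) for a, b in queries)
    return time.perf_counter() - begin, hits


concepts = [f"C{i}" for i in range(100)]
query_rng = random.Random(1)
queries = [tuple(query_rng.sample(concepts, 2)) for _ in range(200)]
for n_relations in range(200, 1001, 200):
    elapsed, hits = time_queries(random_relations(concepts, n_relations), queries)
    print(n_relations, hits, f"{elapsed:.4f}s")
```

Fixing the seeds keeps the generated relations and queries identical between runs, so differences in the measured time come from the changed dataset size rather than from different random data.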

5.1.1 EXPERIMENT 1

We kept the number of triples (concept/type, property and number of instances) in the InstanceCountPerType dataset constant at 5000 triples, while continuously increasing the number of lines in the Concept-Subconcept dataset by adding 100 lines at a time, starting from 200. We measured the time cost as the number of relations increased, using the following steps:

1. Build 5000 triples using a recursive function; a sample is shown in Appendix C. Generate different random relations between the concepts from those triples, as shown in Appendix B.
2. For the first trial, generate 200 relations.
3. Run the program on these two datasets and record the time it takes.
4. Keep the number of triples fixed, increase the number of relations by 200 and record the time consumed.

5.1.2 EXPERIMENT 2

We reused the dataset generated in experiment 1 for experiment 2 as follows:

We kept the number of lines (relations) in the Concept-Subconcept dataset constant at 1000 different lines.
We increased the number of triples in the InstanceCountPerType dataset in steps of 500, starting from 500.
We recorded the time cost at each point (500, 1000, 1500, ...).

5.2 TESTING ENVIRONMENT

The experiments were run with a Java program on a Mac system with the following specifications.

Hardware / Software        Information
Eclipse                    Standard Version 1.0
Java SE Development Kit    Jdk1.7.0_51
OSX                        Yosemite Version
Processor                  2.4 GHz Intel Core i5
Memory                     4GB 1333 MHz DDR3
Graphics                   Intel HD Graphics MB

Table 4: Testing environment

5.3 EXPERIMENT RESULTS AND ANALYSIS

In experiment 1, it takes a long time to produce the final results when the number of relations becomes very large. By comparison, experiment 2 produces its final results in a shorter time even as the data increases. The results are shown in the following figures:

Figure 32: Result of experiment 1

Figure 33: Result of experiment 2

The results of the two experiments show that the time cost in experiment 1 increases faster than in experiment 2 as the amount of data increases. In experiment 1, as the number of relations (lines) increases, the time cost follows what looks like an exponential growth pattern. In experiment 2, as the number of triples (concept, property and number of instances) increases, the time cost grows linearly. When we test our algorithm with a real dataset, Canberra (5760 triples and 1674 lines of concept relations), the result is as shown in Figure 34.

Figure 34: Time cost for running Canberra

The time consumed on the real dataset matches the time cost in the experiments.

Although a few data points deviate, possibly because of varying CPU load, the trend of the results still agrees with the time complexity of our algorithm, O((|V| + |E|) * n log n), where V is the set of vertices, E is the set of edges and n is the number of triples. We can evaluate this bound for each experiment. In the worst case for building the graph, |E| = |V|(|V| - 1)/2.

For experiment 1, n is a constant C_1:

\[
O\big((|V| + |E|)\, n \log n\big)
= O\Big(\Big(|V| + \frac{|V|(|V| - 1)}{2}\Big)\, C_1 \log C_1\Big)
= O\Big(\frac{1}{2}\big(|V|^2 + |V|\big)\, C_1 \log C_1\Big)
= O(M_1 \log C_1),
\]

where M_1 = \frac{1}{2}(|V|^2 + |V|)\, C_1. As a function of the number of concepts |V|, this bound is quadratic rather than strictly exponential; it still describes the steep, faster-than-linear growth observed in experiment 1.

For experiment 2, the number of relations is a constant, so |V| + |E| = C_2:

\[
O\big((|V| + |E|)\, n \log n\big) = O(C_2\, n \log n).
\]

The time cost grows as n log n, which is close to linear over the tested range.

Therefore, the measured running times agree with the time complexity of our algorithm.

Special situation

However, the time consumption is still high, because colouring vertices in BFS takes most of the time when searching the directed graph. Colouring vertices is necessary when the directed graph contains cycles, such as the cycle B-C-F-D-B in Figure 35:

Figure 35: Cycle in the directed graph

If the network contains no relation in which one concept is both a sub-concept and a super-concept of another (that is, no cycle), the colouring steps can be skipped. Then, the time cost will be:

Figure 36: Time cost without using colouring

To keep the approach general, we considered all possible relations, including cycles. Thus, we pruned the algorithm to reduce how often BFS is called and so improve its efficiency. The next section describes how pruning is used to optimize our algorithm.

5.4 OPTIMIZATION

We used pruning to reduce the number of search steps, in particular the number of BFS calls. There are three main steps:

1. Create another graph, called the no-relation graph, to hold sets of unrelated vertices.
2. Update the relation graph.
3. Avoid searching from a vertex that has no adjacent vertex.

Step 1

We created another graph that records each concept as a vertex and, as its adjacent vertices, all other concepts found to have no relation with it. This graph is filled in each time BFS traverses the directed graph to compare a pair of concepts. Before calling BFS to check whether two concepts are related, we first check whether the pair is included in the no-relation graph. If it appears there as a vertex and one of its adjacent vertices, BFS does not need to be called and the program can return that there is no relation between the two concepts. For example, when BFS searches the graph in Figure 31 to check whether C is a sub-concept of B, it has to traverse every vertex reachable from C, finally returns False, and generates the no-relation graph for concept C shown in Figure 37.

Figure 37: No relations to concept C

When checking whether C is a sub-concept of A, the program does not need BFS to search the graph; instead, it checks the no-relation graph first, finds that A is an adjacent vertex of C there, and returns False. This substantially reduces the number of BFS calls.

Step 2

When comparing two concepts (A and B) by calling BFS, the program checks whether concept A has a super-concept at a distance of more than 1 from A. If it does, the original graph is updated by adding that super-concept as an adjacent vertex of A. For example, in Figure 31, when we check whether C is a sub-concept of B, the program calls BFS to search the directed graph and updates the original graph to:

Figure 38: Updated graph for vertex C

This saves BFS path searches later, for instance when comparing the relation between C and G.

Step 3

When checking whether concept 1 is a sub-concept of concept 2, the program first checks whether concept 1 has any adjacent vertex in the directed concept-relation graph. If concept 1 has no adjacent vertex, it has no super-concept, so the program returns False directly without calling BFS. For example, based on Figure 31, if the program needs to check whether concept G is a sub-concept of F, it returns False directly without calling BFS, because vertex G has no adjacent vertex in the directed graph, which means that G has no super-concept.

These three steps are the core of the pruning method, whose strategy is to reduce the number of BFS calls. Since each BFS search consumes a large amount of time, calling BFS less often contributes greatly to reducing the running time. We applied these pruning steps to optimize our algorithm. The details of the updated algorithm and the result of testing the Canberra dataset can be found in Appendix D.
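One possible reading of the three pruning steps is sketched below in Python. It is not the updated algorithm from Appendix D: the class and attribute names are hypothetical, and for simplicity the sketch explores everything reachable from the start concept on each cache miss instead of stopping at the first hit. The edges are those of the example graph in Figure 31.

```python
from collections import deque


class PrunedReachability:
    """Sub-concept checks with the three pruning steps wrapped around BFS."""

    def __init__(self, relations):
        self.graph = {}  # subconcept -> set of known super-concepts
        for sub, sup in relations:
            self.graph.setdefault(sub, set()).add(sup)
            self.graph.setdefault(sup, set())
        self.no_relation = {}  # concept -> concepts known not to be its super-concepts
        self.bfs_calls = 0

    def is_subconcept(self, start, target):
        # Step 3: a vertex without adjacent vertices has no super-concept.
        if not self.graph.get(start):
            return False
        # Step 1: consult the no-relation graph before searching.
        if target in self.no_relation.get(start, ()):
            return False
        # Step 2 (lookup): shortcut edges added earlier answer directly.
        if target in self.graph[start]:
            return True
        self.bfs_calls += 1
        seen = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in self.graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        reached = seen - {start}
        # Step 2 (update): every reachable super-concept becomes adjacent to start.
        self.graph[start] |= reached
        # Step 1 (update): every unreached concept goes into the no-relation graph.
        self.no_relation.setdefault(start, set()).update(set(self.graph) - seen)
        return target in reached


edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "F"),
         ("C", "F"), ("C", "E"), ("E", "G"), ("F", "G")]
checker = PrunedReachability(edges)
print(checker.is_subconcept("C", "B"))  # False, found by one BFS
print(checker.is_subconcept("C", "A"))  # False, answered from the no-relation graph
print(checker.is_subconcept("C", "G"))  # True, answered from the shortcut edge
print(checker.is_subconcept("G", "F"))  # False, G has no adjacent vertex
print(checker.bfs_calls)                # 1
```

Because the added shortcut edges only connect a concept to super-concepts it could already reach, they do not change the answer of any later query; they only shorten the search.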

CHAPTER 6

6 CONCLUSION AND FUTURE WORK

6.1 CONCLUSION

This project, generating visualizations from RDF graphs, explored a method to visualize RDF graphs, in particular those that contain both schema and data. We started from scratch and did extensive research on RDF data structures and data visualization. Most previous work on RDF visualization shares the same major defect: the graph becomes disordered and hard to read as the size of the data grows. To overcome this shortcoming, we developed a new approach, Concept-Matching, that uses bubbles to represent RDF data and uses the importance of the data to decide the size and position of the bubbles.

In our approach, we found that one of the most difficult tasks is to implement an efficient algorithm to retrieve the data for this method, since RDF datasets are usually very large. We combined a graph layout algorithm, Quicksort and Breadth-first search to improve the efficiency of retrieving data. From our experiments, we discovered:

Experiment 1: When the number of triples stays the same, the time cost shows a steep, exponential-looking growth as the number of concept relations increases. In this situation, the algorithm we developed is only suitable for small datasets and does not work properly for large datasets.

Experiment 2: When the number of concept relations stays the same, the time cost grows linearly as the number of triples increases. In this situation, the algorithm works properly for both small and large datasets.

We still found the time cost of retrieving data to be quite high, so we applied pruning to decrease the number of BFS calls; as a result, the time consumption was reduced. In conclusion, although the algorithm is not as fast as we expected, the new Concept-Matching approach can still be a good way to visualize large RDF datasets clearly.
6.2 FUTURE WORK

The experiments show that the method works properly, but we still need to survey a variety of users with HCI experiments. In future work, we would first like to design Human-Computer Interaction experiments to test the usability of the Concept-Matching approach and the effectiveness of the graph layout, including the Five Point Scale approach. We can focus mainly on casual users and gather more data on how they experience using this method to visualize RDF data.

Moreover, if we ignore cyclic relations, we do not need to colour vertices while calling BFS. In our tests, the running time dropped to less than a few seconds this way. With this in mind, we would like to design specific experiments to test which kinds of data require colouring and which kinds of data can ignore cyclic relations.

Finally, we would like to develop a visualization tool that implements this approach.
