JCR or RDBMS why, when, how?


Bertil Chapuis
12/31/2008
Creative Commons Attribution 2.5 Switzerland License

This paper compares Java content repositories (JCR) and relational database management systems (RDBMS). The choice between these technologies is often made arbitrarily. The aim is to clarify why this choice should be discussed, when one technology should be selected instead of the other, and how the selected technology should be used. Four levels (data model, specification, project, product) are analyzed to show the impact of this choice at different scopes. A discussion of the best choice depending on the context follows. This lays the foundations of a decision framework.

Table of Contents

Introduction
    What is compared?
    Why is it comparable?
    What is the purpose of this comparison?
    How will it be compared?
State of the art
    Roles
    Domains of responsibility
    Data Models
Data model comparison
    Model Definitions
    Structure
    Integrity
    Operations and queries
    Navigation
    Synthesis
Specification comparison
    Use Case Definition
    Structure
    Integrity
    Operations and queries
    Navigation
    Transactions
    Inheritance
    Access Control
    Events
    Version control
    Synthesis
Development process comparison
    Data Understandability
    Coding Efficiency
    Application Changeability
    Synthesis
Product comparison
    Theoretical analysis
    Benchmark
    Synthesis
Scenario Analysis
    Survey
    Reservation
    Content management
    Workflow
Conclusion
Appendix JCR and design
    Model
    Convention
    Methodology
    Application
Appendix Going further
    Queries in semi-structured models
    Queries on transitive relationships
    Modular and configurable databases
Bibliography

University of Lausanne & Day Software AG JCR or RDBMS

1 Introduction

Day Software AG (Day) led the development of a Java specification which defines a uniform application programming interface (API) to manage content. This specification is called the Content Repository API for Java (JCR) and is part of the Java Community Process. Implementations of this specification are currently provided by well-known companies such as Oracle, Day or Alfresco. JCR implementations are often used to build high-level content management systems and collaborative applications. Day also provides an open source implementation of the specification which is called Jackrabbit and which is used as a shell for some of its products.

This diploma thesis takes place in this context. Day wants to clarify some points which relate to the data model promoted by its specification. The basic idea is to compare, at different levels, its approach to managing content with the approach promoted by competitors. The following sections will clarify the approach adopted and give an overview of the content of this report.

1.1 What is compared?

As explained, the purpose is to locate JCR in the database world. This will be done by comparing the relational model with the model promoted by JCR. The relational model, defined by Codd in the 1970s, is currently the most widely used data model. The unstructured or semi-structured model underlying the JCR specification is meeting with growing success in the content management area. These two models will be described and analyzed in this report.

1.2 Why is it comparable?

Each data model supports a philosophy for structuring and accessing data. On the one hand, the success of the relational model comes in large part from the facilities it offers to describe clear data structures. On the other hand, the success of the JCR specification relates essentially to the facilities it offers to express flexible data structures.
These aspects show that the discussion takes place at the same level. Thus, it makes sense to compare the two models and to clarify their respective possibilities and limits. It also makes sense to give a clear picture of the philosophy promoted by each model.

1.3 What is the purpose of this comparison?

By making this comparison, Day wants to position more precisely the data model, the specification and the products which relate to JCR. This should help people to better understand the main offerings available on the market and show when it makes sense to use them.

More precisely, from an external perspective, the goal is to give clear advice which can help people choose the approach that best fits their needs. Some people ask whether their applications should be implemented with a relational database or a Java content repository. Clarifying the philosophy promoted by each model could thus help in making good decisions and in understanding the impact of a choice made at the data model level.

From an internal perspective, some questions relate to how a Java content repository should be implemented. Some companies do this on top of relational databases, while others provide native implementations of the model. Should JCR be seen as a data model or as an abstraction layer over an existing data model? Answering this kind of question can have a strong effect on the future implementation of the products and on the best way to promote them.

1.4 How will it be compared?

First of all, the chapter State of the art will give a snapshot of the main data models which have been described and used during the last four decades. The purpose is to identify the main influences which have led to the current market environment. The goal is also to understand why some data models have been successful and why others have not.

Then the comparison between the relational approach and the JCR approach will start. Because the two approaches show big differences at four levels, these are the ones we will examine and compare, thus avoiding unnecessary discussion of incomparable aspects.

The chapter Data model comparison will be the first level of comparison. In this chapter, the two models, namely the relational model and the model used by JCR, will be formally defined. This should help the reader to understand the theoretical concepts behind each model. The purpose of this chapter is also to show the impact of these theoretical aspects on real-world problems and to help people understand more clearly why they should use one approach instead of the other to solve their problem.

The chapter Specification comparison will be the second level of comparison. This chapter will leave the theoretical point of view for a more practical perspective. The SQL standard and the JCR specification will be compared more precisely in this chapter.
This will allow us to show in practice in which context the concepts described in the data model comparison make sense. Some differences which relate more to the definition of the specifications will also be pointed out.

The chapter Project process comparison will be the highest level addressed in this report. On the basis of the previous chapters, it will discuss the aspects and notable advantages which can significantly influence the development process. This discussion will try to clarify parameters such as the efficiency reached with each approach.

The chapter Product comparison will discuss the impact of data models on the products. The performance question constantly arises at the product level. This chapter will address this question with a theoretical cost analysis and a practical benchmark.

The Scenario analysis chapter can be seen as a synthesis of the main aspects pointed out during the whole comparison. Four test cases characterized by different features will be analyzed with regard to the significant aspects presented in this report. The purpose is to lay the foundations of a framework which helps in choosing the best approach through a quick requirements analysis.

Appendices are also included in this document. They contain aspects which are not directly linked to the comparison but which are of interest to readers who would like to study the subject further.

2 State of the art

The need to separate information from applications became clear in the 1960s, when many applications had to access the same set of information. This separation gave birth to new concepts and new roles related to the activity of managing information. This chapter will clarify the main roles and the main domains of responsibility linked to information management. Some of the main approaches used to handle information will also be presented. The idea is to build a common language for the following chapters.

2.1 Roles

People are generally involved in information systems and data management. Three main roles can almost always be distinguished when data models and databases are mentioned:

- The database administrator (DBA), who maintains the database in a usable state.
- The application programmer, who writes applications which may access databases.
- The user, who uses applications to access, edit and write data in the database.

Each role generally carries certain responsibilities. Several domains of responsibility come from disciplines such as design, development or security. Examples of domains are the structure, the integrity, the availability or the confidentiality of data. Choosing a data model impacts these roles by giving them more or less responsibility.

2.2 Domains of responsibility

Figure 2.2-1 shows four main domains of responsibility which will be mentioned regularly in this report. This role/responsibility diagram reflects the classical distribution which is generally made when relational databases or similar approaches are used to manage data.
Figure 2.2-1: Classical responsibility repartition

The WordNet semantic lexicon gives the following definitions of the concepts identified as domains of responsibility in Figure 2.2-1:

- Content: everything that is included in a collection and that is held or included in something
- Structure: the manner of construction of something and the arrangement of its parts
- Integrity: an undivided or unbroken completeness
- Coherence: logical and orderly and consistent relation of parts

Content and structure are relatively clear concepts. However, in the context of this report, it is worth being precise about the meaning given to integrity and to coherence. Integrity here relates to the state of completeness of data, which always has to be ensured in the database. This state is preserved with integrity rules at the database level. Coherence relates to the logical organization of data and its quality. Coherence can be ensured with

constraints at the database level but also programmatically at the application level. For several reasons, incoherence can be tolerated in the database for a time. This is not the case for integrity.

Choosing a data model has an impact on the distribution of responsibilities in different ways. This report will try to detail this impact and show the consequences of these kinds of choices for the different roles.

2.3 Data Models

A data model should be seen as a way to logically organize, link and access content. Since the 1960s, data models have appeared and disappeared for various reasons. This section gives a brief overview of the history of the main data models and of the reasons for their success.

Hierarchical Model

In a hierarchical model (1), data is organized in tree structures. Each record other than the root has one and only one parent and can have zero or more children. A pure hierarchical model allows only this kind of relationship. If an entry appears in several parts of the tree, it is simply replicated. A directed graph without cycles, as depicted in Figure 2.3-1, probably gives the best representation of how entries are organized in this model.

Figure 2.3-1: Tree graph

In general, two types of entries are distinguished: the root record type and the dependent record type. The first type characterizes a record at level zero of the hierarchy, which has no parent. The second type characterizes all the records located under the root record. They are dependent in the sense that their lifetime will never be longer than the lifetime of their parent. In the hierarchical model, each record can generally store an arbitrary number of fields which hold data.

While some real problems have a tree-like structure, the assumption that only this kind of relationship governs the world is too strong. During its history, the hierarchical model has probably suffered from this. Some people have probably abandoned it for models which seem to fit reality better.

The main implementation of the hierarchical model was made by IBM in the 1960s. This database is called IMS, which stands for Information Management System. Today, IMS is still used in industry for very large-scale applications. IBM sells it as a solution for critical online applications and continues to invest in this product and to develop new releases.

Most directory services use concepts inherited from the hierarchical model. Moreover, traces of the model are visible on every system: everybody uses hierarchical concepts to organize files and folders, so every computer user is more or less familiar with the hierarchical organization of information. Furthermore, during the last decade, the hierarchical model has found new popularity with the increasing use of formats such as XML or YAML. In a web browser, the Document Object Model (DOM) also uses a hierarchy of objects to organize the elements of a web page. Thus, this model is not in danger of disappearing. It will probably continue to be successful in the future.

Network Model

Instead of limiting the organization of data to a tree structure, the network model allows entries to be linked to one another in any direction. A directed graph, as shown in Figure 2.3-2, is probably the best representation of how data is structured in a network model. The

other properties of this model are shared with the hierarchical model. Thus we can say that the hierarchical model is a subset of the network model.

Figure 2.3-2: Network graph

Initially developed during the 1970s to overcome the lack of flexibility of the hierarchical model, the network model was very successful during that decade. This model has found hundreds of applications in different fields of computer science, such as the management of in-memory objects or bioinformatics applications. However, it seems that few people currently use it to organize their data. It remains well known in embedded applications, while large-scale applications built on it are slowly disappearing.

Relational Model

Before its definition by Codd during the 1970s, the relational model (2) had not met with much success. However, after this formal work based on set theory and first-order logic, some companies chose to implement this model. IBM was one of the first companies to take the lead in the market, with the DB2 database. Oracle is now the uncontested leader with its implementations of the relational model.

The relational model defines the concepts of relations, domains, tuples and attributes, which are more often called tables, columns, rows and fields. Interestingly, this model is now so widely taught and used that its pertinence for solving specific cases is rarely questioned.

Figure 2.3-3: Relation, domain, tuple and attribute

Some people link the success of the relational model to its mathematical foundation. However, the implementations currently in use are a far cry from the elegant concepts defined at the beginning. The main building blocks are now hidden by features provided to address practical requirements.
Thus, the success of this data model should be linked to the practical answers it gave to problems encountered in the business world during the 1980s and the 1990s (3). The normalization principle was used to save storage capacity. Furthermore, during this period, information systems were widely used for automation and monitoring tasks. The relational model offered a very good canvas to express and solve problems such as these.
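
The difference between the hierarchical and the network models presented in this section can be illustrated with a small sketch. It is only an illustration under stated assumptions: the Record class, its attributes and the library example are hypothetical and do not come from IMS or from any other product mentioned above. It shows that an entry which appears in several parts of a tree is replicated, whereas a network model lets several records link to one shared record.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical record with named fields, usable in both models."""
    name: str
    fields: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # hierarchical: owned sub-records
    links: list = field(default_factory=list)     # network: links to any record

    def add_child(self, child):
        self.children.append(child)
        return child


def count_records(root):
    """Count the records of a tree, replicated entries included."""
    return 1 + sum(count_records(child) for child in root.children)


# Hierarchical model: an author of two books is replicated under each book.
library = Record("library")
book_a = library.add_child(Record("book-a"))
book_b = library.add_child(Record("book-b"))
book_a.add_child(Record("author", {"name": "Codd"}))
book_b.add_child(Record("author", {"name": "Codd"}))

# Network model: a single author record is linked from both books.
author = Record("author", {"name": "Codd"})
net_a = Record("book-a", links=[author])
net_b = Record("book-b", links=[author])

# library + 2 books + 2 copies of the author
hierarchical_count = count_records(library)
# Both books point at the very same record object.
shared = net_a.links[0] is net_b.links[0]
```

In the hierarchical variant an update of the author has to be applied to every copy, which is one practical reason why the lack of flexibility mentioned above matters.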

3 Data model comparison

This chapter will define the JCR model and the relational model more clearly. Several aspects which relate to the foundations of the models will be presented and compared. The main purpose of this chapter is to understand the philosophy of each model.

The Model Definitions section briefly presents the main ideas underlying the models. The Structure and Integrity sections will mainly discuss the respective places of content, structure and semantics in both data models. The Operations and queries and Navigation sections will show different ways of retrieving and editing content. Throughout the chapter, particular attention will be given to the impact of the choice of a data model and to the reasons which should drive this choice.

3.1 Model Definitions

Several works and references define the data models currently in use (4) (5). Some tools are also available to understand the main concepts of these models. The purpose of this section is not to enrich these definitions; they are included simply to draw attention to some theoretical aspects required to build a common language for the comparison.

JCR Model

To organize records, this model includes concepts inherited from the hierarchical and the network models. Thus, as shown in Figure 3.1-1, records stored with the JCR data model are primarily organized in a tree structure. However, the limitations of the hierarchical model are avoided by allowing records to be linked horizontally. Attributes which point to other nodes can be stored at each level to create network relationships. This type of model permits the creation of a network within a tree structure.

Figure 3.1-1: JCR graph

Currently, some explanation of the schema underlying the data model definition can be found in the specification (4) (5).
Figure 3.1-2, based on this information, attempts to express the JCR data model more formally. It is interesting to note that at this stage, no distinction can be made between the content and the structure. In fact, the structure appears with the instantiation of items.

Figure 3.1-2: JCR class diagram
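
The model described above, a tree of nodes whose properties may also point to other nodes, can be approximated with a few lines of code. This is a minimal sketch and not the javax.jcr API: the Node class, its method names and the sample content are assumptions made for illustration, and same-name siblings are deliberately left out to keep the example short.

```python
class Node:
    """Hypothetical stand-in for a JCR node; this is not the javax.jcr API."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}    # child name -> Node (tree structure)
        self.properties = {}  # property name -> value, or a Node for a reference

    def add_node(self, name):
        # Simplification: one child per name, so same-name siblings are not modeled.
        child = Node(name, parent=self)
        self.children[name] = child
        return child

    def set_property(self, name, value):
        self.properties[name] = value

    def path(self):
        if self.parent is None:
            return "/"
        prefix = self.parent.path()
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name


root = Node("")
articles = root.add_node("articles")
authors = root.add_node("authors")

article = articles.add_node("jcr-or-rdbms")
article.set_property("title", "JCR or RDBMS")

author = authors.add_node("chapuis")
author.set_property("lastName", "Chapuis")

# A property holding a reference creates a horizontal (network) relationship.
article.set_property("author", author)

article_path = article.path()
referenced_path = article.properties["author"].path()
```

Note that no schema is declared anywhere in this sketch: the structure exists only because nodes and properties were instantiated, which is the point made about Figure 3.1-2.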

Relational Model

The relational model, which was briefly introduced in the State of the art chapter, is based on set theory. A relation as defined by Codd (2) refers to the mathematical concept of relation. In his paper, he gives the following definition of a relation: R is a subset of the Cartesian product S1 × S2 × ... × Sn.

In practice, because all these sets have to be distinguished from one another, they are identified as domains. Thus, assuming the domains of first names F, of last names L and of ages A, a Person relation is a set of tuples (f, l, a) where f ∈ F, l ∈ L and a ∈ A. Figure 3.1-3 represents a table view of this relation. In this representation, each domain corresponds to a column and each tuple to a row.

Figure 3.1-3: Relation, domain, tuple and attribute

This basic definition does not mention the ability to create associations between relations. In fact, there is no link between the name of the model and associations. The ability to express associations comes later, with the join operations defined by relational algebra. These operations will be introduced in the next sections.

Figure 3.1-4 shows a class diagram which could be used to express relations. While the pertinence of this kind of diagram can be discussed, the purpose is to give a simple and visual representation of a relation. Furthermore, parts derived from this diagram will be reused later to express the intersections between the relational model and the JCR model.

Figure 3.1-4: Relational class diagram

3.2 Structure

A rich debate about the respective places of data and structure in data models has been ongoing for several years, both on the web (6) and in academia (3). This debate can be summarized as follows: should data be driven by the structure, or should the structure be driven by data?

These discussions come from the fact that some concepts do not really fit into a predefined canvas.
A predefined canvas brings many advantages and facilities. For example, it is easier to express integrity constraints on a well-known structure. Equally, indexing or query optimization (7) can benefit from the assumption that a clear structure can always be found for a problem. However, in real-life situations, there is always an exception which does not conform to the canvas.

The following sections will situate the two models in this context. For each approach, the respective places of the data and the structure will be presented. When each strategy makes sense, and when it does not, will also be clarified.

JCR model

In Figure 3.1-2, a class diagram shows the main aspects of the JCR data model. In this figure, the instantiation of nodes, properties and values leads to the creation of content. If we try to identify the

place of the structure in this diagram, it appears that no real distinction is made between the content itself and its structure. Thus, the model proposed by JCR does not require the definition of a structure to instantiate content. Instances of nodes, properties and values can be created before defining any kind of structure. In fact, the structure appears with the content.

A parallel can be drawn between this approach and the semi-structured approach described at the end of the 1990s (8), in which no separation was made between data and their structure. This provides two possible advantages: first, a dynamic schema to store data which does not fit into a predefined canvas; second, the ability to browse the content without knowing its structure. Some modern programming languages such as Ruby or Python also give the ability to extend objects on the fly with properties and functions (reflection).

While a part of the structure appears at runtime, it is possible to define semantics which identify the main concepts. In JCR this is done with node-types. Defining semantics does not limit the capacity of a node to store any combination of sub-nodes and properties. Proceeding in this manner allows records to be created or to evolve when and as required. For example, if we want to define a semantic item for media, there is no real need to take into account all the properties which could appear under this node during the application life cycle. Each special case of media item, such as images, videos, etc., can have specific attributes which do not impact the whole set of media instances and which do not necessarily have to be specified at design time.

Relational model

Figure 3.1-4 represents a basic class diagram succinctly describing the main ideas proposed by the relational model. We see in this diagram that the concept of record, which is represented by the Element class, is separated from the structure.
Note that the paradigm in the relational model is completely different from the one proposed by JCR. A structure made of relations and domains first has to be instantiated. Then, tuples which fit into this structure can be created. While the DBA can choose the level of flexibility of the initial structure, this kind of model clearly distinguishes the data from its schema.

Separating the structure from the data brings some benefits. For example, it suits a problem-solving approach rather than a data storage approach. This is reflected in the fact that many developers create an entity-relationship model during the early phases of defining data requirements. However, in real-life situations, the assumption that content and structure can be completely separated is not always valid. For example, to handle expansion in a relational database, artificial or miscellaneous fields are often added to the structure. These can take the form of fields added to create hierarchies or fields added to define custom orderings in a set of tuples. Such conceptual entities can become difficult to describe within the confines of the structure. As the application evolves and new requirements are added, managing these additions can become difficult and dangerous. A change could even imply rethinking the whole structure of the implementation.

Content, structure and responsibility

As shown in the State of the art chapter, in classical situations, the DBA is generally responsible for the data structure. The application programmer can influence decisions made in this area but does not have the final responsibility for the structure. Finally, the user clearly has no say: his scope is limited to the functionalities developed by the application programmer to create, remove and update data.
As shown in Figure 3.2-1, choosing a content-driven approach instead of a structure-driven approach significantly impacts the respective roles of the DBA, the application programmer and the user. In fact, the DBA loses his responsibility as main owner of the structure. If the structure is driven by data, this ownership is shared with the application programmer and the user.

Figure 3.2-1: Responsibility repartition revisited

It is true that a clear separation between the content and the structure makes some aspects of data management easier. Clearly splitting the structure and the content makes it easier to define roles and to separate duties. The DBA owns the database and all the structures which allow records to be instantiated. In this context, the application programmer becomes a kind of super user with extended rights, but the user may only access what is available in the application.

This kind of scenario gives a lot of responsibility to the DBA and places him at the centre of the evolution of the database. Unfortunately, he is not necessarily in tune with the real needs of the users. It would therefore be advisable for the DBA to be responsible for aspects such as the integrity, the availability or the recoverability of data rather than for the structure or the content. In general, these should be left to the joint definition of the application programmer and the user.

Choosing the right approach

In a real working environment, some problems benefit from being driven by a structure, whereas others clearly do not fit into any predefined structure. A simple analogy may help to explain this situation. Houses are rarely built from scratch without blueprints. However, if we consider cities, there are generally no blueprints which plan their final state. So which lessons can we learn from this simple example? Are complex problems driven by data instead of by structure? Not necessarily. In the example of the house and the city, the problem can be seen as follows. For houses, because the budgets and resources available are generally known in advance, the most effective way to proceed is to define a structure before construction. For cities, because the resources and budgets available are generally not known in advance and are evolving, the most effective way to proceed is to let their structure emerge. If necessary, guidelines can be defined to control their growth.

Since information system problems involve a wide and growing community of stakeholders, and providers cannot know what will be done with their applications, these kinds of questions should be debated at the outset of the design:

- Are the users known or not?
- Is the behavior of the users known or not?
- Is the final usage of the application known or not?
- Do the entities fit into a canvas or not?

The answers to these questions are probably among the best indicators when deciding between the two approaches. The JCR model clearly advocates a structure driven by data. By creating content, items, nodes and properties, users build the structure. Database administrators and application programmers merely guide this structure by defining rules and constraints. In implementations based on a relational approach, a structure is first defined by the database administrator and the application programmer. Then the users can register content items which fit into this structure.

Depending on the use case, each data model can be useful. It basically depends on the perspective from which we wish to view the data: a fixed structure or a more flexible data-driven model. The choice of model will be based on the certainty or uncertainty of the answers to the few decisive questions listed above.
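
The contrast between a structure-driven and a data-driven approach discussed in this section can be made concrete with a short sketch. It is an illustration under assumptions rather than a recommended design: the media table, its columns and the dictionary-based records are hypothetical, an in-memory SQLite database stands in for a relational system, and plain dictionaries stand in for repository nodes.

```python
import sqlite3

# Structure-driven: the table must exist before any content can be stored.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE media (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
db.execute("INSERT INTO media (title) VALUES ('Holiday picture')")

try:
    # A property that the schema does not define is rejected.
    db.execute("INSERT INTO media (title, duration) VALUES ('Clip', 42)")
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True

# Data-driven: hypothetical node-like records; the structure emerges from content.
media = [
    {"title": "Holiday picture", "width": 800, "height": 600},
    {"title": "Clip", "duration": 42},
]

# The structure can be observed afterwards by browsing the content.
emerged_structure = sorted({name for item in media for name in item})
```

In the relational variant, storing the duration of a clip requires a schema change agreed upon beforehand, whereas in the data-driven variant the new property simply appears with the content that carries it.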

3.3 Integrity

A strong association is often made between structure and data integrity. Thus, some people are afraid of letting their users take part in the definition of the structure. However, it is more correct to say that data integrity belongs to semantics. Generally, definitions of integrity do not make any mention of structure.

A structure made of relations and domains is evidently an elegant way to express semantics. It is also a good basis on which to declare integrity constraints. Nonetheless, integrity constraints can be defined at a lower level, directly on the semantics. One advantage is that all the structures which respect the semantic constraints can be instantiated in the database, and not only the records which fit into a predefined structure.

Furthermore, as mentioned in the State of the art chapter, definitions of integrity generally make no mention of coherence. In the database world, these two concepts are often conflated. While data coherence can be preserved by integrity constraints, the integrity of a dataset is not necessarily lost if incoherent records are present in the database.

Data integrity means that no accidental or intentional destruction, alteration or loss of data should ever occur. While data integrity should be ensured at all times during the lifecycle of a database, the assumption that data coherence should have the same property is probably too strong.

Some people have the habit of handling both aspects directly in the database: everything which relates to data coherence along with integrity constraints. This ensures that data coherence is preserved in all cases. However, it also has a cost in terms of performance, because checks have to be performed each time a write access is made to the database. Therefore, a trade-off has to be made between data integrity and data coherence. A balanced approach, which can result in a better user experience, consists in identifying, sometimes arbitrarily, what relates to integrity and what relates to coherence. Data integrity will be handled with constraints at the database level. Data coherence will be handled programmatically at the application level, in a way which lightens the workload of the system.

JCR Model

An analogy can be made between the JCR model and a black list: everything is allowed unless it is explicitly restricted. The most generic node accepts any kind of children, any kind of properties and any kind of values. A mechanism is provided through the concept of node-type to let the DBA define integrity constraints.

In the JCR model, node-types are used to express semantics. Declaring constraints on these semantics allows restrictions to be placed on the nodes and on their content. Each node has a primary node-type and can have several mixin node-types which extend the primary node-type. Node-types allow constraints to be specified on the children of a node, on the properties of a node and on the values of the properties stored by a node.

Figure 3.3-1: JCR model and integrity

Using several node-types makes it possible to ensure the integrity of transitive relations in a hierarchy. For example, it is possible to define a node-type which supports only children of a specific type. The latter can also have node-types which declare constraints on their own children. Proceeding in this fashion makes it possible to require that the children of the children of a specific node have a certain type.

When integrity is mentioned, we often speak about entity integrity, referential integrity and domain integrity. These concepts relate closely to the

13 University of Lausanne & Day Software AG JCR or RDBMS 13 relational model but as shown in Figure we can find similar ways to express constraints in the JCR model. Entity integrity is ensured by the fact that basically each node is unique and identified by its location in the data model or by its UUID. Paths cannot really be considered as unique identifiers because same paths sibling are allowed for XML compatibility. Referential integrity is ensured by the fact that all the references properties of a node have to point on a referenceable node. Furthermore, a referenceable node cannot be deleted while it is referenced. Domain integrity can be ensured by forcing nodes to have specific properties which contain values in predefined ranges. Data coherence can be checked with integrity constraints but the model does not provide all the tools to do a complete coherency check. This proves that making a separation between the two areas is beneficial. Integrity should be ensured at the data model level and data coherency at the application level. Relational Model An analogy between the Relational model and a white list is appropriate. As explained in the last section, the relational approach made the assumption that structure and content have to be separated. Thus saving content is allowed only if a structure has been defined. Some integrity constraints are implicit to the relational structure. The domain constraints ensure, for example, that all the values stored in a same domain have the same type. The entity integrity constraints give the guaranty that, due to the primary key, all records in a table are unique. Furthermore, the structure is generally taken as a base on which to declare other integrity constraints. The referential integrity ensures that a foreign key domain is a subset of the pointed domain. In the same way some other integrity constraints which make use of the operations proposed by the model can be described. 
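The white-list behaviour of the implicit constraints described above can be observed with a small experiment. The following is only a sketch under stated assumptions: it uses Python's standard sqlite3 module, the book and orderline tables are hypothetical and not the schema of this paper, and the PRAGMA line is specific to SQLite, which does not enforce foreign keys unless asked to.

```python
import sqlite3


def accepted(conn, sql, params):
    """Return True if the database accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # assumption: SQLite, where FK checks are opt-in
conn.execute("CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orderline ("
    " id INTEGER PRIMARY KEY,"
    " book_isbn TEXT NOT NULL REFERENCES book(isbn))"
)

results = {
    # Entity integrity: the primary key makes each record unique.
    "first book": accepted(conn, "INSERT INTO book VALUES (?, ?)", ("111", "Databases")),
    "duplicate key": accepted(conn, "INSERT INTO book VALUES (?, ?)", ("111", "Copy")),
    # Referential integrity: a foreign key must point at an existing record.
    "valid reference": accepted(conn, "INSERT INTO orderline (book_isbn) VALUES (?)", ("111",)),
    "dangling reference": accepted(conn, "INSERT INTO orderline (book_isbn) VALUES (?)", ("999",)),
}
print(results)
```

Only the rows that fit the declared structure are stored; the duplicate key and the dangling reference are refused by the database itself, without any check in application code.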
Figure 3.3-2: Relational model and integrity

A structure which is known in advance, and whose evolution is controlled, is an elegant base on which to ensure integrity. The syntaxes which permit the expression of integrity constraints are generally derived from first order logic. The main building blocks of the relational model are based on well known mathematical disciplines, namely set theory and first order logic, which permits the expression of implementation models that share these mathematical properties. In terms of data integrity, this is an advantage because the solidity of the implementation model can be mathematically proven. In its simplicity, this way of proceeding also makes it possible to declare rules and constraints for nearly everything with short statements. As a result, solid implementation models can be declared quickly, with a high level of accuracy and a minimum of programming effort.

However, as mentioned before, the assumption that each problem can fit into a predefined structure is often too strong. Furthermore, while the relational model is able to express hierarchies and network structures, first order logic is limited when constraints have to be declared on them. In conclusion, it is often difficult to know what should be managed at the model level and what at the application level.

Integrity, coherency and responsibility

In general, DBAs are in the habit of declaring very strong structures. Their implementation models are thought of as white lists which preserve both data integrity and data coherence. However, to build generalized and flexible implementation models, only data integrity should really be constrained at the model level. Furthermore, the argument that data integrity and data coherency should both be the responsibility of the DBA reflects neither the reality nor the ideal: the many tests made at the application level on what users inject into the data testify to this.

Figure 3.3-3: Responsibility repartition revisited

Clarifying how the responsibility for such checks is divided would therefore be of enormous benefit. It would help in giving reasons for choosing a given model, and it would equally help to identify shortcuts taken on aspects of data integrity and to avoid this sort of pitfall. Furthermore, clearly dividing the responsibility for integrity and for coherence could make it easier to design applications which take into account the cost of the checks made at the data model level.

Choosing the right approach

The argument that the relational model has mathematical properties (2) which ensure rock solid data integrity is often invoked for the wrong reasons. In fact these properties are only used for very specific applications, and the integrity of an implementation model as understood here is rarely proven mathematically because it is not a requirement. The choice of the best approach should be made with regard to the responsibility given to the DBA and to the application programmer.

The following two examples illustrate this idea. On the one hand, a prison guard must control all the movements of the people in the prison during the day. In this case, a rock solid program conceived as a white list is ideal: people may only do the things that they are allowed to do. On the other hand, a tourist guide has to ensure that the travelers have a good trip by directing them and giving them the right information. In this case, a program conceived as a black list will probably give more satisfaction to the user.

Some functional cases do not benefit from being governed by a lot of constraints. Unfortunately, the relational model often leads DBAs and application programmers to design restrictive implementation models. This gives them the feeling that their application is well thought out, but often it only frustrates the users. The following questions should be honestly asked:

Do users have to be guarded or guided?
Does data coherency have to be preserved at a database level or at an application level?

Choosing the right data model is therefore not only a question of preference; it should always be based on an analysis of the use case.

3.4 Operations and queries

Query languages are close to fields such as relational algebra, first order logic or simply mathematics. Depending on the case, queries can be expressed with declarative calls or with procedural languages. In general, queries are composed of several operations which make use of the structure or of the data semantic. These operations, such as the selection, the projection, the rename and other set operations, are inherited from the disciplines mentioned at the beginning of the section. In addition to these operations, some query languages provide statements which allow data to be created, modified or deleted. This section clarifies the bounds of each model in terms of queries and operations.

JCR Model

An abstract query model is used as a basis to retrieve data in the JCR model (4) (5). This query model maps the JCR model onto the notions of relation, domain, tuple and attribute present in the relational model. The figure is a modified version of an earlier figure and shows this mapping visually.

Figure 3.4-1: JCR model, operations and queries

It seems that, in the current state, node-tuples are seen as relations, properties as domains, nodes as tuples and values as attributes. Basically, node-tuples are arbitrary sets of nodes; however, node-types are used as the main source of node-tuples in queries. While this kind of mapping cannot be considered an application of the principles of set theory, it allows some interesting queries which can satisfy nearly all requirements. The operations provided by this query model are the selection and the set operations which make it possible to perform joins between sets of node-tuples. The result of a query is composed of all the nodes which satisfy the selection condition and the join condition.

Basically, in the JCR model, queries are seen as a way to perform search requests. They provide a way of retrieving records, but the selection criteria cannot be used to sequentially delete or update those records. This limitation is not dictated by conceptual barriers and could be lifted if required.
As mentioned before, the structure and the schema are not separated in this model. Thus, some attributes of the records, such as their depth level or their hierarchical path, can be viewed as properties. This makes it easy to perform queries on things which are generally not taken into account in other models, such as transitive relationships in hierarchies.

Relational Model

Relational algebra defines the primitive operations available in the relational model (9). These operations are mainly the selection, the projection, the rename, the Cartesian product, the union and the difference. The power of this query model lies in the fact that the input and the output of these operations are always relations. Thus, it is possible to express complex and nested statements. In addition to these operations, some mathematical operators can be used. It is also possible to specify additional domains for the output relation. Some domain operations are also provided to retrieve information such as the number of attributes stored in a domain or the domain's maximal value.

The query languages provided by relational database implementations generally offer statements which allow data to be modified, created or deleted (10). Used in conjunction with the operations presented above, these statements become very useful: they provide a means of performing sequential changes on data sets which satisfy precise conditions.

The possibilities given by these operations are huge. However, limitations are encountered when transitive relationships appear (11). This sort of query cannot be expressed with first order logic statements. For example, it is not possible to define a query which retrieves all of the descendants of an element. Some other solutions are available (12), but they often add complexity to the implementation models.

Choosing the right approach

While JCR provides a means of carrying out some operations and queries, the relational model is clearly more complete in this area. In some situations, this completeness can become a decision criterion, for instance if the use case implies that complex join operations may be required.

The features offered by most relational databases, which allow operations to be used in conjunction with update and delete statements, are also a significant advantage of the relational model. For use cases which involve a lot of write accesses, this allows content to be created, updated and deleted quickly. However, caution should be taken with this type of usage when complex hierarchies are present.

3.5 Navigation

During the 1970s, Charles W. Bachman described different ways of accessing records in databases (13). Focusing on the programmer's role, he describes the programmer's opportunities to access data as follows:

1. He can start at the beginning of the database, or at any known record, and sequentially access the "next" record in the database until he reaches a record of interest or reaches the end.
2. He can enter the database with a database key that provides direct access to the physical location of a record. (A database key is the permanent virtual memory address assigned to a record at the time that it was created.)
3. He can enter the database in accordance with the value of a primary data key. (Either the indexed sequential or randomized access techniques will yield the same result.)
4. He can enter the database with a secondary data key value and sequentially access all records having that particular data value for the field.
5. He can start from the owner of a set and sequentially access all the member records. (This is equivalent to converting a primary data key into a secondary data key.)
6. He can start with any member record of a set and access either the next or prior member of that set.
7. He can start from any member of a set and access the owner of the set, thus converting a secondary data key into a primary data key.

These rules give the programmer the ability to move across datasets by following the references which structure the records. The interesting point of this approach is that the programmer can adopt access strategies without knowing the whole structure of the database. As a navigator, he explores the database.

Figure 3.5-1: Navigation path

The rules defined by Charles W. Bachman can be implemented as procedural calls made over an API or as declarative statements. The main difference between the queries mentioned in the previous section and the navigation principles defined here is the following. Queries are built over the semantic or over the structure of the data model. Navigation is independent of the semantic and of the structure, and directly uses the content. Thus, in our context, XQuery and XPath should be considered navigational languages because they use the content to navigate in XML files.

JCR Model

In the JCR model, each record stores properties which relate to the location of the item in the database. The level, the path and, under certain conditions, the unique identifier are good examples of these specific properties. The rules mentioned before are nearly all included in the model and allow navigation through the database with different types of strategies. The root node can be seen as the beginning of the database. As mentioned in the first rule, it gives the ability to sequentially access all the sub-nodes. The path and the unique identifier properties allow navigation in a way which respects the second, third and fourth rules by giving specific entry points for specific situations. The node types and the parent nodes can be seen as set owners and thus allow navigation in ways which respect the fifth, sixth and seventh rules.

These possibilities offered by the JCR model (4) (5) give the programmer a lot of flexibility. He is really able to navigate through the data and adopt strategies which allow him to find data in structures that are unfamiliar.

Relational Model

In the relational model (2), records are seen as basic tuples of values. Basically, these data structures do not know their location in the database and are not ordered in relations. To enter the database, a programmer must have a good knowledge of the schema and of the data organization. In one sense, we could say that the fifth rule defined above is fulfilled. However, because the records are not ordered, this is not really the case. Thus, the relational model does not take these rules into account at all. It only defines a way to organize data and shifts the navigation problem to a higher level.

Choosing the right approach

In terms of navigation, the two models are not comparable. The meaning given to the units of content is really different. Thus choosing the right approach depending on the use case is not really hard. If the use case involves traversal access, exploration or navigation in data, a model which includes these concepts is the better choice.

3.6 Synthesis

The two data models show fundamental differences. The choice of approach relates closely to the degree of flexibility which has to be given to the user. It also relates to the nature of the requirements, which may involve clear or abstract entities. The choice of the data model should always be made on the basis of a good analysis of the use case.

The selection of an approach also affects the main roles and responsibilities which relate to data management. All of the people using a database should be clearly informed of their roles and given usage guidelines. Particular attention should be paid to existing data usage habits, as they will have to change or evolve if a new data model is chosen. Some users could voice reticence about these changes, as conservative behavior is an obstacle when deep changes arise. The choice of the data model should not be affected by this type of reasoning. The advantages gained through good and coherent choices are enormous and can have a significant impact on the application and the development process.
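Returning to the navigation rules of section 3.5, the idea of a record that knows its own location can be sketched in a few lines. The Node class below is a hypothetical toy written for this illustration, not the JCR API: it only assumes that each record keeps a link to its parent and an ordered list of children, which is enough to enter by path, walk sequentially, and move between the members and the owner of a set.

```python
class Node:
    """A toy hierarchical record that knows its own location (not the JCR API)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []  # ordered members of the "set" owned by this node
        if parent is not None:
            parent.children.append(self)

    @property
    def path(self):
        """Location of the record, derived from its ancestors."""
        if self.parent is None:
            return "/"
        return f"{self.parent.path.rstrip('/')}/{self.name}"

    def walk(self):
        """Sequential access starting from this node (depth first)."""
        yield self
        for child in self.children:
            yield from child.walk()

    def resolve(self, path):
        """Direct entry by path, relative to this node."""
        node = self
        for part in path.strip("/").split("/"):
            if part:
                node = next(c for c in node.children if c.name == part)
        return node

    def next_sibling(self):
        """Next member of the same set (same owner), or None at the end."""
        if self.parent is None:
            return None
        siblings = self.parent.children
        i = siblings.index(self)
        return siblings[i + 1] if i + 1 < len(siblings) else None


root = Node("")
collections = Node("collections", root)
science = Node("science", collections)
physics = Node("physics", science)
biology = Node("biology", science)

all_paths = [n.path for n in root.walk()]               # start at the beginning
entry = root.resolve("/collections/science/physics")    # enter with a key
print(all_paths)
print(entry.parent.path, entry.next_sibling().name)     # owner, then next member
```

None of these calls requires knowledge of a schema: the programmer only follows links that the content itself carries, which is the property the relational model leaves to a higher level.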

4 Specification comparison

Specifications describe the features that databases should support. The main specification for relational databases is without doubt SQL, which has been revised several times since its first edition (SQL92, SQL:1999 and later revisions) and which is more or less implemented by each relational database provider. The JCR specification was released in 2005 (JSR 170) and a second version of the specification is in incubation (JSR 283). Some companies such as Day, Alfresco or Oracle provide implementations of this specification with different levels of compliance.

We could discuss the many aspects of each specification at length, but the principal objective of this document is to highlight the philosophy behind the specifications and the practical answers they give to common problems. For this reason, the examples shown in the following sections are essentially based on the SQL92 specification and on version 1.0 of JCR.

The first section of this chapter presents a use case which demonstrates how each specification can give practical answers to everyday problems. Being well balanced, it shows the possibilities and limits of each model. The four following sections essentially show how the concepts presented in the Data model comparison chapter take form in the specifications. Finally, the last section turns to practicalities by presenting features which address the more common requirements.

4.1 Use Case Definition

Consider an editor (in the sense of a publisher) who sells books and wants to create a system to manage his book collection and his orders. A book collection is composed of books and sub-collections. A book can be tagged with keywords. Through a website, the editor wants to let anonymous visitors navigate through the whole catalogue by collection. He also wants to provide a book preview to authenticated customers and partners, and to let partners view the whole digital copy of the books. In addition to the ability to navigate through collections, partners and customers should be able to search products by ISBN number, by full-text criteria, or by asking for the most successful items.

Figure 4.1-1: Editor use case diagram

Figure 4.1-1 is a draft of the use case diagram of this application. It summarizes the main actors and the main features which were identified during the design process. In the next sections, this use case will be used to point out some key aspects which differentiate relational databases from Java content repositories.

4.2 Structure

In terms of structure, the two approaches are radically different. However, it makes sense to understand how each specification makes use of the basic concepts presented in the Data Models chapter.

This can assist people in developing implementation models and in solving practical problems.

JCR Specification

Like other unstructured and semi-structured models, the JCR model does not separate data from their structure. Thus, there is no specific need to identify entities and attributes as required by relational databases. It is nevertheless important and useful to identify the semantic beforehand or, in other words, to identify the concepts represented by nodes in the content repository. This can be done by defining a node-type or by specifying an attribute which declares the type of the node. The schema depicted in the figure does not represent the structure of the repository. It simply shows how the main concepts which can be found in the structure should be organized.

Figure 4.2-1: Semantic diagram

The root can be seen as the editor system, which deals with persons, orders, order lines, collections, books and tags. This diagram does not take into account the additional artifacts which could be added to the content repository to organize data.

The most intuitive way to design this organization is to think in terms of composition, that is, of the way in which one concept is always a component of another concept. If UML class diagrams are used during the design phase, the work consists only of translating the composition relationships into hierarchies. The other associations are stored as properties holding references or paths. More tips on how to design JCR applications are available in the "JCR and design" appendix.

Even when the environment is considered to be structured, we are often unable to translate this structure clearly. Consequently, keeping the schema as weak as possible makes it easy to take new requirements into account at runtime by simply recording new data. If node-types are used as markers, it makes sense to simply let them extend the nt:unstructured node-type without adding more constraints.

    <editor = '
    [editor:person] > nt:unstructured
    [editor:order] > nt:unstructured
    [editor:orderline] > nt:unstructured
    [editor:collection] > nt:unstructured
    [editor:book] > nt:unstructured
    [editor:tag] > nt:unstructured

Table 4.2-1: Node-types

Thus, at design time there is no real need to fix all the attributes and all the entities. In this example, some decisions can be taken later by the application programmer. The general idea is simply to leave room for new requirements.

SQL Specification

As explained in the previous chapter, the relational model implies that data and their schema are separate. In practice this means that all the tables and their respective columns have to be identified at design time. During the development process, entity relationship notations are often used for this purpose.

Figure 4.2-2: Entity relationship diagram

For the editor's use case, this means that some decisions which will strongly impact the future evolution of the application need to be made. Data security and save routines must make use of the predefined columns. Everything has to be clearly described beforehand. For example, identifying what an order is, what a book is and what a customer is becomes imperative. The final application will therefore reflect all these decisions, which are often arbitrary.

Figure 4.2-2 shows a database schema which reflects the decisions taken during the design phase. In this use case, it is relatively easy to find relations and domains for the main entities such as person, order, order line and tag. At design time, their attributes can be clearly identified and it is quite easy to conceive a relational schema for them. However, the book entity is difficult to fit into a table. For example, this schema only stores the title and the description of the book, whereas the requirements also call for storing a digital copy and a preview of the book. The content of the book could be part of the database or it could be stored somewhere else in the file system. This kind of decision is completely arbitrary and has an enormous impact on the application's life cycle.

4.3 Integrity

As mentioned, integrity can have different meanings. In the database vocabulary, integrity generally relates to the fact that accidental or intentional destruction, alteration, or loss of data should not happen. It also relates to the state of completeness of data, which has to be preserved in all cases in the database. This section gives a quick roundup of the possibilities offered by JCR and SQL to deal with integrity.

JCR Specification

Data integrity can be ensured in JCR with node-types. Some predefined node-types are specified by the JCR specification. These represent different concepts which are often encountered in repositories, such as folders, files, links, unstructured nodes, etc. These node-types can be extended, and rules which the nodes are forced to respect can be defined.

In our use case, the state of completeness of data which always has to be preserved in the database does not require a lot of constraints. In real life, a person could place an order and come to take direct delivery of the product, or a special edition of a book could have no ISBN. It is often said that this kind of case has to be taken into consideration at design time. However, the corresponding decisions should not be taken at a level which is detrimental to future requirements. The only integrity constraints we might choose to define concern the orders and the order lines. For legal compliance, an order would need to store a date and an order line would need to store a unit price and a quantity. This is shown in the following table.

    <editor = '
    [editor:order] > nt:unstructured
      - 'created' (Date) mandatory
    [editor:orderline] > nt:unstructured
      - 'quantity' (double) mandatory
      - 'unitprice' (double) mandatory

Table 4.3-1: Node-type and integrity constraints

The fact that an order line can only be found under an order node cannot be expressed at the repository level. However, this constraint can be taken into account at the application level. We might also need to define a referential integrity constraint between the ordered product and the order line. The code shown in the following table demonstrates how this can be done.

    [editor:orderline] > nt:unstructured
      - 'product' (reference) mandatory

Table 4.3-2: Node-type and referential integrity

The meaning of this kind of association could be discussed at length, but keeping a strong reference between the product and the order line, which implies referential integrity, does not really make sense. A product can evolve and this sort of association would lose its meaning. Furthermore, the editor may in the future want to sell a service instead of a book. Imposing referential integrity is therefore probably excessive, and it is more realistic to accept broken references between order line and product.
The same comment can be made for the tags, which are attached through an association of a similar nature.

SQL Specification

In the relational model, the structure is separated from the content and has to be described. This leads to data models which are a representation of what the final usage of the application will be. Furthermore, because some integrity rules are implicit in the model, DBAs generally do not hesitate to define, at design time, all of the integrity rules needed to preserve the entire data coherence.

In practice, for the editor's use case, this means that some application logic can be translated into integrity constraints. With check constraints, we could ensure that the quantity attribute of an order line is always positive. With referential integrity, we can ensure that when a tag is deleted, all the links which concern this tag are also deleted. The statements in the two following tables show how this can be achieved.

    CREATE TABLE IF NOT EXISTS `mydb`.`orderline` (
      `Order_idOrder` INT NOT NULL,
      `Book_isbn` VARCHAR(45) NOT NULL,
      `unitprice` DECIMAL(11) NULL CHECK (unitprice > 0),
      `quantity` INT NULL CHECK (quantity > 0),
      PRIMARY KEY (`Order_idOrder`, `Book_isbn`))

Table 4.3-3: Table and integrity constraints

    CREATE TABLE IF NOT EXISTS `mydb`.`tag_has_book` (
      `Tag_idTag` INT NOT NULL,
      `Book_idBook` VARCHAR(45) NOT NULL,
      PRIMARY KEY (`Tag_idTag`, `Book_idBook`),
      CONSTRAINT `fk_tag_has_book_tag`
        FOREIGN KEY (`Tag_idTag`)
        REFERENCES `mydb`.`tag` (`idtag`)
        ON DELETE CASCADE
        ON UPDATE CASCADE,
      CONSTRAINT `fk_tag_has_book_book`
        FOREIGN KEY (`Book_idBook`)
        REFERENCES `mydb`.`book` (`isbn`)
        ON DELETE CASCADE
        ON UPDATE CASCADE)

Table 4.3-4: Table and referential integrity

The advantage of referential integrity constraints is not negligible. They minimize the effort made at the application level to ensure the coherence of the data stored in the database.
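The effect of these two kinds of constraint can be tried out with a small sketch. It is not the MySQL schema of the tables above: it assumes Python's standard sqlite3 module, simplified hypothetical tables with the same roles, and the SQLite-specific PRAGMA that switches foreign key enforcement on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # assumption: SQLite, cascades are opt-in
conn.executescript("""
    CREATE TABLE tag (idtag INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tag_has_book (
        tag_idtag INTEGER NOT NULL REFERENCES tag(idtag) ON DELETE CASCADE,
        book_isbn TEXT NOT NULL REFERENCES book(isbn) ON DELETE CASCADE,
        PRIMARY KEY (tag_idtag, book_isbn)
    );
    CREATE TABLE orderline (
        book_isbn TEXT NOT NULL REFERENCES book(isbn),
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    );
""")

# One tag attached to a thousand books.
conn.execute("INSERT INTO tag VALUES (1, 'science')")
books = [(f"isbn-{i}", f"Book {i}") for i in range(1000)]
conn.executemany("INSERT INTO book VALUES (?, ?)", books)
conn.executemany("INSERT INTO tag_has_book VALUES (1, ?)", [(isbn,) for isbn, _ in books])

links_before = conn.execute("SELECT COUNT(*) FROM tag_has_book").fetchone()[0]
conn.execute("DELETE FROM tag WHERE idtag = 1")  # a single statement from the application
links_after = conn.execute("SELECT COUNT(*) FROM tag_has_book").fetchone()[0]

# The check constraint refuses a non-positive quantity.
try:
    conn.execute("INSERT INTO orderline VALUES ('isbn-0', 0)")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

print(links_before, links_after, check_rejected)
```

The application issues one DELETE and writes no cleanup code, but the database removes every link row on its behalf.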
However, in the case of the tag, if the tag is assigned a thousand times, deleting one tag will imply a thousand and one write accesses. If tags change a lot, the system will probably not sustain these integrity checks. A better policy could be to allow incoherent tag assignments to survive in the database and to delete them when they are found to be incoherent during the next read access.

Specifying all the integrity constraints at the model level can lead to performance and scalability problems, and it also restricts potential uses which have not been identified at design time. Implementing a new requirement would impose a new development cycle which starts from the definition of the implementation model and finishes with the implementation of the user interface.

4.4 Operations and queries

In terms of operations and queries, we could consider the four following requirements. The editor wants to identify the top 10 best sellers. He also wants to change the status of all of the orders which satisfy some specific conditions. He wants to be able to retrieve all the books which are under a specific collection and, finally, he wants to perform full-text search on all items stored in the system.

JCR Specification

The abstract query model of JCR is implemented in several ways for different uses. Version 1.0 of JCR uses a common subset of XPath and SQL, which makes some interesting requests possible. The draft of version 2.0 declares XPath deprecated and replaces it with a query language which uses Java objects.

The first requirement, which aims at identifying the best sellers, cannot easily be expressed with JCR in one request. The reason is that domain operations such as Max and Min are not included in the specification, and joins only allow the retrieval of the books which have been ordered at least once (Table 4.4-1).

    SELECT * FROM editor:book, editor:orderline
    WHERE editor:book.jcr:path = editor:orderline.product

Table 4.4-1: Simple JCR query

As shown in Table 4.4-2, the top 10 can be obtained by running one query per book which returns its number of related orders. The totals can then be used to build the top 10.
This is good for simple queries but if connections which include domains operations are needed, the complexity of the code is extensive. The second requirement which is aimed at changing the status of some orders cannot be expressed with a single query. However, the results can be accessed and modified through the navigation API. If the selection criteria involves domain conditions or many connections this kind of query becomes very complicated. SELECT * FROM editor:order WHERE date < ' T00:00:00:000TZD' ( ) NodeIterator ni = queryresult.getnodes(); while (ni.hasnext()) { Node n = ni.nextnode(); n.setproperty("status", "closed"); } Table 4.4-2: JCR query and iteration on the result Retrieving all the books which are stored under a collection is very easy to implement (Table 4.4-3). Some properties which relate to the record (path, uuid, etc.) are accessible through XPATH and SQL. The strengths of JCR and its features are very evident in this type of situation. SELECT * FROM editor:book WHERE jcr:path LIKE '/collections/science/%' Table 4.4-3: JCR query and hierarchy JCR offers domain independent functions which allow the execution of queries on all the properties stored in nodes. As mentioned, the JCR model is unstructured, and the nodes do not have to reflect the same properties. Therefore this is a very powerful functionality for all the use cases which require full text searchs. As illustrated in Table retrieving the set of nodes which contain a specific sequence of characters is very simple. SELECT * FROM nt:base WHERE CONTAINS(*, '*computer*') Table 4.4-4: JCR query and full-text search In conclusion, the use cases which are presently characterized by a lot of join and domain operations will not really benefits from the features proposed by JCR. 
On the other hand, in terms of operations and queries, if the use cases characteristically require hierarchical queries, full-text search queries and search queries in binary content, a Java content repository would be advisable.
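
The client-side aggregation needed for the first requirement (collect the order quantities per book, then sort the totals) can be sketched as follows. This is only a minimal sketch and not part of the JCR API: the `OrderLine` class and the `topSellers` method are hypothetical names, and the in-memory list of order lines stands in for the nodes that the repository queries would return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BestSellers {

    /** Hypothetical stand-in for an editor:orderline node (book path and quantity). */
    public static final class OrderLine {
        final String bookPath;
        final int quantity;

        public OrderLine(String bookPath, int quantity) {
            this.bookPath = bookPath;
            this.quantity = quantity;
        }
    }

    /** Sums the quantities per book and returns the paths of the n best sellers. */
    public static List<String> topSellers(List<OrderLine> lines, int n) {
        // Aggregation step that the JCR 1.0 query language cannot express.
        Map<String, Integer> totals = new HashMap<>();
        for (OrderLine line : lines) {
            totals.merge(line.bookPath, line.quantity, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        // Highest total first; ties are broken by path to keep the order deterministic.
        entries.sort((a, b) -> {
            int byTotal = Integer.compare(b.getValue(), a.getValue());
            return byTotal != 0 ? byTotal : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<OrderLine> lines = Arrays.asList(
                new OrderLine("/collections/science/computer/book-a", 3),
                new OrderLine("/collections/science/computer/book-b", 5),
                new OrderLine("/collections/science/computer/book-a", 4));
        System.out.println(topSellers(lines, 10));
    }
}
```

The point of the sketch is the amount of application code involved: grouping, summing and ordering all happen outside the repository, whereas a relational database performs them inside a single declarative statement.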

SQL Specification

As explained in the last chapter, the relational model shows all of its power when the requirements involve join operations and domain operations. Furthermore, if the requirements involve a high volume of sequential changes to large sets of records, the possibilities offered by this model also respond favorably to these needs.

The first requirement, retrieving the top 10 best-selling books, can easily be expressed with SQL. Table 4.4-5 shows how this can be done with a simple join and a group clause.

    SELECT b.isbn, b.title, sum(o.quantity)
    FROM editor.book b
    JOIN editor.orderline o ON o.bookisbn = b.isbn
    GROUP BY b.isbn, b.title
    ORDER BY sum(o.quantity) DESC
    LIMIT 10;

Table 4.4-5: SQL query and simple join operation

Updating the status of the orders is also quite easy to implement with one query (Table 4.4-6). This kind of statement is very useful when sequential modifications which depend on complex conditions have to be performed on the dataset.

    UPDATE editor.`order` o
    SET o.`status` = 'closed'
    WHERE o.`date` <= curdate() - INTERVAL 1 YEAR;

Table 4.4-6: SQL update query

The third requirement is more complicated to realize. In this case, the depth of the hierarchy of collections is not known in advance, and a query built on a fixed number of joins cannot take this unknown parameter into account. One possible way to proceed is to retrieve the collections recursively with a statement similar to the one in Table 4.4-7, and then to run a query on all the books stored under the retrieved collections.

    SELECT c1.id FROM collection AS c1
    JOIN collection c2 ON c1.parentid = c2.id
    WHERE c2.id = $categoryid;

    SELECT * FROM book AS b
    WHERE b.collectionid = $categoryid[0]
       OR b.collectionid = $categoryid[1]
       OR b.collectionid = $categoryid[n];

Table 4.4-7: SQL query and recursion limitation

Nested sets can be used to avoid recursive calls.
However, the performance cost of updating the hierarchy is then unpredictable. Nested intervals (12) partially solve this problem but, like nested sets, they incur some maintenance complexity. While relational databases permit the management of hierarchies, they do not really provide the right tools for maintaining them. Application programmers tend to use frameworks to manage these requirements in a more elegant manner.

Performing full-text search queries on a relational database requires a good knowledge of the structure. In fact, only the columns specified in the statement will be considered in the result (Table 4.4-8). For complex models, alternative solutions with external indexes are often used to perform this kind of request.

    SELECT * FROM book AS b
    WHERE b.title LIKE '%computer%' OR b.description LIKE '%computer%';
    SELECT * FROM collection AS c
    WHERE c.title LIKE '%computer%' OR c.description LIKE '%computer%';
    SELECT * FROM tag AS t
    WHERE t.title LIKE '%computer%' OR t.description LIKE '%computer%';

Table 4.4-8: SQL query and full-text search limitation

Table 4.4-9 presents the non-standardized syntax proposed by MySQL for full-text search. Unfortunately, the problem linked to the structure is not really solved, and this solution does not support full-text search across multiple tables.

    SELECT * FROM book AS b
    WHERE MATCH (b.title, b.description, b.isbn, ...) AGAINST ('word');

Table 4.4-9: MySQL and full-text search

The first requests in this section show the power that can be reached by combining different operators in declarative statements. For complex models which imply sequential data modification in conjunction with domain operations, relational databases make more sense. However, the strength provided by the structure disappears when the use case involves features linked to hierarchies, networks and search on semi-structured data. Therefore, a good knowledge of the whole use case is required before making a choice between the two options.
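
The recursive retrieval of collections described for the third requirement ends up in application code. The following sketch shows that step under stated assumptions: the `CollectionTree` class and the `withDescendants` method are hypothetical, and the `parentOf` map stands in for the rows of a `collection(id, parentid)` table that a real application would load through one query per level or one query for the whole table.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CollectionTree {

    /**
     * Returns rootId together with the ids of all its descendants.
     * parentOf maps a collection id to its parent id; top-level
     * collections are absent from the map or mapped to null.
     */
    public static Set<Integer> withDescendants(Map<Integer, Integer> parentOf, int rootId) {
        // Invert child -> parent into parent -> children once.
        Map<Integer, List<Integer>> childrenOf = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : parentOf.entrySet()) {
            if (e.getValue() != null) {
                childrenOf.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Iterative depth-first walk; the depth of the hierarchy does not need to be known.
        Set<Integer> found = new LinkedHashSet<>();
        Deque<Integer> toVisit = new ArrayDeque<>();
        toVisit.push(rootId);
        while (!toVisit.isEmpty()) {
            int id = toVisit.pop();
            if (found.add(id)) { // skips ids already seen, so corrupt cyclic data cannot loop forever
                for (Integer child : childrenOf.getOrDefault(id, Collections.emptyList())) {
                    toVisit.push(child);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> parentOf = new LinkedHashMap<>();
        parentOf.put(2, 1); // science is under the root collection
        parentOf.put(3, 2); // computer is under science
        parentOf.put(4, 1); // arts is under the root collection
        System.out.println(withDescendants(parentOf, 2));
    }
}
```

The resulting set of ids is what would be substituted for `$categoryid[0] ... $categoryid[n]` in the second statement of Table 4.4-7.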

Figure 4.4-1: Unstructured entity

4.5 Navigation

In our use case, the entity book has not been clearly defined. This type of entity is difficult to pin down. It is made up of other entities which are not known in advance, such as titles, paragraphs, images, pages or covers. Furthermore, these entities can vary from one book to another. For the editor's use case, we could consider the two following types of books saved in the system. The first is a novel, essentially composed of ordered chapters, titles and paragraphs. The second is a comic, composed of ordered boards or plates.

JCR Specification

Without a doubt, navigation constitutes the main feature proposed by the JCR specification. Creating and exploring tree or network structures is not always easy, and navigation simplifies this. The API proposed by the JCR specification allows navigation in and through records with direct access or traversal access. A session is the main entry point of the repository and provides traversal access to the root node and direct access to each node by using its UUID or path. Each item of the repository also provides navigational functionalities which make use of direct access through relative paths, or traversal access through children, properties, references or parents (Table 4.5-1). The API also provides write features, so that new nodes, properties and values can easily be created and saved.

    session.getRootNode();
    session.getNodeByUUID("uuid");
    session.getItem("path");
    Node.getNode("name");
    Node.getNodes();
    Node.getProperty("name");
    Node.getProperties();

Table 4.5-1: JCR navigation API

As mentioned, in our use case the entity book cannot be completely defined at design time. That is why the application programmer should give the user the ability to decide what a book is at the time of entry.
At the moment of creation, the application programmer does not need to be concerned with the types of entities which are present in a book. He will let the user define them at a later stage. The book can then be rendered by displaying the arrangement of its components.

    public void displayBook(Node book) throws RepositoryException {
        this.traverse(book);
    }

    public void traverse(Node node) throws RepositoryException {
        NodeIterator nodeIterator = node.getNodes();
        displayNode(node);
        while (nodeIterator.hasNext()) {
            traverse(nodeIterator.nextNode());
        }
    }

    public void displayNode(Node node) {
        // display logic...
    }

Table 4.5-2: JCR traversal access

The methods shown in Table 4.5-2 illustrate the advantages that can be gained by using navigation. Several possibilities are now open to the application programmer. He could provide tools to let the user store the display logic directly in the nodes, giving the maximum flexibility. This kind of strategy can be adopted through features proposed by the JCR specification. A framework such as Sling can facilitate this task.

SQL Specification

As mentioned, the relational model does not take navigation into consideration and leaves the responsibility of implementing these features to the programmer. Furthermore, all the entities have to be defined at design time, and semi-structured data is not catered for. For the editor's use case, the implication is that the application programmer will face some problems if he is not able to define an abstract entity for the content of the book. Figure 4.5-2 shows how the application programmer could design his relational model to take into account that the structure of the book only becomes concrete at the time of input.

SQL does not standardize mechanisms which simplify the navigation through records during a session. Furthermore, there is no real notion of a current position in the database which is kept during a session and which can simply be reused. To navigate, the application programmer is obliged to build a mechanism which is able to perform dynamic queries on the model. Therefore, even if the model is extremely abstract and able to take into account all the possible situations, the application programmer is forced to develop all the application logic needed to navigate the structure. This task is by no means trivial.

It is possible to make an implementation model which adds artifacts or miscellaneous entities to the records to create hierarchies, networks or explicit orders. However, this approach exposes the application programmer to design flaws which are very difficult to correct once the system is in production.

4.6 Transactions

In the current context, we can identify two levels of transaction.
Transactions which deal with a single resource and ensure that a sequence of changes is treated as a unit of work are called local. The others, referred to as global transactions (14), deal with several resources and require a coordinator or transaction manager to make sure that the changes can be committed to all the resources involved.

Figure 4.5-2: SQL and unstructured entity

Figure 4.6-1: Global and local transaction
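
The role played by the coordinator in a global transaction can be illustrated with a toy two-phase commit. This is only a sketch under stated assumptions: the `Resource` interface, the `FakeResource` class and the `commitAll` method are hypothetical names which do not correspond to JTA or to any JCR or JDBC type, and a real transaction manager also deals with recovery, timeouts and failures during the second phase, which are ignored here.

```java
import java.util.Arrays;
import java.util.List;

public class GlobalTransactionSketch {

    /** Hypothetical transactional resource (a repository, a database, a message queue). */
    public interface Resource {
        boolean prepare();  // phase 1: vote on whether the pending changes can be committed
        void commit();      // phase 2: make the pending changes permanent
        void rollback();    // phase 2: discard the pending changes
    }

    /** In-memory resource used only to exercise the coordinator. */
    public static final class FakeResource implements Resource {
        private final boolean vote;
        public String state = "pending";

        public FakeResource(boolean vote) {
            this.vote = vote;
        }

        @Override public boolean prepare() { return vote; }
        @Override public void commit() { state = "committed"; }
        @Override public void rollback() { state = "rolled back"; }
    }

    /** Commits every resource only if all of them vote yes; otherwise rolls all of them back. */
    public static boolean commitAll(List<? extends Resource> resources) {
        boolean allPrepared = true;
        for (Resource r : resources) {
            if (!r.prepare()) {
                allPrepared = false;
                break;
            }
        }
        for (Resource r : resources) {
            if (allPrepared) {
                r.commit();
            } else {
                r.rollback();
            }
        }
        return allPrepared;
    }

    public static void main(String[] args) {
        FakeResource repository = new FakeResource(true);
        FakeResource database = new FakeResource(false);
        boolean committed = commitAll(Arrays.asList(repository, database));
        System.out.println(committed + " " + repository.state + " " + database.state);
    }
}
```

A local transaction corresponds to the degenerate case of a single resource, where no vote is needed and the resource can decide on its own.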

JCR Specification

The JCR specification covers both cases. Locally, if the application programmer deals with only one repository instance, he can ensure that a sequence of changes is treated as a unit of work. All the changes made between two save calls can be considered as a unit of work.

    Session.save();
    Item.save();

Table 4.6-1: JCR and local transaction

In an application, a content repository can be used as a resource in conjunction with other resources such as a relational database, a messaging service or something else. The specification mentions that a repository implementation can be used in conjunction with the Java Transaction API (JTA). In a Java container, when the Transaction API is used, the changes made on the JCR resource only become permanent at the end of the transaction.

    // Get user transaction (for example, through JNDI)
    UserTransaction utx = ...
    // Perform some changes in a java content repository
    // Perform some changes in a relational database
    // Commit the user transaction
    utx.commit();

Table 4.6-2: JCR and global transaction

SQL Specification

The SQL specification allows statements to be grouped as a unit of work. These statements only become permanent in the database if they all succeed. Local transactions such as the one shown in Table 4.6-3 are therefore part of the standard.

    START TRANSACTION;
    (Statement list ...)
    COMMIT;

Table 4.6-3: SQL and local transaction

However, using the database in conjunction with other resources is not taken into account by the specification. Some implementations provide statements to manage this kind of scenario, such as the XA statements of MySQL. All the same, this is more often handled at a higher and more standardized level. Some APIs provide these features, and most JDBC drivers can therefore be used with JTA.

4.7 Inheritance

To enrich our use case with a wider range of associations, we could consider a new requirement which involves inheritance. The editor wants to differentiate between his collaborators, his partners and his customers, but he also wants to take into consideration that an individual can have several roles.

Figure 4.7-1: inheritance semantic

JCR Specification

For the inheritance requirement, node types and mixin types can be used. For example, let us consider a person node type and three mixin types, respectively customer, collaborator and partner. By taking one or more mixin types, a node which has been defined as a person can take on all the roles encountered in the system.

    <editor = '
    [editor:partner] > editor:person mixin
    [editor:collaborator] > editor:person mixin
    [editor:customer] > editor:person mixin

Table 4.7-1: node-types and inheritance

The primary advantage is that queries made on the person node type return all the nodes of this type, including the nodes which inherit from it. All the properties of the returned nodes are immediately accessible, and a node which was not considered as a person can also acquire this status through a mixin type.

SQL Specification

Inheritance tends to be handled at the application level. Some relational databases, for example PostgreSQL, provide extensions which manage inheritance. However, these tools are not standardized and tend not to be used in practice. A classical way to handle this requirement consists of creating a table for each entity which inherits characteristics from the person entity. The identifier of each of these sub-entities is a foreign key which points to the person table. Figure 4.7-2 represents how this could be implemented with SQL.

Figure 4.7-2: SQL and inheritance

It is quite easy to create a query which retrieves the entire set of persons and all their inherited properties. The one depicted in Table 4.7-2 shows how this can be done with left outer joins. Additionally, a view can be created to avoid having to rewrite the query.

    SELECT * FROM person p
    LEFT OUTER JOIN partner pa ON pa.id = p.id
    LEFT OUTER JOIN collaborator co ON co.id = p.id
    LEFT OUTER JOIN customer cu ON cu.id = p.id;

Table 4.7-2: SQL query and inheritance

This could lead to the conclusion that both approaches are approximately equal in expressing this kind of association. In practice, JCR is somewhat more flexible, because each node can inherit from several mixin node types. It should be noted, however, that this advantage relates more to the semi-structured approach than to inheritance itself.

4.8 Access Control

Access control can be defined as the action of authorizing or denying access to, modification of and creation of records. While this is nearly always a requirement in business applications, specifications rarely address real-world situations. In the editor's use case, it was mentioned that a person should be able to see a digital preview of the book and, under certain conditions, the whole book. This implies that the components of a book can have different access policies.

JCR Specification

Since version 1.0 of JCR (4), access control has been one of the core features. In its first release, the specification only declares how to log in to the repository and how to check the permissions attributed to the items of the repository. The hierarchical path of the items stored in the repository is used as the basis for checking these permissions. However, the specification does not specify how access control should be implemented and managed.

    Repository.login(Credentials cred);
    Session.checkPermission(String absPath, String actions);

Table 4.8-1: JCR 1.0 and access control

Version 2.0 of the specification (5) defines how privileges and access control policies work in the repository. Each item stores properties which relate to privileges. These properties can be modified through the API. Thus the access control feature can be delegated to the content repository, which is able to manage the list of permissions at the item level.

    Session.getUserManager();
    UserManager.addUser(...);
    UserManager.addGroup(...);
    Session.getAccessControlManager();
    AccessControlManager.getApplicablePolicies(path);
    Policy.addEntry(...);
    AccessControlManager.setPolicy(path, policy);

Table 4.8-2: JCR 2.0 and access control

In both cases, this means that for the editor's use case, the application programmer only has to define the structure and to use the features provided by the repository to manage access control. The access control granularity proposed by the API is close enough to the data to address all the potential use cases. Consequently, no further access control logic is required.

SQL Specification

In SQL, access control is basically managed with the data stored in the information schema (10). This provides the ability to grant and deny privileges at a table or column level. However, while the base functionalities provided by SQL allow the declaration of implementation models which manage permissions at a record level, no standard solution is provided. This comes from the fact that the identifiers of the records in a relational database can be distributed across several domains. Preserving this property makes it difficult to specify a generic way to manage access control at a record level.

For the editor's use case, managing the readability of the information of which a book is composed requires access control to be administered at a record level. Because the SQL specification does not provide this feature, the application programmer must include it in his implementation model. Figure 4.8-1 shows a solution where each record has a unique identifier stored in a column. The record_controller table allows the identification of the accessible resources within the database. The record_accessor table allows the identification of the persons accessing the database, who can then be stored elsewhere in the database, in a user or a group table. This model still means that the application programmer must implement and manage the logic which performs the privilege checks.

Figure 4.8-1: JCR and access control

4.9 Events

Another requirement often encountered concerns the observation of the changes which can be applied to a dataset.
At the infrastructure level, messaging services are common examples of components which make use of these types of events. Some use cases benefit from being event driven; one such case is the management of workflows. The editor's use case could also benefit from this approach. For example, the editor may want to notify some clients each time a new book is added to a specific collection.

JCR Specification

The JCR specification provides an EventListener interface which can hold any operation that has to be performed when a specific event occurs. These listeners can be registered for different types of event (for example, when nodes are added or removed), for events which occur under a particular path or at a specific level, and for events which occur on the instances of a node type or on a single node identified by a UUID. The code example presented in Table 4.9-1 shows how an event listener can be registered for all the events which occur when a book is added to the computer collection.

    ObservationManager om = session.getWorkspace().getObservationManager();
    EventListener el = new EventListener() {
        public void onEvent(EventIterator ei) {
            System.out.println("A book has been added");
        }
    };
    String[] nt = { "editor:collection" };
    om.addEventListener(el, Event.NODE_ADDED, "/collections/science/computer", true, null, nt, false);

Table 4.9-1: JCR and observation

This observation mechanism allows listening to events with a fine granularity. Furthermore, the fact that the observation mechanism is provided directly through a Java API instead of a specific procedural language allows a high level of interaction between the application and the repository. However, an important aspect is that the listeners are not persistent. This means that if the repository is restarted, all the listeners have to be registered again. In certain situations, especially when the event listeners are registered at runtime, recovering the application's state can be difficult and complex.

SQL Specification

The SQL specification addresses the observation problem with triggers. One of the main advantages of triggers is that they are stored in the information schema. This ensures that the state of the database, including the triggers, can easily be recovered. Triggers can be registered for insert, update or delete operations on specific tables. The body of the trigger generally contains procedural statements which can be executed before or after these operations.
    CREATE TRIGGER editor.book_insert
    AFTER INSERT ON editor.book
    FOR EACH ROW
    BEGIN
        (Statement list ...)
    END;

Table 4.9-2: SQL and triggers

For the editor's use case, the trigger shown in Table 4.9-2 listens to the registration of new books. However, it is not possible to listen only to the events which occur in a subset of the table. In addition, there is no standard way to propagate the event from the procedural language to the application. Hence triggers are mainly used to modify data in the database following inserts or updates.

4.10 Version control

Version control is often an issue when people are collaborating on the same data. It is therefore prudent to keep the history of an object and to give the user access to its evolution. For the case in question, we could imagine that, after a certain lapse of time, the editor decides to manage the different versions and editions of the books in the system.

JCR Specification

Version control is one of the features which characterize content repositories that are fully compliant with the JCR specification. The JCR specification includes versioning as a part of the standard. It can be supported for individual items and for hierarchies of items. This simplifies the life of application programmers who normally have to deal with these kinds of needs. As shown in Table 4.10-1, managing the versions of a hierarchy does not require an enormous effort.

    // mixin versioning type
    book.addMixin("mix:versionable");
    session.save();
    // version creation
    book.checkout();
    book.addNode("chapter1");
    session.save();
    book.checkin();
    book.checkout();
    book.addNode("chapter2");
    session.save();
    book.checkin();
    book.checkout();
    book.setProperty("isbn", " ");
    session.save();
    book.save();
    book.checkin();
    // get the second version
    VersionIterator vi = book.getVersionHistory().getAllVersions();
    Version v;
    v = vi.nextVersion();
    v = vi.nextVersion();
    // restore the second version
    book.checkout();
    book.restore(v, true);

Table 4.10-1: JCR and version control

SQL Specification

Some relational database implementations provide versioning functionalities. However, versioning is not part of the SQL standard. Anyone wishing to build an interoperable application has to include versioning in the implementation model. Properly managing complex graphs in relational databases is quite difficult. So, while versioning could be implemented, this task would not usually be undertaken with SQL.

4.11 Synthesis

It seems that for both specifications the structural part and the integrity part are well defined. However, while the relational model provides very clear foundations for operations and queries, the JCR specification seems to provide operations and queries on a relatively obscure basis. The converse remark can be made for navigation. While the JCR specification provides a strong navigational basis, the latest versions of the SQL specification have difficulty providing a coherent set of features which take this aspect into consideration. Improvements could be made in these areas for both models, with each borrowing from the other.

Another key aspect is the difference in approach between the two specifications. Generally, it appears that the JCR specification is more pragmatic than the SQL specification. The features provided by JCR give practical answers to common and recurrent problems. Providing a standard way to solve recurring problems in a natural and elegant manner is not obligatory, but doing so protects the application programmer from design failures.
Such failures could relate, for example, to the management of versioning or access control. While relational databases built on the SQL specification have the potential to represent all the types of use cases which could appear in real life, they are often badly constructed because of the constraints which govern a project's evolution or lifecycle. This does not detract from the fact that the relational model contains a complete set of main building blocks for a database. At the specification level, SQL makes extensive use of its base components to express its various extensions. Two conclusions can be drawn from this: first, that a specification's foundation should be able to handle all kinds of use cases and, second, that a specification should evolve by building onto its foundation and not away from it.

5 Development process comparison

Figure : Agile and iterative development process

Another perspective is taken in this chapter to compare relational databases and Java content repositories. The purpose is to show the key differences between the data models which impact the application's development process. These differences cannot really be measured but are significant enough to be mentioned.

Agile development processes such as Extreme Programming, Rational Unified Process or Open Up divide project life cycles into steps such as inception, elaboration, construction and transition. These phases can be executed iteratively. The process depicted in the figure "Agile and iterative development process" summarizes a possible segmentation of the time taken for the Open Up development process. The following sections refer to these steps. The purpose is to show where and how both models, JCR and relational, can impact this process.

5.1 Data Understandability

Making architectural and implementation models understandable is one of the key aspects of the elaboration phase. A clear architecture which can easily be communicated allows people to get into the project more quickly. It is also easier to define tasks and duties if the architecture is clear and made of separate modules.

Generally, the architecture is defined or refined by an architect or an analyst during the elaboration stage. This actor takes the requirements identified during the inception phase as input and delivers blueprints which explain the behavior of the system at different levels. At the application level, these blueprints generally include use case diagrams, collaboration diagrams or class diagrams. To show how the application's data is persisted, these schemas are often translated into database schemas which take the properties of the data model into account.

JCR development

As mentioned, the structure and the content are indivisible in JCR. However, it is possible to define a semantic which shows how data and structure will be instantiated. In this semantic, some aspects of the content can be omitted. For example, if a semantic item has an unstructured basis, all possible properties can be saved under it. Thus, there is no need to mention them if they are not mandatory or do not have to respect specific constraints. It is enough to declare them in the application's schemas, for example in a class diagram. Thus, the semantic diagram of a Java content repository says less than the other architectural diagrams. This benefits its readability. In fact, reading the semantic of a repository gives a snapshot of the final application and helps to understand its general behavior.

Figure 5.1-1: JCR translation

Another interesting aspect is that the complexity of the JCR semantic is not multiplied by many-to-many relationships. No intermediary nodes or artifacts are needed to represent these associations. Thus, these diagrams are very close to the other architectural schemas. No translation rules are needed to create them.

Relational development

Class diagrams can be used as input to generate relational schemas. Entity-relationship diagrams (15) or Crow's Foot diagrams are often used to represent them. Translation rules are generally needed to produce these schemas. Far from summarizing the architecture, they enumerate in great detail all the aspects of the final application.

Figure 5.1-2: SQL translation

Everything has to be explicitly mentioned in these database schemas. Only the records which respect the data structure can be instantiated in a relational database. Thus, it is necessary to define this structure carefully and to make it fit perfectly with the application architecture.

Many-to-many associations cannot be represented in relational database schemas without reification. This means that many-to-many associations always require intermediary entities. Consequently, the internal complexity of a relational schema increases faster than the complexity of the other architectural diagrams. Thus, these schemas do not really help to understand the application. They are more often used as implementation blueprints.

5.2 Coding Efficiency

The construction phase of a development process is highly influenced by efficiency. Coding requires time, resources and money. These parameters are very sensitive. Furthermore, if developers have to write twice as much code, there is a high probability that they will make more than twice as many programming errors. Thus, efficiency also impacts quality.

Measuring coding efficiency involves some soft parameters. The programmer's education and knowledge should be taken into account. Furthermore, the semantics and the readability of the code are also significant. These parameters make it difficult to judge a technology's efficiency. Without going too deep into these questions, the following sections contain useful information which can be taken into consideration when making a decision in this area.

JCR development

Programmers are generally not familiar with the JCR API and do not really know the best practices linked to content repositories. However, the API is in large part self-explanatory, and people are generally used to thinking in terms of hierarchies. These factors should make JCR easy to learn. Interactions are possible between the query part of the API and its navigational part. One of the big advantages of JCR lies in the fact that these aspects are merged coherently and are not considered as different abstraction levels.

The quantity of code depends heavily on the use case. If complex join operations are mainly required, JCR will not be an efficient choice. However, if navigation is required, the size of the code will be much smaller. If special requirements such as versioning or fine-grained access control are needed, it becomes clearly difficult to reach the level offered by JCR with another technology.

Relational development

Nearly all programmers are familiar with the relational model, and it has been widely used in recent years. Thus, SQL and APIs such as JDBC are part of the common language. In real-world situations, this general knowledge often favors the relational model. Nevertheless, some problems need to be treated in a specific manner, and the intuitive approach often gives bad results. If complex operations are required by the use case, the relational model should not be bypassed.
The completeness of the queries and the range of operations make it very efficient in terms of code quantity. However, if the use case implies requirements such as navigation or versioning, the developer will have to add some artifacts to his implementation model to manage aspects such as tree structure or order. He will also face the problem of having to implement a large amount of application logic. Thus, in terms of efficiency, the choice of the model should be driven by an honest analysis of the properties of the use case.

5.3 Application Changeability

Requirements which appear during the development process are often difficult to include in a previously defined architecture. Modern software development processes generally address this problem with iteration cycles (16). Well managed, iterations should allow new requirements to be included efficiently. However, because each logic level is generally impacted by architectural changes made during the elaboration phase, late iterations are more expensive than early iterations. Clearly decoupling the logic levels can reduce this increasing cost. Thus, data models which can transparently accept changes are really appreciated. To make this point, we will consider how simple changes impact the data logic of a system.

JCR development

As mentioned in the Data Understandability section, repository schemas summarize the other architectural diagrams. While this could appear meaningless, it is really not the case. Keeping the repository schema as weakly structured as possible allows new requirements to be included without touching the data logic level. Only the application logic level is impacted. Thus, adding a property at the application level does not necessarily require a change to the repository's organization. To be sure, deep changes do impact the data logic, and JCR does not provide a magic solution for them either. JCR nevertheless allows most of the data logic to be decoupled from the application and the interface levels.
It is also interesting to note that frameworks like Sling allow the application logic to be decoupled from the interface logic in a similar manner. This approach is clearly an attractive one, especially in environments driven by change and agility.

Relational development

Nearly every modification made to the overall architecture will impact the data logic level. This comes from the fact that relational databases do not allow the instantiation of elements which have not previously been defined in the structure. Thus, there is a high probability that a change made in a form of the interface or in the application logic will require changes to the data model. Some frameworks provide tools to automate these changes. However, if the system has a production version, the change, once executed, will have a big footprint on all the items in the database. Furthermore, classical model-view-controller frameworks do not really decouple the application level from the interface. For example, a change made to a controller will often impact views and models.

5.4 Synthesis

At a project level, people are often looking for solutions which allow changes to be integrated quickly into their environment. In situations where changes have to be performed, the semi-structured nature of JCR will certainly be appreciated. Furthermore, the inclusion of features such as navigation, versioning or access control can save a lot of time. Nevertheless, it is important to keep in mind that the efficiency of both solutions depends to a large extent on the nature of the use case. The agility of JCR should not influence this aspect. Furthermore, agility is linked in no small way to the project team. Thus, saying that JCR is a way to achieve agility is too big a shortcut. In all cases, the choice of a database technology should be discussed during the inception and elaboration phases of the first iteration of the development process. This can be done by weighing the different parameters. Changing the persistence technology cannot easily be achieved after the first iteration. Consequently, this choice will have a strong impact on the rest of the project.
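The changeability argument of Section 5.3 can be illustrated with a minimal sketch. It does not use the JCR API: FlexibleNode is a hypothetical, simplified stand-in for a repository node, used only to show that a weakly structured item accepts a new property without any prior structural declaration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a repository node; this is not the JCR API. */
final class FlexibleNode {
    private final String name;
    private final Map<String, Object> properties = new LinkedHashMap<>();

    FlexibleNode(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    // Any property can be set at run time; nothing has to be declared beforehand.
    void setProperty(String key, Object value) {
        properties.put(key, value);
    }

    Object getProperty(String key) {
        return properties.get(key);
    }

    boolean hasProperty(String key) {
        return properties.containsKey(key);
    }

    public static void main(String[] args) {
        FlexibleNode post = new FlexibleNode("post-1");
        post.setProperty("title", "Hello");
        // A requirement that appears in a later iteration: only application code changes.
        post.setProperty("subtitle", "Added later");
        System.out.println(post.hasProperty("subtitle"));
    }
}
```

In a relational schema with fixed columns, the same change would typically require a structural change such as an ALTER TABLE statement before the new attribute could be stored, which is the coupling described above.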

6 Product comparison

Choosing between database products implies the use of different criteria. We can mention compliance with a standard, the additional features proposed by the provider, the support offered by a company or by a community, or the scalability of the solution. All these criteria are important. They should be weighed carefully and a choice made depending on the situation. In our context, basic and significant differences distinguish Java content repositories from relational databases. Thus, the decision to employ one technology instead of the other should be taken at a lower level. However, at the product level, people often ask whether, in terms of performance, they should use a relational database or a Java content repository to manage their hierarchical information. This section tries to answer this question by recalling some basic theoretical concepts which relate to data structures and to the cost of associations. Then, at a more practical level, a benchmark of several database products checks whether these assumptions hold.

6.1 Theoretical analysis

In general, database products use basic data structures to manage their data. This section recalls simple concepts which relate to these structures and to the cost of associations made between data items. The goal is to determine whether a product's performance will be significantly impacted by the underlying approach.

Hierarchical and network database

In the hierarchical and network models, associations are made by storing references or pointers between items. The advantage of this kind of structure is that, because each node stores direct references to other nodes, a constant number of read accesses is needed to go from one node to its target. Creating an association between two nodes also has a constant cost because the number of operations needed is always the same. Thus, the cost of crossing and creating associations is constant and can be written O(1) in big O notation. Some people say that these associations are pre-computed.

Several strategies allow the representation of directed graphs such as those needed by the hierarchical and network models. The most classical representations are adjacency lists and adjacency matrices (17). Generally, the choice between the two is made simply by analyzing the density of the graph. If the number of arcs is close to the square of the number of nodes, an adjacency matrix gives the better result. However, the JCR model is mainly driven by hierarchical associations. In this context, the number of arcs is not much larger than the number of nodes. Thus, an adjacency list uses memory more sparingly by requiring only the space needed to store the associations. It is also interesting to note that this kind of organization makes it easy to give an order to the children of a node.

Figure 6.1-1: A hierarchy and its adjacency matrix

Implementing this in a programming language can be accomplished with several data structures such as arrays, maps or hash tables. Other solutions could also be presented, but the main idea is that crossing an association has a constant cost and that crossing a graph has a cost proportional to the number of nodes and arcs traversed. Thus, managing this kind of data is cost effective.

Relational database

In the relational model, associations are made between relations by computing the matching values stored in two domains. This allows the expression of all imaginable associations between two or more data sets. What is the cost of computing and creating associations in a relational database? To compute an association, a relational database has to cross the targeted set to find the matching values. In this case, the cost of the association is O(n), with n the number of tuples stored in the source and in the target. However, most database products provide indexing facilities such as B-tree indexes. So, in most cases, finding the matching entries has a cost of O(log(n)). While B-tree indexes are good, some articles (18) argue that in the network models, because associations are pre-computed, it is possible to reach better performance. However, in most cases no comparison operator other than = is needed to express relationships such as those found in a hierarchical or network model. Consequently, hash indexes can be used on the domains which constitute the association. If the relational database provides a good hash index implementation, the cost of retrieving data through associations will be close to O(1). Adding new items to the targeted sets and to the index also has a constant cost of O(1). Thus, there are virtually no significant differences between the associations of the relational model and those of the hierarchical model.

6.2 Benchmark

The previous section has summarized a huge problem very succinctly, and too quickly. However, the main point to keep in mind is that intolerable differences should not appear whether hierarchical data is managed with a content repository or with a relational database. The following benchmark has been carried out to verify this assumption.

Four products are included in this benchmark. CRX is a native implementation of the JCR specification. The persistence of the items is managed with a proprietary technology which is based on the tar file format (19) and implemented in Java. H2 and Derby are two open source relational databases written in Java. MySQL is one of the most widely used open source databases.

A simple wrapper has been defined for this benchmark. This wrapper offers basic functions to create trees made of nodes and properties. The CRX wrapper directly uses the functionality provided by the API. The SQL wrapper uses a simple database schema: one table stores the nodes and the other stores the properties. The associations between items are managed with a parent foreign key, and the default indexes of the database are used on all fields. JDBC is used to run the queries, with prepared statements to avoid parsing the SQL statements each time.

The benchmark is composed of four parts which all measure the time required to perform an operation in hierarchies of different sizes. Each node of these base hierarchies has 5 sub-nodes and 5 properties, except leaves, which only have 5 properties. The first hierarchy has one level. Each following one includes one more level. The tests have been run 5 times on a Dell Latitude D820 running Windows XP (processor: Intel Core Duo 2.00 GHz, virtual memory: 2.00 GB). The average result is used in the following diagrams.

Writing the hierarchy

[Chart: Writing the hierarchy. Series: crx, h2, mysql, derby. X-axis: Items. Y-axis: Milliseconds.]

This test measures the time required to create the base hierarchy. The values shown correspond to the time needed to write one item of the hierarchy. While the differences seem huge, the time per item stays constant for every product. The assumption that native implementations of JCR and relational databases should be equivalent in terms of performance holds in this case. MySQL cannot be embedded in the application. This has a high impact on the result. H2 does not appear in the chart because its write performance is too good.

Reading the hierarchy

[Chart: Reading the hierarchy. Series: crx, h2, mysql, derby. X-axis: Items. Y-axis: Milliseconds.]

This test consists of reading once all the items of the base hierarchy from the root to the leaves. The values displayed in the chart correspond to the average time needed to read one item of the hierarchy. For most databases the results seem to be constant. Derby is simply out of range. When recursive queries are performed on this database, the results are not tolerable.
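The cost argument of Section 6.1 and the layout used by the SQL wrapper can be sketched side by side. The following is only an illustration under stated assumptions: TreeNode, Row and the method names are hypothetical, the "table" is an in-memory list, and a HashMap stands in for a hash index on the parent foreign key. It shows that walking a hierarchy through direct references and walking it through an indexed parent key visit the same items, with an expected constant cost per association in both cases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AssociationCostSketch {

    /** Pointer-based layout: each node holds direct references to its children. */
    static final class TreeNode {
        final int id;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(int id) {
            this.id = id;
        }
    }

    /** Relational-style row: the association is a value (parentId), not a reference. */
    static final class Row {
        final int id;
        final Integer parentId; // null for the root

        Row(int id, Integer parentId) {
            this.id = id;
            this.parentId = parentId;
        }
    }

    /** Counts nodes by following references; each hop is a constant-time access. */
    static int countByReferences(TreeNode root) {
        int count = 1;
        for (TreeNode child : root.children) {
            count += countByReferences(child);
        }
        return count;
    }

    /** Builds a stand-in for a hash index on the parentId column: parentId -> child rows. */
    static Map<Integer, List<Row>> indexByParent(List<Row> table) {
        Map<Integer, List<Row>> index = new HashMap<>();
        for (Row row : table) {
            if (row.parentId != null) {
                index.computeIfAbsent(row.parentId, k -> new ArrayList<>()).add(row);
            }
        }
        return index;
    }

    /** Counts nodes by looking up children through the index (expected O(1) per lookup). */
    static int countByIndex(int rootId, Map<Integer, List<Row>> index) {
        int count = 1;
        List<Row> children = index.getOrDefault(rootId, Collections.<Row>emptyList());
        for (Row child : children) {
            count += countByIndex(child.id, index);
        }
        return count;
    }

    public static void main(String[] args) {
        // Same hierarchy in both layouts: 1 -> {2, 3}, 2 -> {4}
        TreeNode root = new TreeNode(1);
        TreeNode two = new TreeNode(2);
        root.children.add(two);
        root.children.add(new TreeNode(3));
        two.children.add(new TreeNode(4));

        List<Row> table = new ArrayList<>();
        table.add(new Row(1, null));
        table.add(new Row(2, 1));
        table.add(new Row(3, 1));
        table.add(new Row(4, 2));

        System.out.println(countByReferences(root));
        System.out.println(countByIndex(1, indexByParent(table)));
    }
}
```

Without the index, each lookup of the children of a row would have to scan the whole table, which corresponds to the O(n) case described in the text.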

Randomly writing the hierarchy

[Chart: Randomly writing the hierarchy. Series: crx, h2, mysql, derby. X-axis: Items. Y-axis: Milliseconds.]

The test consists of randomly writing 100 sub-hierarchies in the base hierarchy. Each sub-hierarchy has a depth of 2 levels. Each node has two sub-nodes and two properties. Thus, each sub-hierarchy is composed of 21 items. The values shown are the average time required to create all the items of one sub-hierarchy. The results are quite similar to those of the first test. The good point is that all the databases have constant results.

Randomly reading the hierarchy

[Chart: Randomly reading the hierarchy. Series: crx, h2, mysql, derby. X-axis: Items. Y-axis: Milliseconds.]

The test consists of randomly reading 100 nodes and their descendants on two levels in the base hierarchy. The values shown are the average time required to read one node and its descendants. As in the second test, Derby is simply out of range. The same problem is encountered with recursive queries. It appears that CRX is well optimized for these situations. To be really conclusive, this test should be run on bigger hierarchies. However, the difference between the results is constant, and relational databases do not show extremely bad performance for recursive queries.

6.3 Synthesis

As shown in this chapter, performance should not be used as the main argument for choosing one technology over another. The aspects mentioned in the previous chapters are more important. The choice should relate to the nature of the problem to be solved and not to the nature of the product. The assumption that relational databases are able to manage hierarchical data effectively holds. However, this does not mean that Java content repositories should be implemented as a layer over relational databases. Some basic concepts of the two specifications do not match, so a relational schema for JCR which includes all the aspects of the specification will look unsuitable. Both approaches could benefit from more modularity (3) in the database world. As long as this goal is not achieved, a native implementation of JCR is probably the better of the proposed solutions.
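The item count quoted for the sub-hierarchies of the random-write test in Section 6.2 can be checked with a short calculation. The sketch below assumes that "depth" counts the levels below the root and that every node, leaves included, carries the same number of properties; HierarchySize and its method names are hypothetical.

```java
final class HierarchySize {

    /** Number of nodes in a complete tree: 1 + b + b^2 + ... + b^depth. */
    static long nodeCount(int branching, int depth) {
        long nodes = 0;
        long levelSize = 1;
        for (int level = 0; level <= depth; level++) {
            nodes += levelSize;
            levelSize *= branching;
        }
        return nodes;
    }

    /** Items are the nodes plus the properties attached to every node. */
    static long itemCount(int branching, int propertiesPerNode, int depth) {
        return nodeCount(branching, depth) * (1 + propertiesPerNode);
    }

    public static void main(String[] args) {
        // Sub-hierarchy of the random-write test: 2 sub-nodes and 2 properties per node, depth 2.
        // Nodes: 1 + 2 + 4 = 7; items: 7 * (1 + 2) = 21.
        System.out.println(itemCount(2, 2, 2)); // prints 21
    }
}
```

The same function can be used to estimate the size of the base hierarchies (5 sub-nodes and 5 properties per node) for a given number of levels, under the same assumptions.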

7 Scenario Analysis

The following diagram synthesizes the main aspects pointed out during the whole comparison process. Four use cases characterized by different features will be briefly analyzed with regard to their respective requirements and to the presented approaches.

Data Model Level

Structure
  JCR: Unstructured; Semi-structured; Structured
  RDBMS: Structured

Integrity
  JCR: Entity integrity; Domain integrity; Referential integrity; Transitive integrity in hierarchies
  RDBMS: Entity integrity; Domain integrity; Referential integrity; Tools to manage data coherency

Operations and Queries
  JCR: Selection; Equi-join operations; Full text search operation; Transitive queries on hierarchies
  RDBMS: Selection; Projection; Rename; Join operations; Domain operation; Create, read, update, delete statements

Navigation
  JCR: Navigation API; Traversal access; Direct access; Write access
  RDBMS: Not supported

Specification Level

Inheritance
  JCR: Node types inheritance; Node inheritance
  RDBMS: Not supported

Access control
  JCR: Record level
  RDBMS: Table and column level; Record level not supported

Observation
  JCR: Record level; Un-persisted event listeners; Application interaction supported
  RDBMS: Table level; Persisted triggers; Application interaction not supported

Version control
  JCR: Supported
  RDBMS: Not supported

Project Level

Schema understandability
  JCR: DataGuides or graphs; Summarize the architecture; Not impacted by many-to-many associations
  RDBMS: Entity Relationship; Represent the whole architecture; Impacted by many-to-many associations

Code complexity
  JCR: Simple for navigation; Complex for operations
  RDBMS: Complex for navigation; Simple for operations

Changeability
  JCR: More agile; Decoupled from the application
  RDBMS: More rigid; Coupled with the application

7.1 Survey

An agency wants to implement an application which is able to carry out surveys over the web. This tool should make it possible to collect data from questionnaires, to configure the types of answers, and to aggregate the survey's results in a suitable form.

Main characteristics of the application:
- All the entities can easily be identified at design time. (Structure)
- Some verification has to be made on the data. (Integrity)
- Aggregating the results implies complex operations. (Operations and Queries)
- Once in production, the application will not evolve to a great degree. (Project)

A relational database is probably the best choice for this kind of scenario. The features provided by a content repository would not really be used. Furthermore, programming the operations would only add complexity to the application.

7.2 Reservation

An event organizer wants a portal which gives customers the opportunity to buy tickets for events. The event organizer should be able to create events characterized by a name and a short description. The customer should be able to browse and search the event catalogue and to order tickets. In addition, the event organizer wants to monitor sales and adjust prices depending on the success of each event.

Main characteristics of the application:
- All the entities can easily be identified at design time. (Structure)
- Some verification has to be made on the data. (Integrity)
- Monitoring the sales can imply some operations on the dataset. (Operations and Queries)
- Browsing and searching the catalogue require traversal and direct access. (Navigation)
- As a strategic application, the application is subject to improvements. (Project)

This application has strong needs which relate to the relational database world. The clear structure linked to the management of orders and events could lead us to conclude that a relational database is the ideal candidate. However, the need for navigation and the potential extensions linked to the catalogue could benefit from the features of a content repository. A balanced approach could consist of storing the orders in a relational database and using a content repository for the event catalogue. This also fits particularly well with the fact that the catalogue will mainly be subject to read access and the ticketing service to write access. This should not be a problem, because complex interactions between the JCR and the RDBMS can be managed with the Java Transaction API. In certain contexts, hybrid decisions make it possible to benefit from both approaches and thus to have the best of both worlds.

7.3 Content management

A publisher wants an application able to manage all the content generated by its collaborators. The content will be composed of videos, photos, text or anything else. Several taxonomies should be available to organize the content. The main purpose of the publisher is to offer a coherent set of features which make it easy to retrieve resources of each type and to reuse them in different contexts or in other publications.

Main characteristics of the application:
- The publisher wants to take into account that new kinds of content could appear. (Structure)
- The main verification of the data concerns viruses. (Integrity)
- Searching requires full text indexing. (Operations and Queries)
- Taxonomies imply simple operations. (Operations and Queries)
- Exploration is needed everywhere. (Navigation)
- Future improvements could imply versioning, observation and access control. (Specification features)
- The system will continuously evolve with the enterprise. (Project)

The flexibility and the features provided by JCR are typically made for this type of scenario. Content as understood here is difficult to store in a relational database. Furthermore, complex requirements such as versioning or access control can be added during the application life cycle without too much of a problem.

7.4 Workflow

An editor wants to manage the interactions of his collaborators. The situation could be the following: the editor in chief and the board decide which subjects have to be treated in the next edition of a publication. These subjects are communicated to the workforce (journalists and photographers). Once edited, the articles are sent for proofreading. Once an article is corrected, the editor in chief is notified. He decides whether the article can be published or not. If the article will appear in the publication, it is sent to a typography service which produces a layout which includes pictures. Once the publication integrates all the articles and all the pictures, the editor in chief reads it once again and decides whether to publish it or not.

Main characteristics of the application:
- The entities are composite and difficult to design. (Structure)
- The structure mainly involves graphs. (Structure)
- Editing and exploring the process imply traversal access. (Navigation)
- Notifications imply the observation of local events. (Observation)
- Notifications imply interactions between the data model and the application. (Observation)

This kind of scenario involves semi-structured models in conjunction with good observation capabilities. While the other features proposed by JCR, such as versioning or access control, do not directly find an application, the foundations of the model will really be appreciated in this case. The workflow structure can be designed directly with nodes and items, and once instantiated the workflow will clearly benefit from the observation mechanisms proposed by JCR.

8 Conclusion

The choice of a data model or of a database is often arbitrary. Sometimes, specific technologies are imposed by an enterprise policy or simply by irrational preferences. When the time comes to choose a technology, good arguments are rarely put forward. Furthermore, the myth of a general multi-purpose database is still ingrained in some minds, and people keep looking for a magic solution which can be used in all imaginable circumstances.

Today, several infrastructure components can coexist with minimal effort. A platform such as J2EE provides tools to manage distributed resources. In this context, the choice of a data model or of a database should not be reduced to an arbitrary decision. As shown in the Scenario Analysis chapter, a pragmatic analysis gives quick results. The technology which best fits the requirements can be identified and used to the greatest effect. In some cases, hybrid strategies can also be adopted. A coherent choice can lead to significant advantages, and this question should always be discussed during the early phases of each project.

Relational databases have been used successfully for many years. However, the growing power of the user and the rigidity of the relational approach make it difficult to implement features which are now required by some applications. It is possible to push the boundaries of the model, but the constraints of time and money make it difficult to do so correctly. Some frameworks are partially effective in solving these problems. Depending on a middleware layer for features such as access control, navigation or versioning only pushes the hot potato to a higher level. This does not really solve the problem but adds complexity to the overall environment.

Java content repositories cannot replace relational databases in every situation. Currently, the features proposed by the API fit very well with the requirements encountered in content management and collaborative applications. Nevertheless, JCR enriches the debate around databases and data models in relation to two important aspects. First, JCR includes some features at the data model and specification level. Second, the specification is aware of its environment and takes into account that Java content repositories can be used in conjunction with other infrastructure components. This is not the case for a specification such as SQL. This tendency seems relatively new but will probably be consolidated during the next few years. As a precursor, Day can play an important role in this debate and gain recognition. Some challenges will arise with the growing popularity of the JCR specification. Selecting good opportunities should allow it to make its mark on the database field. This in turn will leave a footprint that extends into the world of infrastructure components.

9 Appendix JCR and design

As mentioned in the data model comparison chapter, a Java Content Repository schema is dynamic and evolves with the content. The structure appears when nodes and properties are instantiated. However, during the development process the need to establish the semantics of the repository appears. Several publications which treat semi-structured approaches propose solutions for representing these schemas (20) (21). These representations are called DataGuide (DG) or Approximate DataGuide (ADG). The less elaborate version can visually capture the organization of semi-structured databases. The JCR specification (4) uses graphs to represent examples of the structure which can be found in the content repository. DataGuides and other graph notations fit particularly well with Java Content Repositories but are not expressive enough to be used as implementation blueprints. The goal of this appendix is to summarize the possibilities offered by JCR to organize content and to enrich the notation proposed in the specification so that it can communicate the whole semantics of a repository.

9.1 Model

The most common relationship provided by the model is composition. Semantic items can be instantiated as nodes and properties. A node can be composed of sub-nodes and properties. A property can only be composed of values. Except for the root node, all nodes and properties are components. As seen, horizontal relationships can also be created between the branches of a hierarchy. A common relationship is achieved by storing one or more path values in a node property. This method has an advantage because the hierarchical property of the target can be used in queries. Another relationship consists of storing one or more UUID values in a node property. This maintains the validity of the link even if the target is moved. Either of these approaches can be appropriate depending on the context.

9.2 Convention

Semantic items which will be instantiated as nodes or properties are respectively represented by circles and boxes. The circle's label refers to the node type, the box's label to the property type. Without a label, the circle or the box means that any node or property can be found. An empty circle means that everything which is not mentioned is allowed under the semantic item (black list). A barred circle means that everything which is not mentioned is not allowed (white list). An empty box means that the property is simple. A box which contains an M means that the property can store multiple values.

Composition associations are represented by filled arrows which link two semantic items. The arrow's label refers to the relative path which links the two semantic items. Only descendant relative paths are allowed. Stars (*) and variables (<variable>) can be used to express patterns in the path. Without a label, the arrow means that a semantic item such as the one targeted can be found anywhere under the source. The arrow can end with a cardinality (1..N). Without a cardinality the meaning is N.

Horizontal associations are represented by dotted arrows. They always start from a box and finish on a circle. No labels are put on these arrows. They are only used to give implementation information. The arrow can end with a cardinality (1..N). Without a cardinality the meaning is N.

Inheritance associations between semantic items can be represented by empty arrows. They should always go from the bottom to the top. No labels are put on these arrows.

The elements which are represented in a bold style are mandatory. If specific constraints have to be declared, they can be shown as comments in the diagram.

9.3 Methodology

The semantics of a JCR repository can be designed with different approaches. If a development process is used, the semantics will be obtained by translating the application's diagrams. The approach proposed here consists of six steps which can be executed iteratively and which result in a semantic blueprint which can be implemented in a repository.

Step 1: Identifying the semantic items
  Input: Existing semantic; Requirement
  Output: Semantic items
  Activity: Identifying the concepts which relate to the requirement and which have to be localized in the repository.

Step 2: Identifying the inheritance relationships
  Input: Existing semantic; Requirement; Semantic items
  Output: Inheritance semantic
  Activity: Identifying inheritance relationships between the semantic items.

Step 3: Identifying the hierarchical relationships
  Input: Existing semantic; Requirement; Semantic items
  Output: Hierarchical semantic
  Activity: Identifying hierarchical relationships between the semantic items. Thinking in terms of composition.

Step 4: Identifying the horizontal relationships
  Input: Existing semantic; Requirement; Semantic items
  Output: Horizontal semantic
  Activity: Identifying horizontal relationships between semantic items. Identifying relationship types. Thinking in terms of association or aggregation.

Step 5: Defining cool structure artifacts
  Input: Existing semantic; Requirement; Hierarchical semantic; Horizontal semantic
  Output: Organizational semantic
  Activity: Identifying the patterns which link hierarchical semantic items.

Step 6: Carefully defining the integrity rules
  Input: Existing semantic; Requirement; Semantic items; Inheritance semantic; Hierarchical semantic; Horizontal semantic; Organizational semantic
  Output: New semantic
  Activity: Only if necessary, declaring in the semantic the level of coherence which has to be preserved at a repository level.

9.4 Application

Based on a very simple use case, this section shows how the methodology and the notation previously defined can be applied. The purpose is to deliver a blueprint which shows how data is organized and which covers all the data aspects required to build the application. The specifications of the case are as follows:

- A blog application deals with posts.
- A post always stores its creation date and should contain some information such as text, images, etc.
- A post can belong to zero or one category and can have any number of tags, from zero upwards.
- A category can have subcategories.
- From any category it should be possible to find all the posts which relate to it and to its subcategories.
- When a category is deleted, the related posts are not deleted.
- Anonymous readers can respond to posts with comments.
- For navigation, it may be useful to organize posts by years, months and dates.

The output of each step is a diagram; the comments attached to each step are the following.

Step 1: Identifying the semantic items
  Comments: Properties do not have to be localized in the repository.

Step 2: Identifying the inheritance relationships
  Comments: The requirement does not contain inheritance, but we could imagine this kind of relation.

Step 3: Identifying the hierarchical relationships
  Comments: Posts and categories are not linked with a composition relationship.

Step 4: Identifying the horizontal relationships
  Comments: To satisfy the requirement, posts are linked to categories with path values and to tags with UUID values.

Step 5: Defining cool structure artifacts
  Comments: The year, month, date pattern is part of the hierarchical association.

Step 6: Carefully defining the integrity rules
  Comments: In our case we only have to ensure that a post always has a creation date.
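The choice made in Step 4, linking posts to categories with path values, can be illustrated with a small sketch. It does not use the JCR API or its query language: Post, the category paths and the method names are hypothetical, and the query is emulated with a plain prefix test. It shows why a path reference makes it possible to find the posts of a category and of its subcategories in one pass.

```java
import java.util.ArrayList;
import java.util.List;

final class BlogPathQuerySketch {

    /** Hypothetical post record: the category is referenced by its path (may be null). */
    static final class Post {
        final String title;
        final String categoryPath;

        Post(String title, String categoryPath) {
            this.title = title;
            this.categoryPath = categoryPath;
        }
    }

    /** True if the path equals the category or lies below it in the hierarchy. */
    static boolean isInCategory(String path, String category) {
        // The "/" suffix prevents "/categories/tech" from matching "/categories/technology".
        return path != null
                && (path.equals(category) || path.startsWith(category + "/"));
    }

    /** Returns the titles of posts attached to the category or to any of its subcategories. */
    static List<String> postsUnder(List<Post> posts, String category) {
        List<String> titles = new ArrayList<>();
        for (Post post : posts) {
            if (isInCategory(post.categoryPath, category)) {
                titles.add(post.title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Post> posts = new ArrayList<>();
        posts.add(new Post("JCR basics", "/categories/tech"));
        posts.add(new Post("Jackrabbit tips", "/categories/tech/java"));
        posts.add(new Post("Holiday", "/categories/travel"));
        posts.add(new Post("Uncategorized", null)); // zero categories is allowed
        System.out.println(postsUnder(posts, "/categories/tech")); // prints [JCR basics, Jackrabbit tips]
    }
}
```

Because a post only stores the path as a value, deleting a category does not delete the posts, which matches the requirement. The drawback, as noted in Section 9.1, is that a path value is no longer valid if the target is moved, whereas a UUID value is.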

10 Appendix Going further

Only a few subjects have been mentioned in this report. This appendix presents three fields which relate to JCR and to databases in general. These fields could benefit from being studied in more depth. Furthermore, some existing products could be improved if these questions were addressed.

10.1 Queries in semi-structured models

In the JCR model, the notions of sets, relations and domains, which provide the means of expressing first order logic statements over the model, are present but currently not formally defined. It seems that at present node types are seen as relations, properties as domains, nodes as tuples and property values as attributes. The fact that these notions are well defined in relational databases brings advantages. For example, on this basis, some database engines are able to analyze queries and to optimize them with regard to the structure. In semi-structured databases, query optimization is a known issue and research is still being conducted in this area (20). It is currently not clear whether the mapping proposed by JCR could ensure more efficiency when queries are performed. Further work on this question and further improvements of JCR's query model could be a very interesting field of investigation.

10.2 Queries on transitive relationships

The model proposed by JCR stores the hierarchical path of each node. This allows queries on transitive relationships in hierarchies to be performed by using the path property. Assuming a tree structure limits the total number of paths to the number of leaves. Doing the same for horizontal relationships is a bit more problematic. To summarize, in a network structure, the cost of pre-computing all the paths is not proportional to the number of leaves but to the square of the number of nodes (11). The storage capacity required to store the transitive paths between the nodes also grows in a similar manner.
Some use cases, such as those which involve social networks, need to store this kind of relationship. Defining a standardized way to manage this could be very useful in some situations. However, it requires research into the best algorithms and solutions for this problem. Furthermore, query languages based on first order logic are limited when it comes to defining queries on transitive closures and on transitive relationships in general. It is in this area that improvements still have to be made.

10.3 Modular and configurable databases

As shown in the product comparison chapter, the relational model is able to manage hierarchical relationships efficiently. Is it therefore really necessary or intelligent to implement, from the ground up, a data model which can be constructed from another, with approximately the same results? Some reasons could lead to this conclusion. However, the basic differences between JCR and SQL cannot be ignored. For example, does it make sense to create a procedural API over a declarative query language which will be retranslated into declarative calls in the database? Even if the cost of parsing a query is small, it is one more reason not to proceed in this manner.

In reality, databases are presently used for many different purposes in many different contexts. Some applications embed databases to manage small data sets in single-client applications, while others deal with thousands of connections and scalability problems. In this context, a multipurpose monolithic database is unimaginable, even mythical. Margo Seltzer promotes a more modular and configurable approach to building databases (3). These recommendations lead developers to use database components at different levels depending on their needs. JCR and SQL are two high-level backend solutions which have possibilities but also limits. Their significant differences do not mean that they have no common denominators. More modularity in their architecture could give a better understanding of their behavior. This could also allow them to share components and to be adapted more easily to specific requirements and contexts.

11 Bibliography

1. Tsichritzis, D. C. and Lochovsky, H. Hierarchical Data-Base Management: A Survey. New York, New York : ACM,
2. Codd, E. F. A Relational Model of Data for Large Shared Data Banks. San Jose, California : ACM,
3. Seltzer, Margo. Beyond Relational Databases. ACM Queue. New York, New York : s.n.,
4. Nuescheler, David and Piegaze, Peeter. Content Repository API for Java Technology Specification. s.l. : Java Community Process, 11 May version
5. Content Repository API for Java Technology Specification. s.l. : Java Community Process, 2 July version 2.0 Public Review.
6. Mazzocchi, Stefano. Data First vs. Structure First. Stefano's Linotype. [Online] July 28,
7. Chaudhuri, Surajit. An Overview of Query Optimization in Relational Systems. Redmond, Washington : ACM,
8. Buneman, Peter. Semistructured Data. Tucson, Arizona : ACM,
9. Aho, Alfred V. and Ullman, Jeffrey D. Universality of data retrieval languages. San Antonio, Texas : ACM,
10. Database Language SQL. Information Technology. [Online] July 30, xt.
11. Li, Zhe and Ross, Kenneth A. On the cost of Transitive Closures in Relational Databases. New York, New York : Columbia University Press,
12. Tropashko, Vadim. Trees in SQL: Nested Sets and Materialized Path. DBAzine.com. [Online] April 13,
13. Bachman, Charles W. The Programmer as Navigator. Waltham, Massachusetts : ACM,
14. Distributed Transaction Processing: The XA Specification. s.l. : The Open Group for distributed transaction processing,
15. Chen, Peter Pin-Shan. The Entity-Relationship Model-Toward a Unified View of Data. Cambridge, Massachusetts : ACM,
16. Introduction to OpenUP. OpenUp. [Online] October 27,
17. Cormen, Thomas H., et al. Introduction to Algorithms, Second Edition. Cambridge, Massachusetts : The MIT Press,
18. Bates, Duncan. Embedded databases: Why not to use the relational data model. Embedded Computing Design. [Online] January 01,
19. Müller, Thomas. CRX Tar PM. dev.day.com. [Online] Day Software AG, November 11, ml.
20. Goldman, Roy and Widom, Jennifer. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. Palo Alto, California : Stanford University Press,
21. Approximate DataGuides. Palo Alto, California : Stanford University Press,
22. Nuescheler, David. David's Model: A guide for blissful content modeling. Jackrabbit Wiki. [Online] August 22,
23. Mishra, Priti and Eich, Margaret. Join Processing in Relational Databases. Dallas, Texas : ACM, 1992.