Document Management in e-freight based on Cloud Storage Architecture

Bill Karakostas
INLECOM
15 June 2009

This is one of a series of architectural documents that describe a Cloud approach to the collaborative operation and management of transportation chains. This document describes a Cloud architecture for collaborative management of transportation documents.

e-freight consortium, 15June10

Main Business Assumptions and Rationale

Every distributed software architecture needs, at a minimum, to address the following three viewpoints:

- What computational processes are performed, how they are allocated to components/subsystems, and how they are managed (execution, control, coordination).
- What types of data are handled and how they are managed (storage, replication, updates).
- How communication and synchronisation between processes/subsystems are achieved (i.e. the communication architecture: shared memory, shared queues, 'shared nothing', etc.).

In this architectural document we are concerned with the second viewpoint, i.e. what the types of data are and how they will be managed in e-freight's Collaboration Cloud architecture.

In previous documents we explained that the Cloud architecture we propose is for (inter-organisational) collaboration in managing transportation chains. This means that the emphasis is on sharing data and processes across transportation network participants, using the inherent properties and advantages of a Cloud solution, such as no central point of failure, elastic availability of computing resources, and less dependency on specific physical hardware and software solutions (i.e. virtualisation of IT resources through the service concept).

In collaborative processes, documents are mechanisms for capturing information about the status of the process being carried out. Some of this information is used internally by the participants to carry out and coordinate activities; other information needs to be captured for auditing or compliance purposes. In this document we explain how such information can be defined, shared and managed using lightweight Cloud (Web) technologies, with little or no reliance on specific data management technologies such as RDBMS and other traditional middleware systems.
Main Assumptions

Communication between freight partners (and also with administrations) is carried out primarily via documents. Companies in logistics (and in every other business area, for that matter) communicate using documents. A business document is a natural and intuitive concept, and it is a good idea not to dilute it (too much) with IT concepts.

Since the advent of electronic communications, many organisations exchange electronic documents in the form of emails, EDI messages and so on. More recently, XML has been used to structure the content of electronic communications. We argue that such technologies are primarily for the benefit of the IT infrastructure, not the business processes. For example, a real person can recognise an invoice document because of structural and contextual information contained in the physical document. A person does not need the XML <invoice>..</invoice> tags to understand where the invoice begins and ends. If the invoice has a contact number for inquiries, the person who reads the invoice can easily detect that. If that number is missing, the invoice can still remain perfectly legible and valid for business purposes. More importantly, all or almost all the information needed to understand and process the invoice is contained in the invoice itself. The information about the sender of the invoice, for example, is captured on the actual invoice; it does not need to be looked up in another document.

Contrast this with organisations that have IT systems to handle invoices:

- If the invoice is handled electronically, it has to be in a special format, e.g. XML or EDI, so that it can be processed by the systems that handle invoices.
- Invoices (or the data that make up an invoice) will be stored in some information system, typically an RDBMS. Information contained in the invoice will be broken down (normalised) and stored in different tables. The schema (the organisation of such tables) usually differs across companies, because their systems are designed differently. So the way an invoice is stored will differ across companies: every data design is unique.
- Even if the invoice is stored as an XML file instead of in an RDBMS, transmitting it to a different company will require some form of processing (reformatting/mapping to a different XML schema), unless the two companies have agreed on, or happen to use, the same XML schema for invoices.

Main Concepts

The proposed approach uses a small number of Web/Cloud concepts that simplify the collaborative management of transportation-related documents, and thereby the coordination of transportation processes. The approach is based on lightweight technologies that are designed to scale up to meet elastic demand in transportation processes, which makes them suitable for a Cloud implementation.
These technologies originate in research into web storage conducted by Google, later taken forward by projects such as CouchDB. The main concepts of the Cloud-based document management approach are described below.

Database

A database is a collection of documents. For documents to be stored in the same database they need to have some affinity with each other. However, such affinity should be a business concern, not one imposed by IT. So it is up to the user how to group the documents logically into databases in the way that makes business sense. Keeping in mind that these databases in e-freight will be shared and replicated, they should be used for storing documents that will be shared with other participants in the transportation chain. Thus a typical database of a company involved in e-freight will contain transport instructions, waybills, delivery documents, invoices and so on, related to the activities of the company as part of transportation chain(s).

Example: A Freight Forwarder (Consignor) maintains a database of Consignments. This database contains documents about consignments, both individual consignments and consolidated consignments. The Freight Forwarder shares this database with its customers, the Shippers, or their Agents, and with the Consignees. The database owner allows other users (e.g. shippers, consignees, ...) access to their own consignment documents but not to the documents about consolidated consignments.

Properties of Databases:

Databases are lightweight data stores that can be deployed locally or on the Cloud and can be easily replicated and scaled up. Databases are created and managed using similarly lightweight, standard HTTP-based commands (PUT, GET, etc.). Database authors can be companies participating in a Transportation Chain (e.g. a Shipper, Freight Forwarder, Carrier, etc.). The following principles apply:

- There can be multiple databases.
- Each Database is identified by a unique name.
- A Database has no structure/design. It is a managed collection of documents (see below).
- A Database can be replicated or copied by other parties, subject to authorisation rules.
- A Database can be managed using a simple Web interface and HTTP-based commands. Documents can be added to, edited in and deleted from the database.
- A Database can be stored locally in one of the user's systems, on the Cloud, or in any combination (e.g. one copy on the systems of the database author with replicas on the Cloud).
- Databases are managed by Cloud Document Management Services (DMS).

Document

A Document is the electronic equivalent of a real-world logistics document such as a delivery notice or a waybill.

A document is a collection (possibly nested) of statements about the transportation chain and its context that will be, or have been, true at some point in time, or that can invoke some behaviour by a computational entity. Such statements are defined inside the document as { key : value } pairs. Keys and values are JSON data types. (JSON is a lightweight data representation language for the Web.) Because nested key/value pairs are allowed, documents can be nested. Documents can be stored, indexed and retrieved efficiently in the database structures described above.
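As a rough sketch of the HTTP-based commands mentioned above, the functions below build (but do not send) the requests for creating a database and storing a JSON document in it. The sketch assumes a CouchDB-style interface in which `PUT /{database}` creates a database and `PUT /{database}/{document id}` stores a document. The base address `http://dms.example.org`, the function names and the lower-casing of names are assumptions made for illustration only.

```javascript
// Hypothetical DMS base address; not taken from this document.
var BASE_URL = "http://dms.example.org";

// Describe the request that creates a database (CouchDB style: PUT /{db}).
// Lower-casing the database name is an assumption of this sketch.
function createDatabaseRequest(dbName) {
  return {
    method: "PUT",
    url: BASE_URL + "/" + encodeURIComponent(dbName.toLowerCase())
  };
}

// Describe the request that stores a document under its UUID
// (CouchDB style: PUT /{db}/{docid} with a JSON body).
function storeDocumentRequest(dbName, uuid, doc) {
  return {
    method: "PUT",
    url: createDatabaseRequest(dbName).url + "/" +
      encodeURIComponent(uuid.toLowerCase()),
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(doc)
  };
}

// Example: a nested key/value document stored in a Consignments database.
var request = storeDocumentRequest(
  "Consignments",
  "6E09886B-DC6E-439F-82D1-7C83746352B1",
  { "document type": "Shipping Instruction", "Shipment": { "ID": "CONS-0001" } }
);
console.log(request.method, request.url);
```

The request descriptions can be handed to any HTTP client; nothing in the sketch depends on a particular library.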
Documents can be processed easily using JavaScript because of the close relationship between JSON and the JavaScript language. Because of the ubiquity of the JSON format, documents can be processed in almost any programming language and environment.

Because a document is an electronic one (although closely based on a real one), a document also has:

- a UUID (globally unique identifier) that distinguishes it from all other documents
- an author, who is identified by another unique identifier (UUID) used for uniquely identifying organisations in a transportation chain
- version information
- optionally, a reader access list stating who can read and access the document.

Documents can reference other documents via their UUIDs.

Documents can be edited. When a document is edited and saved back to the database, a new version of the document is created, but the old version is also kept. Documents can be edited in different places by different authors at the same time. All updates to the document are made consistent by the DMS that handles the database, using version control techniques. It is always possible to find the latest version of the document by using the document's revision id.

Once a document is entered in the database, it can be retrieved using its UUID, which together with the database name constitutes a unique identifier (URI). For example, a Shipping Instruction document with UUID 6E09886B-DC6E-439F-82D1-7C83746352B1 stored in the database called CompanyXShipments on the e-freight 'Cloud' can be uniquely referenced using the URI

    http://efreight.org/companyxshipments/6e09886b-dc6e-439f-82d1-7c83746352b

Once retrieved, a document can be edited (e.g. by being converted back to JSON objects and processed using JavaScript) and then stored back to the database as a new version, using the appropriate HTTP (PUT) command.

View

A View is a computational mechanism that acts upon documents or document collections to create subsets of such documents or collections. Unlike documents, the outputs of views are not stored in the database.
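To make the idea of a view as a computation rather than stored data more concrete, the sketch below applies a CouchDB-style map function to an in-memory list of documents. The helper `runView`, the sample documents and their keys (`id`, `type`, `carrier`) are hypothetical. In this sketch `emit` is passed as a second argument for simplicity, whereas a CouchDB-style DMS would typically provide it to the map function itself.

```javascript
// Apply a map function to every document and collect the emitted rows.
// runView is a hypothetical helper for illustration, not a DMS API.
function runView(docs, mapFn) {
  var rows = [];
  // The map function reports each result by calling emit(key, value).
  function emit(key, value) {
    rows.push({ key: key, value: value });
  }
  docs.forEach(function (doc) {
    mapFn(doc, emit);
  });
  // Return rows ordered by key (simple comparison, enough for strings).
  rows.sort(function (a, b) {
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  });
  return rows;
}

// Hypothetical database content.
var database = [
  { id: "doc-1", type: "Waybill", carrier: "Carrier B" },
  { id: "doc-2", type: "Invoice", carrier: "Carrier A" },
  { id: "doc-3", type: "Waybill", carrier: "Carrier A" }
];

// A view that selects waybills and orders them by carrier.
function waybillsByCarrier(doc, emit) {
  if (doc.type === "Waybill") {
    emit(doc.carrier, doc.id);
  }
}

var rows = runView(database, waybillsByCarrier);
console.log(rows);
```

The rows are computed on demand and the `database` array is left unchanged, which mirrors the point that view outputs are not stored as documents.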
Views are used to filter documents in the database to find those useful for a particular task. Views can also be used to extract data from documents and present it in a specific order. Finally, views can be used to perform calculations on the data in the documents.

Example

The view below scans all documents in a database and, for each document that is a Shipping Instruction, emits the user name of its author.

    function (doc) {
      // Select only Shipping Instruction documents
      if (doc.type == "Shipping Instruction") {
        // Key: the author's user name; value: the whole document
        emit(doc.username, doc);
      }
    }

Dissecting a business document

Because of the underlying JSON-based representation, documents have the advantage of being conveniently packaged for storage, rather than split across numerous tables and rows as would be the case if they were stored using conventional database systems. Using key/value-based databases, such documents can be stored and accessed efficiently on the Cloud.

The example document below is a Bill of Lading type of document, adapted from a UBL 2.0 specification in XML and translated to JSON. (In the original layout, key names are shown in red and values in black.) For space reasons only part of the document is shown.

    {
      "document id" : "6E09886B-DC6E-439F-82D1-7C83746352B1",
      "document type" : "Bill of Lading",
      "alternative terms" : [ "Master Bill", "House Bill of lading" ],
      "IssueDate" : "2005-06-24",
      "IssueTime" : "14:20:00.0Z",
      "ConsignorParty" : {
        "OrganisationCode" : "7D09886B-DB6E-539F-82D1-6D83746352C1",
        "PartyName" : "Consortia",
        "PostalAddress" : {
          "StreetName" : "Boston Road",
          "BuildingName" : "Suite M-102",
          "BuildingNumber" : "630",
          "CityName" : "Billerica",
          "PostalZone" : "01821",
          "Country" : "US"
        },
        "Contact" : {
          "Name" : "Mrs Bouquet",
          "Telephone" : "+1 158 1233714",
          "Telefax" : "+1 158 1233856",
          "ElectronicMail" : "bouquet@fpconsortial.com"
        }
      },
      "FreightForwarderParty" : { ... },
      "Shipment" : {
        "ID" : "CONS-0001",
        "GrossWeight" : { "unitcode" : "KGM", "value" : 130 },
        "TariffDescription" : "Beeswax, other insect waxes and spermaceti",
        "TariffCode" : "15219000",
        ...
      }
    }
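The following sketch illustrates what "conveniently packaged" means in practice: an abridged copy of the Bill of Lading is serialised to a single JSON string, restored, and read with a small path helper. The helper `getPath` is hypothetical, and the empty `FreightForwarderParty` object is an assumption standing in for the part of the document that is not shown.

```javascript
// Abridged Bill of Lading, using values from the example document.
var billOfLading = {
  "document id": "6E09886B-DC6E-439F-82D1-7C83746352B1",
  "document type": "Bill of Lading",
  "ConsignorParty": {
    "PartyName": "Consortia",
    "Contact": { "Name": "Mrs Bouquet", "Telephone": "+1 158 1233714" }
  },
  // Assumed empty until the Freight Forwarder adds its details.
  "FreightForwarderParty": {},
  "Shipment": {
    "ID": "CONS-0001",
    "GrossWeight": { "unitcode": "KGM", "value": 130 }
  }
};

// Follow a list of keys into a nested document.
// Returns undefined as soon as a key is missing, so absent sections
// do not cause errors in a schema-less document.
function getPath(doc, path) {
  var current = doc;
  for (var i = 0; i < path.length; i++) {
    if (current === null || typeof current !== "object" || !(path[i] in current)) {
      return undefined;
    }
    current = current[path[i]];
  }
  return current;
}

// The whole document travels as one string and comes back intact.
var stored = JSON.stringify(billOfLading);
var restored = JSON.parse(stored);

console.log(getPath(restored, ["ConsignorParty", "Contact", "Telephone"]));
console.log(getPath(restored, ["FreightForwarderParty", "PartyName"]));
```

Reading a missing key yields `undefined` rather than an error, which suits documents that different parties fill in at different times.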

The above document can be edited collaboratively. The Consignor party can enter its own information into the document. At the same time, or at any other time, the Freight Forwarder can add its own details in the relevant part of the document, and so on for the rest of the partners that share this document. Using appropriate mechanisms (called 'design documents') it is possible to prohibit different parties from accessing or editing different parts of a document. For example, you can prohibit the Freight Forwarder from editing the part of the document that describes the Consignor.

Because there is no schema controlling the structure of this document, there are no restrictions on the order in which descriptions are entered into the document, nor on the contents and structure of each section. In other words, the document complies with no particular schema or design. Authors can impose their own company's document schema if they like, some other standard schema (e.g. UBL), or no schema at all. This gives significantly more flexibility in arranging the documents dynamically to suit the requirements of the participants of a particular transportation chain as it is established and operated.

Advanced Document Processing

Creating and storing documents is only part of the requirements for collaborative transportation chain management. Sometimes documents must reconcile differences in the formats and standards used by different participants. Changes to documents must be detected and notified to interested parties in the transportation chain. Finally, other processing must be carried out to support the operation of the transportation chain, such as compliance reporting. These advanced features are explained below.
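As an illustration of the design-document mechanism mentioned above for restricting edits, the sketch below follows the shape of a CouchDB-style `validate_doc_update` function, which receives the new and the old version of a document together with the editing user's context and rejects an update by throwing an object with a `forbidden` message. The role name `freight_forwarder` and the wrapper `isEditAllowed` are assumptions made for this example.

```javascript
// Reject edits by a Freight Forwarder that change the Consignor section.
// Assumption: the editor's roles are available in userCtx.roles.
function validateDocUpdate(newDoc, oldDoc, userCtx) {
  // A brand new document has no earlier version to protect.
  if (!oldDoc) {
    return;
  }
  var isForwarder = userCtx.roles.indexOf("freight_forwarder") !== -1;
  // Compare the serialised Consignor section before and after the edit.
  // Note: this simple comparison is sensitive to key order.
  var before = JSON.stringify(oldDoc.ConsignorParty);
  var after = JSON.stringify(newDoc.ConsignorParty);
  if (isForwarder && before !== after) {
    throw { forbidden: "Freight Forwarder may not edit ConsignorParty" };
  }
}

// Hypothetical wrapper that turns the thrown rejection into a boolean.
function isEditAllowed(newDoc, oldDoc, userCtx) {
  try {
    validateDocUpdate(newDoc, oldDoc, userCtx);
    return true;
  } catch (rejection) {
    return false;
  }
}

var current = { ConsignorParty: { PartyName: "Consortia" }, FreightForwarderParty: {} };
var addsOwnDetails = {
  ConsignorParty: { PartyName: "Consortia" },
  FreightForwarderParty: { PartyName: "Forwarder X" }
};
var changesConsignor = { ConsignorParty: { PartyName: "Other" }, FreightForwarderParty: {} };
var forwarder = { name: "ff-user", roles: ["freight_forwarder"] };

console.log(isEditAllowed(addsOwnDetails, current, forwarder));
console.log(isEditAllowed(changesConsignor, current, forwarder));
```

The check only guards editing. Restricting who may read a section would need a separate mechanism, such as the reader access list described earlier.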
Change Notifications

Change notifications are important mechanisms for making all collaborators aware of changes to the shared documents or to their statuses. There are several ways to get notifications about changes in a database:

- Polling is a technique for applications (e.g. web browsers) to query the database for changes at regular intervals.
- Long polling avoids frequent requests to the database, but requires an open connection to the database to be established, over which notifications are sent when changes occur.
- Continuous changes is a technique for client programs to receive change notifications over a single HTTP connection.

Filtering is used with the above change notification techniques to receive notifications only for documents that meet certain criteria. This is important because not everyone is interested in all changes that occur in the documents. For example, the Shipper may be interested only in notifications that the document's delivery status has been updated to 'delivered'.
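The filtering step can be sketched as a CouchDB-style filter function, which receives a changed document and the request and returns true if the change should be passed on to the subscriber. The key name `DeliveryStatus`, the sample changes and the helper `filterChanges`, which stands in for the feed delivered by the DMS, are assumptions for illustration.

```javascript
// Filter function: pass on only documents whose status is 'delivered'.
// The key name DeliveryStatus is an assumption of this sketch.
function deliveredOnly(doc, req) {
  return doc.DeliveryStatus === "delivered";
}

// Hypothetical stand-in for a changes feed: apply the filter to each
// change before it reaches the subscriber.
function filterChanges(changes, filterFn) {
  return changes.filter(function (change) {
    return filterFn(change.doc, {});
  });
}

// Sample changes as a Shipper might receive them from polling.
var changes = [
  { id: "CONS-0001", doc: { DeliveryStatus: "in transit" } },
  { id: "CONS-0002", doc: { DeliveryStatus: "delivered" } },
  { id: "CONS-0003", doc: { GrossWeight: 130 } }
];

var notifications = filterChanges(changes, deliveredOnly);
console.log(notifications.map(function (change) { return change.id; }));
```

The same filter can be combined with polling, long polling or a continuous feed, since it only decides which changes are reported, not how they are transported.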

Dealing with different schemas and formats

Using the approach described below, different collaborators in a transportation chain can use different standards to define various properties of documents, while a single document allows standards to be mixed and matched.

By default, the standards and conventions employed by the document's author company are assumed when collaborating on a business document. So, for example, if the author is using the UBL 2.0 standard, all attributes ('keys') and values in the document are assumed to comply with that standard. If a document contributor wants to use their own standard, however, they need to qualify the type of the keys using a namespace/URN description, as in the example below.

The property description below states that the document uses the term TotalGrossWeight according to the BRAD GS1 specification.

    { "TotalGrossWeight" : { "type" : "urn:eanu.ucc:2", "value" : 130 } }

The more complicated example below states that the net weight of the shipment is 110 KGR, using the standard abbreviation code KGR to mean kilograms.

    { "NetWeight" : { "value" : { "type" : "urn:un:unece:uncefact:codelist:specification:66411#kgr", 110 } } }

The example below shows how different parties can use different standards to add descriptions inside the same document. Thus, for example, the Consignor can use the UBL 2.0 standard to define the net weight of the consignment, while the Freight Forwarder can use BRAD GS1 to state the gross weight of the consignment.

    {
      "document id" : "6E09886B-DC6E-439F-82D1-7C83746352B1",
      "type" : "shipment",
      ...
      "NetWeight" : { "value" : { "type" : "urn:un:unece:uncefact:codelist:specification:66411#kgr", 110 } },
      "TotalGrossWeight" : { "type" : "urn:eanu.ucc:2", "value" : 130 },
      ...
    }

If a third party wants the gross weight expressed in UBL 2.0 format, then a view can be used to convert the BRAD GS1 data to UBL-compliant data.

This, in summary, is how we avoid semantic ambiguity in a shared document.
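The defaulting rule can be sketched as a small function that returns the standard governing a property: the property's own type qualifier if it has one, otherwise the standard assumed for the document's author. The identifier `urn:example:ubl-2.0` is a placeholder for the author's default and is not an official URN. The function names and the `{ type, value }` property shape are likewise assumptions of this sketch.

```javascript
// Placeholder for the author's default standard (not an official URN).
var AUTHOR_DEFAULT = "urn:example:ubl-2.0";

// Return the standard that governs a property: its own type qualifier
// when present, otherwise the default assumed for the document's author.
function effectiveType(property, documentDefault) {
  if (property !== null && typeof property === "object" &&
      typeof property.type === "string") {
    return property.type;
  }
  return documentDefault;
}

// List each top-level property of a document with its governing standard.
function describeStandards(doc, documentDefault) {
  var result = {};
  Object.keys(doc).forEach(function (key) {
    result[key] = effectiveType(doc[key], documentDefault);
  });
  return result;
}

// A shipment with one qualified and one unqualified property.
var shipment = {
  "TotalGrossWeight": { "type": "urn:eanu.ucc:2", "value": 130 },
  "GrossWeight": { "unitcode": "KGM", "value": 130 }
};

console.log(describeStandards(shipment, AUTHOR_DEFAULT));
```

A conversion view could use the same lookup to decide which properties need translating before they are presented to a party that expects a different standard.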
- If a property of the document lacks any type qualifier, we assume that the types used by the author of the document apply.
- If the property has a type qualification (e.g. a URN or URI), we know that the type of the property is defined by that URI, not by the global type/namespace.
- By using views we can convert between different data types. For example, if the weights of individual consignments are in pounds and we want the weight of the total consignment in kilos, we can define a view that converts the weights of the consignments to kilos and then sums them up.

Checking the document for omissions and computing derived values

We can scan the document for the existence of key/value pairs to check for missing or even incorrect properties. We can scan, for example, for the existence of key/value pairs that define the gross weight of the cargo. Ontologies/thesauruses can be employed to understand the meaning of the keys used in the document.

With ontologies we can also infer properties of the document and compute new values. Using logistics ontologies it is possible to infer and automatically compute data from a document. For example, the following view can compute the hazardous risk indicator for the shipment.

    If TariffDescription = Peroxide
    Then TariffCode = 15219000 And HazardousRiskIndicator = true

Integration with Other systems

Obviously, we do not expect everyone in the freight business to dump their databases and other infrastructure straight away and start using a lightweight document-based Web DMS like the one proposed here. In fact, many line-of-business systems use DBMS and similar middleware that are highly optimised and do a perfectly good job. However, IT that is optimised for one company is not necessarily optimal for a transportation chain. Fortunately, the approach described here is designed to interoperate with existing IT systems using universal (HTTP-based) protocols.

It is fairly straightforward to extract relational data from a DBMS and save it to the Web databases described here, using a combination of SQL and fairly lightweight processing to transform rowsets into JSON structures. Updating a relational database from JSON structures is also straightforward. These operations are performed individually by the different transport network participants, without having to share their data or processing with others. The responsibility for synchronising the relational DBMS with the online database also lies with the user.
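The rowset-to-JSON transformation described above can be sketched as follows. The rows are assumed to have already been read from the relational DBMS by a SQL query that joins shipment and party tables. All column names (`shipment_id`, `gross_weight`, `weight_unit`, `consignor_name`, `consignor_city`) and the function names are hypothetical.

```javascript
// Turn one flat row into a nested document.
// Column names are hypothetical; a real mapping depends on the schema.
function rowToDocument(row) {
  return {
    "document type": "Shipment",
    "ConsignorParty": {
      "PartyName": row.consignor_name,
      "PostalAddress": { "CityName": row.consignor_city }
    },
    "Shipment": {
      "ID": row.shipment_id,
      "GrossWeight": { "unitcode": row.weight_unit, "value": row.gross_weight }
    }
  };
}

// Convert a whole rowset into JSON strings ready to be sent to the DMS.
function rowsetToJson(rows) {
  return rows.map(function (row) {
    return JSON.stringify(rowToDocument(row));
  });
}

// Example rowset as it might come back from a SQL query.
var rowset = [
  {
    shipment_id: "CONS-0001",
    gross_weight: 130,
    weight_unit: "KGM",
    consignor_name: "Consortia",
    consignor_city: "Billerica"
  }
];

var jsonDocuments = rowsetToJson(rowset);
console.log(jsonDocuments[0]);
```

The reverse direction follows the same pattern: a document is parsed and its nested values are written back into the columns of the relevant tables.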
It is obviously in everyone's interest that the internal systems of the participants synchronise as often as possible with the shared online databases. The notification mechanisms described above make it fairly simple to receive updates from the online database. These can in turn be used to update data in the relational DBMS.

A better option is for users to adopt the collaborative database internally and use it for their internal systems. The benefit of this approach is the reduced effort of keeping two separate systems in sync. An additional benefit is that Web-based applications that manipulate the database can be written and integrated with the database more easily.

Users retain full control over which parts of their data stay internal and which can be shared with the other partners. This can be achieved using the authorisation and authentication mechanisms described earlier.

Summary and benefits of the proposed approach

Intensive research by companies like Google into the online sharing and distribution of large volumes of data has resulted in many useful techniques. E-freight is about online data sharing, distribution and replication, covering all the data that are created during the planning and execution of transport chains. Internal systems such as RDBMS might be effective and efficient for handling internal enterprise data, but are not always as effective for sharing Web-based data.

The approach described here has many benefits, including:

- There is no need to enforce or agree a schema across all partners; for example, new document types with new meanings can be safely added alongside the old. Although a global document schema is a worthwhile goal, experience has shown that it can be very difficult to arrive at one for e-freight or for any type of e-business.
- It is efficient to transmit fragments of a document rather than the whole document when there are communication constraints such as bandwidth and cost, for example between a ship with a slow communication link and a shore system.
- If you store a document in a single location to be accessed by anyone who needs it, that location becomes a single point of failure. It is better to replicate the document on Cloud storage, so that whoever needs it is guaranteed availability and can easily access it from some accessible storage node.
- Moreover, no special systems or interfaces should be needed to access the document, apart from a browser with HTTP/HTTPS connectivity.