Epimorphics Linked Data Publishing Platform

Epimorphics Services for G-Cloud
Version 1.2, 15th December 2014
Authors, Contributors, Review: Andy Seaborne, Martin Merry, Dave Reynolds
Epimorphics Ltd, 2013

1 Overview

The Epimorphics Linked Data Publishing Platform is a resilient, scalable, cloud-based solution for publishing linked data. It is widely used for publishing linked data on data.gov.uk, including data at environment.data.gov.uk, location.data.gov.uk, landregistry.data.gov.uk, and many others.

We offer the platform as a fully hosted and managed service for publishing linked data; in addition we can install the platform on a client's own infrastructure. The prices quoted in this document assume we are providing a hosted service on top of Amazon Web Services. An instance of the platform runs on a cluster of dedicated machines for each client.

The platform includes:
- A Linked Data API engine, providing access to the data in a number of developer-friendly formats as well as human-readable web pages
- Customisable text search
- A triple store for storing data as RDF
- A fully SPARQL 1.1-compliant endpoint
- A scale-out, fault-tolerant runtime platform
- An upload manager, to enable clients to load their own data

Optionally we can provide additional upload mechanisms which will integrate with clients' existing workflows to support "business as usual" publication of linked data.

The platform is customisable and can also be used to host applications running on top of the data. We offer consultancy and application development services to support the development of such applications; see our G-Cloud services Linked Data Modelling and Consultancy and Linked Data Application Development for further details. We also offer training courses for people wishing to develop their own skills in linked data publishing; see our G-Cloud service Linked Data Training.

In addition to the full platform we also offer an entry-level system for people wishing to start linked data publication. The entry-level platform runs on a single dedicated machine, so it is neither fault-tolerant nor scalable, and will need to be taken out of service during scheduled maintenance. It provides only limited management information.

Our hosting service includes support during UK business hours.

2 Platform Details

The Epimorphics Linked Data Platform is used by the Environment Agency and Land Registry, as well as by commercial customers, for linked data publication. It consists of:
- A Linked Data API engine, provided by ELDA, Epimorphics' implementation of the Linked Data API (LDA)
- Text search, provided by Apache Solr 4
- A fully compliant SPARQL 1.1 endpoint, provided by Apache Jena ARQ
- A scale-out, fault-tolerant runtime platform, hosted in Amazon Web Services
- An update controller for managing coordinated updates to replicated services

The platform architecture has 3 main tiers: load balancing and routing, application services, and storage.

[Figure: Platform Architecture]

Linked Data API Engine

The Linked Data API is a specification commissioned by the Cabinet Office and co-developed by Epimorphics, to provide web developer-friendly access to linked data (http://code.google.com/p/linked-data-api/wiki/specification). It enables developers to consume linked data in a variety of formats without having to learn the details of SPARQL and RDF. Our platform uses ELDA, our own widely used open source implementation of the Linked Data API. ELDA can also use text search as an additional facility when defining web-developer APIs to access the data. ELDA optionally uses SPARQL 1.1 (particularly sub-queries) in order to improve responsiveness.
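As an illustration of the developer-friendly access described above, the following minimal Python sketch builds a request URL for a Linked Data API list endpoint. The format suffix (here .json) and the paging parameters _page and _pageSize come from the Linked Data API specification; the base URL, the endpoint path and the "district" filter are hypothetical and will differ in any real deployment.

```python
from urllib.parse import urlencode


def lda_url(base, path, fmt="json", page=0, page_size=10, **filters):
    """Build a URL for a Linked Data API list endpoint.

    base and path are deployment-specific (hypothetical in this sketch).
    fmt is the format suffix, e.g. "json", "xml", "ttl" or "html".
    filters are simple property=value filters on the listed items.
    """
    params = {"_page": page, "_pageSize": page_size}
    params.update(filters)
    return f"{base.rstrip('/')}/{path.strip('/')}.{fmt}?{urlencode(params)}"


# Hypothetical endpoint and filter name, for illustration only.
example = lda_url(
    "http://example.org/api",
    "doc/bathing-water",
    page=2,
    page_size=5,
    district="Anytown",
)
print(example)
```

Requesting the same path with a different suffix (for example .ttl or .html) would ask for the same list in another format, which is how a single API can serve both developers and human readers.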

Text search

The text search indexing is provided by Apache Solr. It can be accessed via the Linked Data API or directly within SPARQL queries. The indexed data model is based on the conceptual entities within the data, rather than raw indexing of triples.

SPARQL 1.1 Engine

Our platform is based on Apache Jena, including TDB and Fuseki. This includes the ARQ query engine, which passes the complete SPARQL 1.1 test suite for query, update and protocol. In addition, the engine is capable of combining free text search with SPARQL queries.

Runtime Platform

The runtime platform can be deployed within a number of different cloud service providers, as well as on a client's own infrastructure. In this document we assume that the deployment will be within AWS. The platform achieves scalability and fault-tolerance by running a number of identical replicas across different AWS availability zones. Data is kept within the EU for reasons of data protection jurisdiction.

Each replica combines application services with a local copy of the SPARQL database; Solr text indexing is replicated separately. An Amazon load balancer tracks active nodes and routes traffic based on current load and availability of service machines. The number of replicas is adjustable to meet the expected load on the system and the desired responsiveness within the available budget.

The ELDA and SPARQL services reside on the same machine because the ELDA implementation uses the triple store for all its data. The text search may have different scalability requirements and is scaled independently of the triple store.

The platform logs all incoming requests, including originating IP address, enabling clients to understand and mine the log information to determine usage patterns as desired.
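The combination of free text search and SPARQL mentioned under SPARQL 1.1 Engine could look roughly like the sketch below. It assumes that the text index is exposed through the Apache Jena text-query property function (text:query); the document does not state which mechanism the platform uses, so treat this as one plausible form. The endpoint URL and the search term are hypothetical. The request is built according to the SPARQL 1.1 Protocol but is not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical public endpoint; the real address is deployment-specific.
ENDPOINT = "http://example.org/sparql"

# Assumption: the Jena text-query extension is enabled on the endpoint.
QUERY = """
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?item ?label
WHERE {
  ?item text:query (rdfs:label "river") ;
        rdfs:label ?label .
}
LIMIT 10
"""


def build_query_request(endpoint, query):
    """Build (but do not send) a SPARQL 1.1 Protocol query via HTTP GET."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the SPARQL 1.1 JSON result set format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_query_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url[:60] + "...")
```

Passing the request to urllib.request.urlopen would execute it against a live endpoint; that step is left out here so the sketch runs without network access.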

[Figure: Deployment View]

Update Controller

Changes to the published data are performed by a secured controller. The controller is responsible for determining the necessary changes to the replicated triple store and replicated text index. The controller can be used both through a user interface and by scripted processes. The controller also provides SPARQL Update for management of the triple stores, such as corrections to published data. The public interface exposed to the data consumer does not include the SPARQL Update service, which is only available via the secured controller.

Entry level platform

For the entry-level system the runtime platform is limited to a single dedicated machine (there is no replication and no load balancing). There is no direct access to the logs of incoming requests. Apart from this, the details are the same as those described under Runtime Platform above.
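To make the Update Controller's SPARQL Update facility more concrete, here is a minimal sketch of how a scripted process might prepare a correction to published data. The request shape (HTTP POST with content type application/sparql-update) follows the SPARQL 1.1 Protocol. The controller URL, the resource URI and the label values are hypothetical, and authentication is deliberately omitted because the document does not describe how the controller is secured.

```python
from urllib.request import Request

# Hypothetical address of the secured controller's update service.
CONTROLLER_UPDATE_URL = "https://controller.example.org/update"

# A correction expressed as SPARQL 1.1 Update: replace a misspelt label.
UPDATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
DELETE { <http://example.org/id/1> rdfs:label "Exampel" }
INSERT { <http://example.org/id/1> rdfs:label "Example" }
WHERE  { <http://example.org/id/1> rdfs:label "Exampel" }
"""


def build_update_request(url, update):
    """Build (but do not send) a SPARQL 1.1 Protocol update via direct POST."""
    return Request(
        url,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


update_request = build_update_request(CONTROLLER_UPDATE_URL, UPDATE)
print(update_request.get_method(), update_request.full_url)
```

Because the public interface does not expose SPARQL Update, a request like this would only be accepted by the secured controller, never by the consumer-facing endpoint.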

3 Service Details

As a hosted service, our platform is accredited to store and process IL0 information only.

All data loaded onto the platform is backed up at the time the data is loaded, so the backup is always an accurate reflection of the data in the system. The replicated nature of the platform means that a hardware failure will not cause data from the running system to be lost. In the event of a catastrophic infrastructure failure which takes out all the replicated instances, the data will be restored from backup as quickly as possible.

On-boarding: if no customisation of the web interfaces etc. is required, then we will provide the client with access to the upload manager, so that they are able to have data loaded onto and published by the platform within 5 business days after contracts have been signed. We can provide expedited on-boarding at extra cost if desired.

Off-boarding: no user data is collected by the system; the only data stored on the publishing platform is data supplied by the client. On termination of the contract all client data will be securely deleted. During the life of the contract clients can request access to a copy of all the data stored on the system.

As the system is fully replicated, routine maintenance can be carried out without taking the system off-line; there is no need for scheduled maintenance windows when the system is out of service. We aim for the availability of the system to be 100%. Details of our support services are given in the next section.

We do not offer a trial service, though we do offer an entry-level offering for fewer than 10M triples; see our separate pricing document for details.

4 Support

Our hosting support for the full system includes all regular maintenance, monitoring and backups. We will provide reports to the clients on the usage of the system; the precise details of the data reported will be agreed with the client during the setup phase.

We also provide an incident reporting service. The basic service is available during normal business hours (09.00 to 17.30, Mondays to Fridays, excluding public holidays). We provide an email address for incident reporting and will respond to any notification within 4 hours. If an incident results in loss of service we will restore the service within 1 business day; in all other cases we will use reasonable efforts to resolve the incident as quickly as possible.

Additional support options are available at extra cost, including telephone support and faster response times. For such additional support services we offer service credits in the event of failing to meet targets.

We note that the replicated nature of our architecture is such that we do not need to take the system down in order to perform regular maintenance and system updates. The production system we run for the Environment Agency went live in April 2012 and since then has been available for 100% of the time.

For the entry-level system, running on a single dedicated machine, we will still provide an email address for incident reporting and will respond to any notification within 4 hours. However, if an incident results in a loss of service we will use reasonable efforts to restore the service as quickly as possible, but do not guarantee that we will restore the service within 1 business day.

5 Use of Open Source Software

Our platform is based on open source software, notably:
- Apache Jena, including ARQ, TDB and Fuseki
- Apache Web Server
- Apache Solr
- ELDA, Epimorphics' open source implementation of the Linked Data API
- Apache Tomcat
- Apache Lucene

6 Compliance with Open Standards

Linked data is crucially dependent on the correct implementation of the relevant open standards. Our platform is fully compliant with all the relevant standards, notably:
- RDF syntaxes: RDF/XML, Turtle, N-Triples
- RDF 1.1 Turtle
- SPARQL 1.1 Query
- SPARQL 1.1 result set formats (XML, JSON, CSV, TSV)
- SPARQL 1.1 Update
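Because the result set formats listed above are standardised, clients can process query results with ordinary tooling. The sketch below flattens a document in the SPARQL 1.1 Query Results JSON Format into plain rows. The sample data is invented for illustration and does not come from any real endpoint.

```python
import json

# Invented sample in the SPARQL 1.1 Query Results JSON Format.
SAMPLE = """
{
  "head": {"vars": ["item", "label"]},
  "results": {
    "bindings": [
      {
        "item": {"type": "uri", "value": "http://example.org/id/1"},
        "label": {"type": "literal", "value": "Example"}
      },
      {
        "item": {"type": "uri", "value": "http://example.org/id/2"}
      }
    ]
  }
}
"""


def result_rows(doc):
    """Return one dict per solution, mapping variable name to its value.

    Variables left unbound in a solution are simply absent from that row.
    """
    variables = doc["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in variables if name in binding}
        for binding in doc["results"]["bindings"]
    ]


rows = result_rows(json.loads(SAMPLE))
for row in rows:
    print(row)
```

The same rows could equally be obtained from the CSV or TSV result formats with the standard csv module; JSON is used here because it also carries the type of each value.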