HALOGEN. Technical Design Specification. Version 2.0

HALOGEN Technical Design Specification Version 2.0 10th August 2010

Document Revision History

Date | Author | Revision | Description
27/7/09 | D Carter, Mark Widdowson, Stuart Poulton, Lex Comber | 1.1 | First draft following review by IT Services team.
10/8/09 | D Carter | 2.0 | Issued to Project Board.

Approvals

Date | Name | Title | Embedded signature/email

1. Purpose of Document

The purpose of this paper is to document the design of the IT-related infrastructure that will be used to support users of the HALOGEN system. This paper begins by describing the general approach and principles that will be applied to the implementation phase of the project. Subsequent sections describe a generic model that can be used to support the design of database management systems and their related infrastructure, review some of the key data quality issues associated with the pilot datasets, and finally identify the specific software tools and hardware that will be deployed by the University of Leicester for HALOGEN.

2. Implementation Approach and Design Principles

2.1 Implementation Approach

The high level user and service requirements have been agreed with the research users and documented in the HALOGEN Service Specification (http://www2.le.ac.uk/offices/itservices/resources/cs/pso/halogen/key-documents). The implementation approach for delivering a system to support these requirements will be to develop two prototypes during the implementation phase of the pilot project. These are briefly described below.

Prototype 1

The first prototype will be structured around using the ArcGIS software package to directly process and analyse the pilot datasets. As part of this phase of work, procedures to clean and format the raw data in the pilot datasets will be developed, and ArcGIS-specific processing and visualisation requirements will be defined and delivered. This will give the research team a capability to begin to analyse new sources of research data and further formulate their thinking on the research questions they wish to address and the techniques that will be of most value to them.

There is no lead time for providing the infrastructure to support the development of the initial prototype.
The data storage requirements can be met through use of currently spare capacity on the existing Research Storage infrastructure managed by IT Services; the University already has the required ArcGIS software licenses; and standard desktop workstations can support the processing associated with the visualisation and analysis of data. The specialist ArcGIS resource required to undertake the majority of the work for this phase has been secured and is available from mid-August.

Prototype 2

The second prototype will build on the initial system and introduce a central database which will be used as a repository to hold research data in a structured and standardised way. This will provide a database management system that will support the longer term aims of HALOGEN researchers: to increase the number and size of the external data sources input into the system, and to be able to interrogate and extract information for analysis using a variety of different analysis tools, not just ArcGIS. There is no implied dependency between the two prototypes and, resource permitting, the development of the second prototype can progress in parallel with the first.

There will be a need to purchase dedicated server equipment to host the central database. In order not to introduce any delay, it is proposed to use existing server capacity to provide a virtual server development environment for the early phases of the implementation project. At this stage it is proposed to use database and data management tools that are available free to the research community. The resource required to complete the development of this prototype will be predominantly IT Services staff, with input from specialist GIS resource on database design issues as necessary.

2.2 Design Principles

The following high level design principles have been assumed when making decisions on the database and infrastructure for HALOGEN.

- As this is currently a non-business critical service for the University and, from a research perspective, a pilot project, the level of resilience and redundancy built into the design is low.
- In order to limit cost the team have selected database and software products which are either free to use for academic staff or where the University already has site license agreements in place.
- The working assumption has been that the pilot will be successful and that additional funds will be found to extend the service.
- The infrastructure design proposed can be scaled up at extra cost to support both the storage and processing capacity growth projections in the Service Specification.
- Any scripts to clean, load or extract data will be developed in such a way that they can be operated by research users without the aid of specialist IT staff.
- IT Services will be responsible for the support and maintenance of all infrastructure components deployed to support the project.

3. Database Management Model

The generic database management model on the following page identifies that, for database requirements like those of HALOGEN, there are five stages or processes that need to be considered. These are described briefly below.

Source Data - This involves obtaining and storing the source data that will be used as input to the research database. Included within this are procedures to refresh data as new versions become available.

Extract, Transform and Load (ETL) - This involves procedures to extract from the source data those items which are of relevance to the research group. In many cases the raw data may need to be cleaned and formatted in some way in order to improve its integrity or make it compatible with other source data sets. For example, it may be useful to introduce common codes for regions or check that the entries in specific fields are complete and accurate. Finally, there needs to be a way of loading the required data into some type of database.

Data Storage - This involves procedures to store and manage the core data required by researchers. For example, procedures and policies will need to be defined for governing the backup, recovery and access of data.

Data Analyses - This involves applying various tools and techniques to the data to produce information. In some cases it may involve extracting selected information for analysis using tools like SAS, SPSS or R.

Visualisation - This is one specific form of data analysis. From a HALOGEN perspective the geographical visualisation of data is the key user requirement and so it is appropriate to consider it as a separate stage.

The diagram also identifies some of the many software tools that could be used to support different stages of the model. Each institution needs to choose those products which best suit their requirements and the skill sets of those involved in any project.
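As an illustration of the transform step in the ETL stage described above, the following is a minimal Python sketch rather than the project's actual scripts. The field names `record_id` and `region`, the `REGION_CODES` lookup and the sample rows are all hypothetical. It introduces a common region code and rejects rows whose required fields are incomplete or whose region is not recognised.

```python
import csv
import io

# Hypothetical mapping from free-text region names to common codes.
REGION_CODES = {
    "leicestershire": "LEI",
    "leics": "LEI",
    "nottinghamshire": "NTT",
    "notts": "NTT",
}

# Hypothetical required fields; real datasets will differ.
REQUIRED_FIELDS = ("record_id", "region")


def transform(rows):
    """Split rows into (clean, rejected) after adding a common region code."""
    clean, rejected = [], []
    for row in rows:
        # Strip stray whitespace from every value.
        row = {key: (value or "").strip() for key, value in row.items()}
        # Completeness check: every required field must be non-empty.
        missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
        code = REGION_CODES.get(row.get("region", "").lower())
        if missing or code is None:
            rejected.append(row)
            continue
        row["region_code"] = code
        clean.append(row)
    return clean, rejected


# Extract: a small in-memory CSV stands in for a real source file.
source = io.StringIO(
    "record_id,region\n"
    "1,Leicestershire\n"
    "2,Notts \n"
    "3,\n"
    "4,Atlantis\n"
)
clean, rejected = transform(csv.DictReader(source))
```

Keeping rejected rows in a separate list, rather than silently dropping them, lets research users review and correct the source data before it is loaded.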

Database Management Software

A key design decision is which database software to use. From a University of Leicester perspective three database products were considered as suitable for a database with the potential capacity and complexity requirements of HALOGEN. Initial discussions with researchers suggest requirements for spatial querying, data mining and full text search/manipulation. The three options considered were Oracle, SQL Server and MySQL. The key issues relating to each are summarised below.

Oracle

Oracle 9i (or higher) Enterprise Edition is required, with both the spatial querying and data mining options added. Licensing costs for Oracle are high and, combined with the limited skills and resources available for Oracle, make this option prohibitive for the University of Leicester given the alternatives available.

SQL Server

SQL Server 2008 Standard Edition is required. Only the 2008 version of the software offers the spatial data querying option. Data mining and full text indexing options are also included. The Research Computer Services team do not currently have the relevant skills to support this software and, whilst the relevant skills do exist elsewhere in IT Services, these staff are already overcommitted to project work.

MySQL

MySQL offers support for both spatial data querying and full text indexing. Data mining is offered by the third-party open source provider Pentaho. There are open source versions of the software which are available at no cost and are deemed suitable for the pilot phase of the project. The relevant skills and resource are available in the Research Computer Services team.

Based on the above the team has chosen MySQL as the database software for HALOGEN.

The software tools that we currently plan to use to support each stage of the database management model are summarised below for both of the prototypes.
Prototype 1

Source Data & Extract, Transform, Load - Excel, Access, Perl scripts
Data Storage - ArcGIS is based on a database and has utilities to assist with metadata management.
Data Analyses - ArcGIS has the ability to conduct many types of standard statistical analyses and to output files in a format that can be used by packages like SPSS or R.
Visualisation - ArcGIS

Prototype 2

Source Data & Extract, Transform, Load - Excel, Access, Perl and/or Python scripts
Data Storage - MySQL (for some types of data, e.g. large images, it may be more appropriate to store these outside of the database on a filesystem and store the relevant metadata in MySQL)
Data Analyses - Database extracts/queries will provide output for use in tools favoured by researchers, for example SPSS, R and ArcGIS.
Visualisation - ArcGIS
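The Prototype 2 storage note above (large files on a filesystem, metadata in the database) can be sketched as follows. This is a hedged illustration only: Python's built-in `sqlite3` module is used purely as a stand-in so that the example runs without a server, whereas the design itself specifies MySQL. The table name `image_metadata`, its columns and the function `store_image` are hypothetical.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def store_image(conn, store_dir, name, data):
    """Write the file to disk and record its metadata in the database."""
    path = Path(store_dir) / name
    path.write_bytes(data)
    conn.execute(
        "INSERT INTO image_metadata (path, size_bytes, sha256) VALUES (?, ?, ?)",
        (str(path), len(data), hashlib.sha256(data).hexdigest()),
    )
    conn.commit()
    return path


# Assumption: sqlite3 stands in for MySQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image_metadata ("
    "id INTEGER PRIMARY KEY, path TEXT NOT NULL, "
    "size_bytes INTEGER NOT NULL, sha256 TEXT NOT NULL)"
)

with tempfile.TemporaryDirectory() as store_dir:
    # Hypothetical file name and placeholder bytes, not a real image.
    saved = store_image(conn, store_dir, "scan_0001.tif", b"not a real image")
    stored_path, size_bytes = conn.execute(
        "SELECT path, size_bytes FROM image_metadata"
    ).fetchone()
    file_exists = saved.exists()
```

Recording a checksum alongside the path is one possible way to detect later whether a file on the filesystem has changed or gone missing relative to its database record.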

4. Data Quality Issues with Source Data Sets

As part of the design phase of the project an initial review of the quality of the pilot datasets was performed by GIS specialists in Geography. The intention was to highlight any major issues or problems that could impact the use of ArcGIS. The findings are presented below for the Portable Antiquities Scheme (PAS), which now includes the Fitzwilliam coins database, the Key to English Place Names (KEN) and the University of Leicester Genetics data.

All the sample datasets have some form of georeference:
- PAS: easting and northing fields, although 10% are empty
- Genetics: population field, which generally describes the county of the record
- English place names: gridref fields with components of an OS grid reference

The PAS dataset is generally fine with good georeferencing, except for the 10% of the records with empty easting and northing fields. Suggested improvements:
- fill in any blank values (e.g. with a 0, etc)
- assess if all the fields are needed in the file to be imported into ArcGIS

The Genetics data is in a strange format and will take extensive manipulation to get it into a format that can be imported into ArcGIS, and there will still be problems. This data can be mapped as it stands with some manipulation, although only to a county area centroid. Suggested improvements:
- Have only one header row
- Export the file as a csv from the dbf (export > text file myfile.csv); Access should do the rest, including management of blank fields. If this cannot be done, some simple grep / replace could be run on the data.
- Assess how to manage "not typed": this is a numeric field, and having non-numeric characters entered will confuse ArcGIS
- Typed descriptions need to be avoided, so develop and use a key code to ensure consistency
- Introduce separate fields for different things, e.g. "Father from Australia" vs. "Father from Australia Mother from New Zealand". The latter needs to be in two separate columns for ease of analysis.
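Two of the Genetics clean-up suggestions above (handling "not typed" in a numeric field and splitting parental origin into separate columns) could be approached along the following lines. This is a sketch under stated assumptions, not the team's procedure: the `MISSING` sentinel value, the regular expression and the function names are hypothetical, and real records are likely to need additional patterns.

```python
import re

NOT_TYPED = "not typed"  # free-text marker noted in the review
MISSING = -9999          # hypothetical sentinel for missing numeric values

# Hypothetical pattern for free-text origins such as
# "Father from Australia Mother from New Zealand".
ORIGIN = re.compile(
    r"(Father|Mother) from (.+?)(?= (?:Father|Mother) from |$)"
)


def split_origin(text):
    """Return (father_origin, mother_origin); empty string when not stated."""
    found = {
        who.lower(): place.strip()
        for who, place in ORIGIN.findall(text or "")
    }
    return found.get("father", ""), found.get("mother", "")


def clean_numeric(value):
    """Convert a numeric field; blank or non-numeric text becomes MISSING."""
    text = (value or "").strip()
    if not text or text.lower() == NOT_TYPED:
        return MISSING
    try:
        return int(text)
    except ValueError:
        return MISSING


# Hypothetical record illustrating both fixes.
record = {"origin": "Father from Australia Mother from New Zealand",
          "marker": "not typed"}
father, mother = split_origin(record["origin"])
marker = clean_numeric(record["marker"])
```

Using a single numeric sentinel keeps the field purely numeric, so that a tool inferring column types is not confused by mixed text and numbers.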
There is a requirement to add an alternative geographical location to each record in the Genetics database. This will reference the centre of gravity of the surname of the male whose Y chromosome was analysed. The geographical coding for these locations will be systematic and should not cause problems, but some surnames (e.g. common ones like 'Smith') will lack a location.

The KEN data (English place names) was provided in .mdb format with some .csv files. It was generally of reasonable quality, and not too much formatting will be needed to get the data into ArcGIS. However, the spatial referencing is coarse and some manipulation is needed to concatenate three fields in order to provide a 1km OS grid reference (e.g. SK, 75, 48 becomes SK7548), which could be linked to an easting and northing via data downloaded from Edina.ac.uk. This data can be mapped after the concatenated grid reference elements have been linked to OS 1km grid data to give the easting and northing values. Suggested improvements:
- Provide data in csv format
- Provide complete 6 digit referencing, not 4 or 5 digit
- Consider more detailed referencing
- Concatenate current referencing if possible, although this is not essential (note that only 191 / 313 records have this georeference, i.e. ~40% do not)
- Link to gazetteer to improve data quality. This can be done but it will need other data to identify the correct gazetteer record.
- Avoid control characters in the free text (etymology), as this can create problems when the data is read into ArcGIS

There are some general guidelines for importing data to ArcGIS which will be followed to simplify data preparation activities. These are:
- .csv or Excel (.xls) format
- .mdb format but not .accdb
- no non-alphanumeric characters in field names, although underscores are ok
- do not start the field names with a number
- field names should be <12 characters in length
- have one type of data in each field, e.g. characters or numbers (characters can include numbers but they cannot be manipulated numerically)
- do not have too many (i.e. 100s) spurious fields, as some of the data may not load
- do not have sparse fields: ArcGIS will make a guess (often a bad one) at what type of field it is (e.g. text, etc), so complete with 0, N/A, -9999 etc
- avoid labelling identifiers / referents as ID; myid is better
- use a standard set of descriptors in any field if possible rather than free text. If a predefinable set of free text descriptions is to be entered and this is being done manually, it is suggested that a code is used to avoid typos.

In summary, the most problematic data set is the Genetics data. Considerable processing will be needed to clean and standardise the data prior to analysis.
As this data is internal to the University of Leicester, the team are comfortable that this is achievable. Overall the team believe that all the issues identified to date can be overcome as part of the pilot project.

5. Infrastructure & Implementation Resource

To support the pilot phase of the project IT Services will need to purchase, install and configure an appropriate database server and storage. These items are outlined below.

Database Server

The intention is to buy a production server with the following specification:
- 12 core x86_64
- 64GB System RAM
- Redundant PSU
- RAID 1 System disks
- RAID 0+1 DB Storage 800GB
- 3yr On-site next business day support

The operating system will be Linux. The cost is estimated as £15,000 inclusive of VAT.

Storage

To support data storage and backup requirements it is proposed to buy additional capacity on the Research Data Storage system managed by IT Services. HALOGEN will be provided with 2 Terabytes of usable primary storage capacity and a further 2 Terabytes of backup storage capacity. Data will be backed up daily to a different data centre from the one that hosts the database server. The recovery of data will be managed by the Research Computer Services team on receipt of a support request. The cost of this will be £2,400 inclusive of VAT.

Software

The free version of the open source MySQL database software will be used for the pilot project. The use of ArcGIS by HALOGEN is covered by a site license and therefore incurs no additional cost. It is possible that additional software products may be identified later in the project.

User Access

Users will be able to access the infrastructure from Windows (CIFS), Linux (NFS) and Mac (CFS) workstations.

Resource Estimates

Developing the two prototypes will require the following resource:
- Geography - ArcGIS expertise, data clean up and visualisation: 40 days
- Research Computer Services - infrastructure build and database configuration: 10-20 days
- Database Management Services - database design and consultancy: 10-15 days