Data Appliance: Sailing to Data Islands. By Simon Ellwood-Thompson, Chief Technical Officer, SAIL Databank & Health Informatics Research Unit, Swansea University
SAIL Databank, Swansea, WALES. Wales: the most beautiful part of the United Kingdom (3m people, 11m sheep)
SAIL Databank, Swansea, WALES. Following the PechaKucha, just some clarity: Wales and Scotland are the most interesting
SAIL DATABANK: Recent Developments
Medical Research Council (MRC): Centre of Excellence, with SAIL DATABANK as a major asset. CIPHER is one of the four co-ordinating centres of the Farr Institute (Wales, Scotland, Manchester, UCL London).
Economic and Social Research Council (ESRC): CADRE is one of four Administrative Data Research Centres (ADRCs) (Wales, Scotland, Southampton (England), Northern Ireland).
Bio-Informatics award: large compute cluster for genetic research.
FARR @ Swansea: Capital Investment
Additional capital investment. Our GOALS:
1. UKSeRP: Offer an expanded version of our infrastructure as a service (IaaS) to other major programmes (non-SAIL).
2. Data Appliance: Provide local capabilities to manage datasets so that dataset discovery and availability become easier.
3. Natural Language Processing.
Context: A large amount of automation is already developed, but a massive increase in workload is predicted without an increase in staffing.
FARR UKSeRP (quick overview)
Existing infrastructure: large IBM DB2 data warehouse (database and management/processing code); remote access technology (SAIL Gateway, based on VMware View); policies and procedures; hosting, power, cooling and IT staff to support the infrastructure.
Expand technical platform: double the SAIL Gateway and increase the power of each desktop; add software, e.g. SAS & BI tools; add a SQL Server 2012 3-node AG cluster; add a HADOOP cluster for big data; HyperV clusters; additional 10 racks.
[Diagram: DB2 Head and Backup; IBM DB2 Warehouse (NODE1 to NODE4, 2 x SAN 24TB); SAIL VDI (2 x SAN 60TB); SQL 2012 Availability Group (3 x SQL 2012 Ent, DAS 21TB each); V. Tape SAN; NLP (SAN 21TB); HyperV (SAN 100TB); Hadoop nodes]
Additional management requirements: now three database platforms; selective delegated management and control; multiple configuration and security models.
FARR Data Appliance
Goal: Development of a hardware and software appliance for deployment into the NHS, Local Government and within SAIL to provide dataset collection, management, documentation and local linkage. These appliances will bring the capabilities previously only found in a large data linkage centre to the organisation in which they are deployed. A key outcome is to make documented datasets visible to the wider research community for inclusion in national projects, subject to information governance approval. These units are designed to be as low a unit cost as possible. Funded to provide 15 appliances to NHS/Government free of charge.
What's the point? Provide a carrot, not a stick. Give business benefit to an organisation by allowing them to create and manage datasets. Provide locally linked datasets. Create identifiable and anonymised views for these staff. Provide documentation and validation of datasets = discover and make datasets research ready.
Data Appliance: simplistic viewpoint
Web-based application. Empower the end user to create and manage datasets. No database expertise required. Lowering the technical bar.
[Diagram: Web Front End and FTP / ETL feeding three DATASET blocks, each with Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, Artefacts / Files; all sitting on the Security, Configuration & Capability Model]
Data schema automatically computed based on data contained in uploaded file
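The slide does not say how the schema is computed. As a rough illustration only, the sketch below infers one type per column from an uploaded CSV by trying progressively looser parsers. The function names, the type labels and the single `YYYY-MM-DD` date format are assumptions made for this example, not the appliance's actual rules.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "text"

    def all_parse(parser):
        # True only if the parser accepts every non-empty value.
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    # Assumption: dates arrive as YYYY-MM-DD; a real loader would try more formats.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def infer_schema(csv_text):
    """Map each column of a CSV (with header row) to an inferred type label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        column: infer_type([(row[column] or "") for row in rows])
        for column in (reader.fieldnames or [])
    }
```

Empty cells are ignored when choosing a type, so a mostly complete date column is still reported as a date rather than falling back to text.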
Publish based on permissions, configuration & capabilities
[Diagram: Web Front End and FTP / ETL feeding three DATASET blocks (Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, Artefacts / Files) over the Security, Configuration & Capability Model. Publishing (File Splitter) modules: Local Data Catalogue; Data Quality and Metrics; Sharing & IG; Linkage & Matching; Database Loader (MS SQL, PostgreSQL). External: Regional / Global Data Catalogue; Other Appliance; Trusted Third Party Linkage & Matching; UKSeRP (IBM DB2, MS SQL, PostgreSQL, HADOOP)]
Key deliverable: Permission / Configuration / Capabilities
Publish Dataset: depends on configuration/capabilities. Data will now be available
Data Catalogue: Key Component
Additional points following previous sessions: every Data Appliance (DA) carries a Data Catalogue (DC); a dataset (DS) can inherit DC entries from other datasets; each DC is related to a Programme/Security domain; DCs replicate to the Regional/Global DC.
Road map: DC used to define and create DS.
A Dataset (screen labels): Contact; Specific version & Date; Request; VIMO; All sections attach files; Theme / Type / Level; Tags
A Dataset (cont.) DDI, SPSS, SAS, STATA
Data Catalogue a specific table
Data Appliance: very modular and configurable
Physical server running a set of virtualised servers, configured and scaled appropriately for the environment. Architecture is based on loosely coupled async message passing between code blocks (presented at SHIP 2013: UKSeRP Presentation, Data Appliance Presentation).
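To make "loosely coupled async message passing between code blocks" concrete, here is a minimal sketch in which two independent blocks communicate only through queues. It uses in-process `asyncio.Queue` objects as a stand-in for a message broker; the block names (`loader`, `documenter`) and message fields are hypothetical and are not taken from the appliance's code.

```python
import asyncio


async def loader(inbox, outbox):
    """Hypothetical 'load' block: consumes file names, emits dataset names."""
    while True:
        message = await inbox.get()
        if message is None:
            # Sentinel: pass the shutdown signal downstream and stop.
            await outbox.put(None)
            return
        await outbox.put({"dataset": message["file"].rsplit(".", 1)[0]})


async def documenter(inbox, results):
    """Hypothetical 'document' block: records each dataset it hears about."""
    while True:
        message = await inbox.get()
        if message is None:
            return
        results.append(f"documented {message['dataset']}")


async def run_pipeline(files):
    """Wire the two blocks together with queues and feed in file names."""
    uploads, loaded = asyncio.Queue(), asyncio.Queue()
    results = []
    workers = [
        asyncio.create_task(loader(uploads, loaded)),
        asyncio.create_task(documenter(loaded, results)),
    ]
    for name in files:
        await uploads.put({"file": name})
    await uploads.put(None)
    await asyncio.gather(*workers)
    return results
```

Neither block calls the other directly, so either one could be replaced, scaled or moved to another virtual server without changing its neighbour, which is the property the slide is pointing at.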
3 initial configurations: plug and play, single cable
Small (Development / Demo): single server with everything on it. 4 cores, 6GB. Web site, workflow engine, modules, SQL Express, MongoDB, RabbitMQ.
Medium (Single Physical Server): single HyperV server, multiple v-servers for different roles. Dual 10 core CPU (40 virtual cores), 160GB memory, 6TB disk. SQL Express replaced by SQL Server Standard 2012. 10 special versions having extra modules for CliniThink NLP.
Large (Four Physical Clustered Servers): dual HyperV servers, dual 10 core CPU (40 virtual cores), 160GB memory, shared 24TB disk. Dual SQL Server 2012 Enterprise, dual 8 core, 96GB memory, 22 x 300GB local disk, SQL 2012 Server AG cluster.
Software: custom software in C# .NET 4 / MVC 4. RabbitMQ, MongoDB, RavenDB for the Large version, GoodSync FTP replication.
Costs for medium version: hardware plus licensing for Microsoft Windows Server 2012 Enterprise & Microsoft SQL Server 2012 Standard.
The Appliance is a disruptive technology: GAME CHANGER
Challenge: fit everything that a large data linkage centre does into a single shrink-wrapped product.
Opportunity: look back on what we have done and question the design. A unique opportunity for reflection, and rare in successful operational systems.
Challenges
UKSeRP additional database systems: need to support Microsoft SQL Server, PostgreSQL and Cloudera HADOOP as well as IBM DB2 Warehouse. Opportunity: make these systems agnostic and remove vendor tie-in, allowing for options in the future.
Need a probabilistic matching engine to do data linkage. Opportunity: our existing system is very slow and, due to its age, cannot be supported very well by our trusted third party. It is also heavily biased towards the gold standard. Partnership with Curtin University, Australia, allowing us to embed their system in the appliance and replace the trusted third party system. Additional benefit of increasing the capabilities of our matching beyond gold-standard matching, and looking forward to a continued partnership to explore Bloom filter matching, automated matching tuning, and dynamic recompilation of matching relationships based on project needs. Special thanks to James Boyd & James Semmens, Curtin University.
Replace our residential matching and anonymisation system with Experian AddressBase, allowing integration with the matching engine and finer matching down to flats in multi-occupancy residences.
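As background on the Bloom filter matching named above, the sketch below shows the general idea in its textbook form: each name is broken into character bigrams, every bigram sets a few bit positions, and two encodings are compared with the Dice coefficient. This is a generic illustration, not the Curtin University system. The filter size, number of hash functions and unkeyed SHA-256 hashing are assumptions chosen for brevity; a privacy-preserving deployment would use keyed hashes.

```python
import hashlib


def bigrams(text):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{text.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text, size=256, hashes=4):
    """Return the set of bit positions switched on for this string."""
    bits = set()
    for gram in bigrams(text):
        for seed in range(hashes):
            # Assumption: unkeyed SHA-256 with a seed prefix stands in for
            # the keyed hash family a real linkage unit would use.
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient between two sets of bit positions (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

For example, `dice(bloom_encode("Thompson"), bloom_encode("Thomson"))` comes out well above the score for two unrelated surnames, so near-miss spellings can be scored without comparing the raw identifiers.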
Linkage: migrating from ALF to new ALF2 and RALF2. SAIL uses a trusted third party; an additional benefit is the inclusion of process monitoring and remote reporting. End users will be able to see where in the process their requests are, which is a much better user experience.
Challenges
Need delegable/devolved account management, for both the appliance and UKSeRP: Authentication, Authorisation, Accounting.
Opportunity: develop a new security model which covers all aspects of the infrastructure, not just the database, allowing the model to be validated and used in many ways.
Modular and extensible provisioning system that takes the model and applies the intention. Event driven rather than time based, giving better service to the user.
Modular user activation, e.g. SAIL DAA, HR system lookup.
Ability to support multiple two-factor authentication systems.
Linking into JANET MoonShot host organisation authentication (under development).
Challenges
Dataset documentation is patchy and vague. Structured documentation is now mandatory for a dataset to be loaded into the appliance, and a dataset can only be loaded into SAIL using the appliance. Ability to attach artefacts (supporting documents) to a dataset. Ability to load data as reference / lookup data within a dataset.
Opportunity: partnership with Manitoba Centre for Health Policy, Canada. Special thanks to Mark Smith. Bring the automated data quality reporting to the appliance, and turn the appliance back on the SAIL Databank to automatically collate and measure the quality metrics. We now have a database-system-agnostic data quality module. Looking forward to a continued partnership to look at measuring not just variable quality but also relationships between variables and other datasets, as well as creating a pluggable architecture to do dataset-specific statistical analysis.
Opportunity: create an automatic Data Catalogue based on the datasets' documentation, computed metrics and validation rules. Link into the IHDLN Working Group on Metadata.
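The kind of per-variable quality metric described above can be sketched in a few lines. This example counts missing, valid and invalid values for one variable against a caller-supplied validation rule and derives a completeness ratio. The function name, the metric names and the age rule are illustrative assumptions, not the module's actual definitions.

```python
def column_metrics(rows, column, is_valid):
    """Count missing, valid and invalid values for one variable."""
    counts = {"missing": 0, "valid": 0, "invalid": 0}
    for row in rows:
        value = row.get(column)
        if value is None or str(value).strip() == "":
            counts["missing"] += 1
        elif is_valid(value):
            counts["valid"] += 1
        else:
            counts["invalid"] += 1
    total = len(rows)
    # Completeness: share of rows where the variable is present at all.
    counts["completeness"] = (total - counts["missing"]) / total if total else 0.0
    return counts


def plausible_age(value):
    """Hypothetical validation rule: a whole number of years from 0 to 120."""
    return value.isdigit() and 0 <= int(value) <= 120
```

Because the function only needs rows as dictionaries, the same check can run against data pulled from any of the supported database platforms, in keeping with the database-agnostic aim.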
Challenges
Data movement and routing, for both data and metadata: automated splitting of data files; trusted third parties; data refreshes; organisations outside the NHS; sharing / subscriptions.
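The slides do not describe how files are split. One common arrangement in linkage settings, assumed here purely for illustration, is to separate identifying columns (destined for a trusted third party) from the remaining payload, with a shared per-record key so the two halves can be rejoined after matching. The function name, the `record_key` field and the use of random UUIDs are all hypothetical.

```python
import csv
import io
import uuid


def split_file(csv_text, identifying_columns):
    """Split CSV rows into identity rows and payload rows sharing a key."""
    reader = csv.DictReader(io.StringIO(csv_text))
    identity_rows, payload_rows = [], []
    for row in reader:
        key = uuid.uuid4().hex  # hypothetical per-record join key
        identity_rows.append(
            {"record_key": key, **{c: row[c] for c in identifying_columns}}
        )
        payload_rows.append(
            {"record_key": key,
             **{c: v for c, v in row.items() if c not in identifying_columns}}
        )
    return identity_rows, payload_rows
```

With this shape, neither output on its own pairs identifiers with content, and routing each half to a different destination becomes a configuration decision rather than custom code.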
Why such a disruptive technology (6 months to build!!)
The system was fine and everybody was happy. As we started designing, it became obvious how the system should be reconfigured with the data appliance as the central component.
[Diagram: oversimplified representation of SAIL Databank: SAIL DATABANK, IBM DB2, Remote Access, Security / IG, SAIL Technical, Additional Data]
Why such a disruptive technology (6 months to build!!)
SAIL Databank has/will/is becoming an instance of UKSeRP and fully dependent on the Data Appliance.
[Diagram: Remote Access / VDI; Other Major Programme; SAIL DATABANK; Security / IG; Data Appliance; BI / Compute; NLP; HADOOP; MS SQL; IBM DB2; Additional Data. Data Appliance functions: Data Management and Loading; File Splitting & Separation; Versioning; Auto Documentor; Data Metrics and Quality reporting; Data Catalogue; Data transportation]
Why such a disruptive technology (6 months to build!!)
Deployment of SAIL Databank Satellites / SAIL Mini: 7 NHS trusts of Wales; 4 NHS trusts of Bristol (England); 1 NHS Trust of North Devon.
Major upgrades to our trusted third party (NWIS).
SAIL Databank major technology upgrade.
ADRC: non-health focused.
Data Appliance Simon Ellwood-Thompson SWANSEA UNIVERSITY SIMON@CHI.SWAN.AC.UK