Data Appliance Sailing to Data Islands



Data Appliance Sailing to Data Islands
By Simon Ellwood-Thompson, Chief Technical Officer: SAIL Databank & Health Informatics Research Unit, Swansea University

SAIL Databank, Swansea, WALES
WALES: the most beautiful part of the United Kingdom (3m people, 11m sheep)

SAIL Databank, Swansea, WALES
Following the PechaKucha, just some clarity: WALES and Scotland are the most interesting

SAIL DATABANK: Recent Developments
- Medical Research Council (MRC) Centre of Excellence, with SAIL DATABANK as a major asset: Wales, Scotland, Manchester, UCL London. CIPHER is one of the four co-ordinating centres of the Farr Institute.
- Economic and Social Research Council (ESRC): Wales, Scotland, Southampton (England), Northern Ireland. CADRE is one of four Administrative Data Research Centres (ADRCs).
- Bio-Informatics award: large compute cluster for genetic research.

FARR @ Swansea: Capital Investment
Additional capital investment. Our GOALS:
1. UKSeRP: Offer an expanded version of our infrastructure as a service (IaaS) to other major programmes (non-SAIL)
2. Data Appliance: Provide local capabilities to manage datasets so that dataset discovery and availability become easier
3. Natural Language Processing
Context: A large amount of automation has already been developed, but a massive increase in workload is predicted without an increase in staffing

FARR UKSeRP (quick overview)
Existing infrastructure:
- Large IBM DB2 data warehouse (database and management/processing code)
- Remote access technology: SAIL Gateway, based on VMware View
- Policies and procedures
- Hosting, power, cooling, IT staff to support infrastructure
Expand technical platform:
- Double SAIL Gateway and increase power of each desktop
- Add software, e.g. SAS & BI tools
- Add SQL Server 2012 3-node AG cluster
- Add Hadoop cluster (big data)
- Hyper-V clusters
- Additional 10 racks
[Diagram: rack layout. DB2 head, backup and IBM DB2 Warehouse nodes 1 to 4 with two 24TB SANs; SAIL VDI hosts with two 60TB SANs; SQL 2012 Availability Group of three SQL 2012 Enterprise servers with 21TB DAS each, plus a virtual tape SAN; NLP servers with a 21TB SAN; Hyper-V hosts with a 100TB SAN; Hadoop nodes.]
Additional management requirements:
- Now three database platforms
- Selective delegated management and control
- Multiple configuration and security models

FARR Data Appliance
Goal: Development of a hardware and software appliance for deployment into the NHS, Local Government and within SAIL to provide dataset collection, management, documentation and local linkage.
These appliances will bring the capabilities previously only found in large data linkage centres to the organisation in which they are deployed. A key outcome is to make documented datasets visible to the wider research community for inclusion in national projects, subject to information governance approval.
These units are designed to be as low a unit cost as possible. Funded to provide 15 appliances to NHS/Government free of charge.
What's the point? Provide a carrot, not a stick:
- Give business benefit to an organisation by allowing them to create and manage datasets
- Provide locally linked datasets
- Create identifiable and anonymised views for these staff
- Provide documentation and validation of datasets = Discover & make datasets research ready

Data Appliance: simplistic viewpoint
Web Based Application: empower the end user to create and manage datasets. No database expertise required. Lowering the technical bar.
[Diagram: a Web Front End and FTP / ETL feed multiple DATASETs. Each dataset has Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, and Artefacts / Files. Everything sits on the Security, Configuration & Capability Model.]

Data schema automatically computed based on data contained in uploaded file
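
As an illustration of computing a schema from an uploaded file, here is a minimal sketch in Python. It is not the appliance's own code (the slides describe that as custom C#); the function names, the four type labels and the single ISO date format are all assumptions made for this example.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Return the narrowest type label that fits every non-empty value."""
    present = [v.strip() for v in values if v.strip()]
    if not present:
        return "text"

    def all_parse(parser):
        # True only if the parser accepts every non-empty value.
        for v in present:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    # Assumption: dates arrive as YYYY-MM-DD; a real loader would try more formats.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


def infer_schema(csv_text):
    """Build {column: {"type": ..., "nullable": ...}} from CSV text with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(value)
    return {
        name: {
            "type": infer_type(vals),
            "nullable": any(not v.strip() for v in vals),
        }
        for name, vals in columns.items()
    }


sample = "id,dob,weight,name\n1,1980-02-01,70.5,Ann\n2,,81,Bob\n"
schema = infer_schema(sample)
```

A computed schema like this would only be a starting point: the Schema Editor shown on the slides suggests the user can then correct it by hand.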

Publish based on permissions, configuration & capabilities
[Diagram: a Web Front End and FTP / ETL feed multiple DATASETs, each with Access Control, Data storage, Documentation, Schema Editor, ER Diagram, Metrics and Validation, and Artefacts / Files. Publishing (File Splitter) connects the datasets to: Local Data Catalogue; Data Quality and Metrics; Sharing & IG; Linkage & Matching; Database Loader (MS SQL, PostgreSQL). External targets: Regional / Global Data Catalogue; Other Appliance; Trusted Third Party Linkage & Matching; UKSeRP database platforms (IBM DB2, MS SQL, PostgreSQL, HADOOP). Underneath: Security, Configuration & Capability Model.]
Key deliverable: Permission / Configuration / Capabilities

Publish Dataset: depends on Configuration/Capabilities. Data will now be available.

Data Catalogue: Key Component
Additional points following previous sessions:
- All Data Appliances (DA) carry a Data Catalogue (DC)
- Datasets (DS) can inherit DC entries from other DS
- DC related to Programme/Security domain
- DCs replicate to Regional/Global DC
Road map: DC used to define and create DS

A Dataset
[Screenshot callouts: Contact; Specific version & date; Request; VIMO; All sections: attach files; Theme / Type / Level; Tags]

A Dataset (cont.) DDI, SPSS, SAS, STATA

Data Catalogue a specific table

Data Appliance: very modular and configurable
A physical server running a set of virtualised servers, configured and scaled appropriately for the environment.
Architecture is based on loosely coupled async message passing between code blocks (presented at SHIP 2013).
[Diagram labels: UKSeRP Presentation; Data Appliance Presentation]
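
To make "loosely coupled async message passing between code blocks" concrete, here is a minimal Python sketch. The slides list RabbitMQ as the broker; this example swaps in an in-process `asyncio.Queue` so that it runs without any external service. The module names and message fields are hypothetical.

```python
import asyncio


async def schema_module(inbox, outbox):
    """Consume upload messages and emit them with a (toy) schema attached."""
    while True:
        msg = await inbox.get()
        if msg is None:            # sentinel: pass the shutdown signal downstream
            await outbox.put(None)
            break
        await outbox.put({**msg, "schema": sorted(msg["columns"])})


async def catalogue_module(inbox, catalogue):
    """Consume documented datasets and record them in a catalogue dict."""
    while True:
        msg = await inbox.get()
        if msg is None:
            break
        catalogue[msg["dataset"]] = msg["schema"]


async def run_pipeline(uploads):
    # The modules share nothing but queues, so either one can be replaced
    # or scaled without the other knowing.
    uploaded, documented = asyncio.Queue(), asyncio.Queue()
    catalogue = {}
    workers = [
        asyncio.create_task(schema_module(uploaded, documented)),
        asyncio.create_task(catalogue_module(documented, catalogue)),
    ]
    for upload in uploads:
        await uploaded.put(upload)
    await uploaded.put(None)
    await asyncio.gather(*workers)
    return catalogue


result = asyncio.run(run_pipeline([{"dataset": "gp_events", "columns": ["b", "a"]}]))
```

With a real broker the queues would live outside the process, which is what lets the same modules be spread across the virtual servers of the larger configurations.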

3 initial configurations: plug and play, single cable
- Small (Development / Demo): single server with everything on it. 4 cores, 6GB. Web site, workflow engine, modules, SQL Express, MongoDB, RabbitMQ.
- Medium (Single Physical Server): single Hyper-V server, multiple virtual servers for different roles. Dual 10-core CPU (40 virtual cores), 160GB memory, 6TB disk. SQL Express replaced by SQL Server 2012 Standard. 10 special versions have extra modules for CliniThink NLP.
- Large (Four Physical Clustered Servers): dual Hyper-V servers, dual 10-core CPU (40 virtual cores), 160GB memory, shared 24TB disk. Dual SQL Server 2012 Enterprise, dual 8-core CPU, 96GB memory, 22 x 300GB local disks, SQL Server 2012 AG cluster.
Software: custom software in C#.NET 4 / MVC 4. RabbitMQ, MongoDB, RavenDB for Large version, GoodSync FTP replication.
Costs for medium version: hardware plus licensing for Microsoft Windows Server 2012 Enterprise & Microsoft SQL Server 2012 Standard.

The Appliance is a disruptive technology: GAME CHANGER
Challenge: fit everything that a large data linkage centre does into a single shrink-wrapped product.
Opportunity: look back on what we have done and question the design. A unique opportunity for reflection, and rare in successful operational systems.

Challenges
UKSeRP: additional database systems. Need to support Microsoft SQL Server, PostgreSQL and Cloudera Hadoop as well as IBM DB2 Warehouse.
- Opportunity to become system agnostic: remove vendor tie-in, allowing for options in the future.
Need a probabilistic matching engine to do data linkage.
- Opportunity: our existing system is very slow and cannot be supported very well by our trusted third party due to its age. Very gold standard biased.
- Partnership with Curtin University, Australia, allowing us to embed their system in the appliance and replace the trusted third party system.
- Additional benefit of increasing the capabilities of our matching beyond gold standard matching, and looking forward to a continued partnership to explore Bloom filter matching, automated matching tuning, and dynamic recompilation of matching relationships based on project needs.
- Special thanks to James Boyd & James Semmens, Curtin University.
Replace our residential matching and anonymisation system with Experian AddressBase, allowing integration with the matching engine and finer matching down to flats in multi-occupancy residences.
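
The Bloom filter matching mentioned above is commonly built from hashed character bigrams compared with a Dice coefficient. The Python sketch below shows that general idea only; it is not the Curtin University system, and the filter size, number of hash functions and choice of SHA-1/SHA-256 are assumptions picked for illustration.

```python
import hashlib


def bigrams(text):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{text.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(text, size=1000, num_hashes=4):
    """Return the set of bit positions set by hashing each bigram num_hashes times."""
    bits = set()
    for gram in bigrams(text):
        data = gram.encode("utf-8")
        h1 = int(hashlib.sha1(data).hexdigest(), 16)
        h2 = int(hashlib.sha256(data).hexdigest(), 16)
        for i in range(num_hashes):
            # Double hashing: derive several positions from two base hashes.
            bits.add((h1 + i * h2) % size)
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets: 1.0 identical, 0.0 disjoint."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


smith = bloom_encode("smith")
same = dice(smith, bloom_encode("Smith "))
close = dice(smith, bloom_encode("smyth"))
far = dice(smith, bloom_encode("jones"))
```

Because only bit positions are compared, two parties can score similar names without exchanging the names themselves. A production scheme would use keyed hashes and would need a proper privacy review.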

Linkage: Migrating from ALF to new ALF2 and RALF2
SAIL uses a trusted third party. An additional benefit is the inclusion of process monitoring and remote reporting. End users will be able to see where in the process their requests are: a much better user experience.

Challenges
Need delegable/devolved account management, for both appliance and UKSeRP: Authentication, Authorisation, Accounting.
- Opportunity to develop a new security model which covers all aspects of the infrastructure, not just the database, allowing the model to be validated and used in many ways
- Modular and extensible provisioning system, taking the model and applying the intention
- Event driven rather than time based: better service to the user
- Modular user activation, e.g. SAIL DAA, HR system lookup
- Ability to support multiple two-factor authentication systems
- Linking into JANET MoonShot host organisation authentication (under development)

Challenges
Dataset documentation is patchy and vague.
- Structured documentation is now mandatory for a dataset to be loaded into the appliance. Datasets can only be loaded into SAIL using the appliance.
- Ability to attach artefacts (supporting documents) to a dataset
- Ability to load data as reference / lookup data within a dataset
Opportunity: Partnership with Manitoba Centre for Health Policy, Canada. Special thanks to Mark Smith. Bring the automated data quality reporting to the appliance, and turn the appliance back on the SAIL Databank to automatically collate and measure the quality metrics. We now have a database-system-agnostic data quality module. Looking forward to a continued partnership to look at measuring not just variable quality but relationships between variables and other datasets, as well as creating a pluggable architecture to do dataset-specific statistical analysis.
Opportunity: Create an automatic Data Catalogue based on the dataset's documentation, computed metrics and validation rules. Link into IHDLN Working Group on Metadata.
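
A database-agnostic quality module can start from per-variable metrics computed on plain rows. The Python sketch below shows two such metrics, completeness and distinct-value count. It is an illustrative assumption, not the Manitoba Centre for Health Policy framework or the appliance's module, and the function name and row format are hypothetical.

```python
def variable_metrics(rows, missing=("", None)):
    """Compute completeness and distinct-value count for every variable.

    rows is a list of dicts (one per record); values found in `missing`
    count as absent.
    """
    names = sorted({key for row in rows for key in row})
    report = {}
    for name in names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in missing]
        report[name] = {
            "completeness": len(present) / len(rows) if rows else 0.0,
            "distinct": len(set(present)),
        }
    return report


records = [
    {"sex": "F", "age": "34"},
    {"sex": "", "age": "34"},
    {"sex": "M", "age": None},
    {"sex": "F", "age": "51"},
]
report = variable_metrics(records)
```

Working on rows rather than SQL keeps the same code usable whichever database platform supplied the data. Relationship-level checks between variables, as the slide anticipates, would need a further layer.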

Challenges
Data movement and routing, both data and metadata:
- Automated splitting of data files
- Trusted third parties
- Data refreshes
- Organisations outside NHS
- Sharing / Subscriptions
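
One way to picture the automated splitting of data files is to cut each record into an identifiable part and a content part that share only a random key. The Python sketch below assumes the identifiable part goes to the trusted third party and the content part to the databank. That routing, the field names and the `record_key` column are assumptions for illustration, not a description of the appliance's File Splitter.

```python
import uuid


def split_records(rows, identifier_fields):
    """Split each row into (identifiable, content) parts joined by a random key."""
    identifier_fields = set(identifier_fields)
    identifiable, content = [], []
    for row in rows:
        key = uuid.uuid4().hex  # random, carries no information about the person
        identifiable.append(
            {"record_key": key, **{f: row[f] for f in identifier_fields}}
        )
        content.append(
            {"record_key": key,
             **{k: v for k, v in row.items() if k not in identifier_fields}}
        )
    return identifiable, content


rows = [{"nhs_number": "111", "name": "Ann", "diagnosis": "J45"}]
ident, clinical = split_records(rows, ["nhs_number", "name"])
```

Neither output alone links a person to their clinical content. The two halves can only be rejoined after the linkage party has replaced the identifiers with an anonymised linking field.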

Why such a disruptive technology (6 months to build!!)
The system was fine and everybody was happy. As we started designing, it became obvious how the system should be reconfigured with the data appliance as the central component.
[Diagram: over-simplified representation of SAIL Databank. Labels: SAIL DATABANK; IBM DB2; Remote Access; Security / IG; SAIL Technical; Additional Data]

Why such a disruptive technology (6 months to build!!)
SAIL Databank has/will/is becoming an instance of UKSeRP and fully dependent on the Data Appliance.
[Diagram labels: Remote Access / VDI; Other Major Programme; SAIL DATABANK; Security / IG; Data Appliance; BI / Compute; NLP; HADOOP; MS SQL; IBM DB2; Data Management and Loading; Additional Data; File Splitting & Separation; Versioning; Auto Documentor; Data Metrics and Quality reporting; Data Catalogue; Data transportation]

Why such a disruptive technology (6 months to build!!)
- Deployment of SAIL Databank Satellites / SAIL Mini: 7 NHS trusts of Wales, 4 NHS trusts of Bristol (England), 1 NHS Trust of North Devon
- Major upgrades to our trusted third party (NWIS)
- SAIL Databank major technology upgrade
- ADRC: non-health focused

Data Appliance Simon Ellwood-Thompson SWANSEA UNIVERSITY SIMON@CHI.SWAN.AC.UK