LabKey Server: An open source platform for scientific data integration, analysis, and collaboration



Similar documents
Adam Rauch Partner, LabKey Software Extending LabKey Server Part 1: Retrieving and Presenting Data

Karl Lum Partner, LabKey Software Evolution of Connectivity in LabKey Server

CPAS Overview. Josh Eckels LabKey Software

Brian Connolly Systems Engineer, LabKey Software LabKey Server in the Cloud

Data Management for Large Studies Robert R. Kelley, PhD. Thursday, September 27, 2012

Harnessing the Web to Accelerate Clinical Research How a LabKey Solution Helps Scientists Coordinate the Search for an HIV/AIDS Vaccine

Putting the pieces together: Integrated Research Data Management Using the LabKey Server

Practical application of SAS Clinical Data Integration Server for conversion to SDTM data

#jenkinsconf. Jenkins as a Scientific Data and Image Processing Platform. Jenkins User Conference Boston #jenkinsconf

The Recipe for Sarbanes-Oxley Compliance using Microsoft s SharePoint 2010 platform

LabArchives Electronic Lab Notebook:

E-Commerce: Designing And Creating An Online Store

Pivot Charting in SharePoint with Nevron Chart for SharePoint

TDAQ Analytics Dashboard

Contents. Overview. The solid foundation for your entire, enterprise-wide business intelligence system

SharePoint 2010 Intranet Case Study. Presented by Peter Carson President, Envision IT

Xythos WebFile Server Architecture A Technical Guide to the Core Technology, Components, and Design of the Xythos WebFile Server Platform

A New Tool for Monitoring CMS Tier 3 LHC Data Analysis Centers

A Web-Based Data Management System for Small And Medium Clinical Trials

SQL Server Administrator Introduction - 3 Days Objectives

Course MS55077A Project Server 2013 Development. Length: 5 Days

Filestor Digital Asset Management. The way it works

Implementing Data Models and Reports with Microsoft SQL Server 2012 MOC 10778

An ESRI White Paper October 2009 ESRI Geoportal Technology

Sisense. Product Highlights.

John Perrotta j.perrotta@ninesharp.co.uk

Jiří Kadlec and Daniel P. Ames*

Mascot Integra: Data management for Proteomics ASMS 2004

Quick start. A project with SpagoBI 3.x

MS-55052: SharePoint 2013 End User Level II

SQL Server 2012 Business Intelligence Boot Camp

Communiqué 4. Standardized Global Content Management. Designed for World s Leading Enterprises. Industry Leading Products & Platform

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

EMBL Identity & Access Management

Microsoft Technology Practice Capability document. MOSS / WSS Building Portal based Information Worker Solutions. Overview

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Day 1 - Technology Introduction & Digital Asset Management

How To Understand The History Of A Webmail Website On A Pc Or Macodeo.Com

Model Driven Laboratory Information Management Systems Hao Li 1, John H. Gennari 1, James F. Brinkley 1,2,3 Structural Informatics Group 1

MicroStrategy Course Catalog

Seamless Web Data Entry for SAS Applications D.J. Penix, Pinnacle Solutions, Indianapolis, IN

BarTender Print Portal. Web-based Software for Printing BarTender Documents WHITE PAPER

Current Order Tool Experiences Complaints

PBI365: Data Analytics and Reporting with Power BI

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Two new DB2 Web Query options expand Microsoft integration As printed in the September 2009 edition of the IBM Systems Magazine

INSPIRE Dashboard. Technical scenario

Visualization of Semantic Windows with SciDB Integration

Test Data Management Concepts

NetWrix SQL Server Change Reporter. Quick Start Guide

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Content Management Systems: Drupal Vs Jahia

SharePoint 2013 for Business Process Automation

Implementing and Maintaining Microsoft SQL Server 2005 Reporting Services COURSE OVERVIEW AUDIENCE OUTLINE OBJECTIVES PREREQUISITES

Semantic Data Management. Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies

NaviCell Data Visualization Python API

Open Source Technologies on Microsoft Azure

Last Updated: July STATISTICA Enterprise Server Security

Data Lab Operations Concepts

<Insert Picture Here> Michael Hichwa VP Database Development Tools Stuttgart September 18, 2007 Hamburg September 20, 2007

SQL Server Training Course Content

Use Cases for Argonaut Project. Version 1.1

Dell Statistica Web Data Entry

WEBFOCUS QUICK DATA FOR EXCEL

ArcGIS. Server. A Complete and Integrated Server GIS

2012 LABVANTAGE Solutions, Inc. All Rights Reserved.

Integrating Big Data into the Computing Curricula

Chapter 24: Creating Reports and Extracting Data

KRYSTAL DMS User s Guide Enterprise Edition Version 8.0

NoSQL and Agility. Why Document and Graph Stores Rock November 12th, 2015 COPYRIGHT 2015 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED

NetIQ Identity Manager Setup Guide

Biorepository and Biobanking

IO Informatics The Sentient Suite

Reporting Services. White Paper. Published: August 2007 Updated: July 2008

Background on Elastic Compute Cloud (EC2) AMI s to choose from including servers hosted on different Linux distros

BusinessObjects XI R2 Product Documentation Roadmap

THE FUTURE OF COLLABORATION

IBM InfoSphere MDM Server v9.0. Version: Demo. Page <<1/11>>

Client Requirement. Why SharePoint

Lawrence Tarbox, Ph.D. Washington University in St. Louis School of Medicine Mallinckrodt Institute of Radiology, Electronic Radiology Lab

ActiveVOS Server Architecture. March 2009

Data Management in an International Data Grid Project. Timur Chabuk 04/09/2007

Adobe Systems Incorporated

Software Design Proposal Scientific Data Management System

100% NO CODING NO DEVELOPING IMMEDIATE BUSINESS -25% -70% UNLIMITED SCALABILITY DEVELOPMENT TIME SOFTWARE STABILITY

owncloud Architecture Overview

ABSTRACT TECHNICAL DESIGN INTRODUCTION FUNCTIONAL DESIGN

Document Management Server - Overview

Tool-Assisted Knowledge to HL7 v3 Message Translation (TAMMP) Installation Guide December 23, 2009

Exam Name: IBM InfoSphere MDM Server v9.0

Implementing and Maintaining Microsoft SQL Server 2008 Reporting Services

A collaborative platform for knowledge management

e-science Applications with Duckling Collaboration Library

SHAREPOINT 2016 POWER USER BETA. Duration: 4 days

EcoTrends Cyber-infrastructure Development

ArcGIS 10.1 Geodatabase Administration. Gordon Sumerling & Christopher Brown

Course Code NCS2013: SharePoint 2013 No-code Solutions for Office 365 and On-premises

Microsoft Dynamics GP Architecture. White Paper. This document describes the architecture for Microsoft Dynamics GP.

WHITE PAPER. Domo Advanced Architecture

Case Study. Web Application for Financial & Economic Data Analysis Brainvire Infotech Pvt. Ltd Page 1 of 1

Transcription:

LabKey Server: An open source platform for scientific data integration, analysis, and collaboration Mark Igra Partner, LabKey Software marki@labkey.com

Presentation Topics Why Scientific Data Integration Evolution of LabKey Server Common requirements for scientific data integration Architecture Data Model Security Extensibility More Future Directions 2

Why Scientific Data Integration Data Volume Hundreds of millions of results from thousands of high throughput assay runs Data Variety Clinical, Demographic, Assay & Specimen Data Scientific data annotation Collaboration Investigators aren t all in the same place Need secure data sharing Broad Range of Analyses Queries, Reports, Domain Specific Tools 3

Evolution of LabKey Server Core team came from software industry, not bioinformatics Moved to Hutchinson Center to work on proteomics for biomarker discovery Expanded from there into More assays Study data management Lab Data Management Evolution of LabKey is case study in shared requirements 4

Proteomics.fhcrc.org Problems Faced Number of reads from MS2-based proteomics assay were exploding & difficult to handle on existing tools Many assay runs needed to be done to fulfill NCI contract for early biomarker detection Analysis Pipeline on cluster was difficult to optimize Solution The original LabKey Server (CPAS) Pipeline to run MS2 analysis jobs via web browser and load results into Database Web based analysis tools to view, combine & share results Run by FHCRC Shared Resources Lots of Data More than 90,000 MS2 Runs More than 700,000,000 peptide identifications Rauch et al, Journal of Proteome Research, 5/2006 5

Proteomics Examples 6

Software Requirement Summary Data Size Proteomics 700M Peptides 90K Runs Data Pipeline Collaboration Specimens Security 7

Flow Cytometry Problem Lab using Flow Cytometry to measure intracellular cytokines in many samples Per-run quantity of data increasing Need consistent analysis within an experiment type Cross-run analyses Each experiment type may have different statistics Solution: LabKey Flow High-throughput flow analysis engine Loading of flow statistics Adaptable data model for varying analysis types Query tools for analyzing data Shulman et al, Cytometry A, Sep 8 2008 8

Flow Results 9

View Graphs 10

Customize View 11

Software Requirement Summary Data Size Proteomics 700M Peptides 90K Runs Flow 50M Statistics Data Pipeline Collaboration Specimens Security Data Variety Query 12

Atlas SCHARP Data Portal Problems faced Combine many data types for HIV Vaccine studies Clinical Response Forms (CRF), Specimens, Many Assays Enable secure collaboration for scientists worldwide Allocate & distribute valuable specimens Solution Secure web portal for HIV Vaccine Enterprise Data Used by several networks to share data CHAVI, CAVD, HVTN, HPTN (3000 Users Worldwide) Core software was written by LabKey SCHARP runs Atlas Defines available data and relationships Manages security and permissions Manages data loading Builds custom modules Nelson et al, BMC Bioinformatics, March 2011 Piehler et al, BMC Immunology, May 2011 13

Atlas Data Flows Data Providers Sample Info labid uspeci txtpid parusp drawdm drawdd drawdy 4633 9.99E+08 4632 3 23 2005 262 308 13472 9.99E+08 13471 3 23 2005 262 4641 9.99E+08 4640 4 14 2005 LIMS 262 4650 9.99E+08 4649 5 9 2005 262 4652 9.99E+08 4651 5 11 2005 308 13480 9.99E+08 13479 5 11 2005 262 4668 9.99E+08 4667 4 13 2005 308 13486 9.99E+08 13485 4 13 2005 262 4684 9.99E+08 4683 5 11 2005 262 4751 9.99E+08 4750 6 21 2005 308 13560 9.99E+08 13559 6 21 2005 262 4769 9.99E+08 4768 6 28 2005 308 13578 9.99E+08 13577 6 28 2005 262 4850 9.99E+08 4849 6 1 2005 262 4897 9.99E+08 4896 8 17 2005 262 4898 9.99E+08 4896 8 17 2005 262 4914 9.99E+08 4913 8 30 2005 308 13933 9.99E+08 13932 8 30 2005 262 4922 9.99E+08 4921 9 1 2005 308 13941 9.99E+08 13940 9 1 2005 262 4934 9.99E+08 4933 9 14 2005 308 13953 9.99E+08 13952 9 14 2005 262 4935 9.99E+08 4933 9 14 2005 308 13954 9.99E+08 13952 9 14 2005 262 4950 9.99E+08 4949 9 28 2005 308 13969 9.99E+08 13968 9 28 2005 262 4951 9.99E+08 4949 9 28 2005 308 13970 9.99E+08 13968 9 28 2005 307 773 9.99E+08 772 8 24 2005 307 774 9.99E+08 772 8 24 2005 307 800 9.99E+08 799 8 30 2005 307 801 9.99E+08 799 8 30 2005 Atlas Web + Database Servers Data Viewers Leadership Labs Labs Assays Forms CRF DOB BP sys BP dia Notes Collaborators 14

Study Page 15

Dataset 16

Participant View 17

Participant Specimen Summary 18

Specimen Details 19

Request Specimens 20

Lab Data Flows Labs Lab Data Process Atlas Web + Database Servers Data Viewers Leadership Assays 1. Request Specimens 2. Run Assay 3. Upload Spreadsheet and Metadata 4. Automated Transform and Analysis (optional) 5. Review 6. Copy to Study Labs Collaborators 21

Custom Applications Dozens of small applications needed for different groups Most used data already in LabKey System But custom workflows, reports & analysis required SCHARP has detailed knowledge of the requirements LabKey provided an API and simple application building tools 22

HVTN IQC Internal Quality Control for Sample Processing Cell Yield & Processing Time

RV144 Followup Study Tracking System Tools for tracking information about studies

Software Requirement Summary Data Size Proteomics Flow Atlas (Studies) 700M Peptides 90K Runs 50M Statistics 20K Subjects 800K Specimens 1200 Datasets Atlas (Labs) 30K Assay Runs Data Pipeline Collaboration Specimens Security Data Variety Query Study Model Custom Reports API Auditing 25

Primate Electronic Health Record Problem 30 years of daily data on thousands of animals Clinical staff & vets need health record Researchers need scientific data Colony Management is an ongoing problem Solution EHR is a a specialized, ongoing study LabKey enhanced scalability of study solution LabKey enhanced API LabKey wrote tools to transfer data into new EHR within minutes of entry into old EHR Now LabKey EHR is only one in use Wisconsin Primate Center built custom views & reports to analyze the data 26

Primate EHR 27

Software Requirement Summary Data Size Proteomics Flow Atlas (Studies) 700M Peptides 50M Statistics 20K Subjects 90K Runs 800K Specimens 1200 Datasets Atlas (Labs) 30K Assay Runs Data Pipeline Primate EHR 13K Animals 1.2M Drug Doses Collaboration Specimens Security Data Variety Query Study Model Custom Reports API Auditing 28

Software Requirement Summary Data Size Proteomics Flow Atlas (Studies) 700M Peptides 90K Runs 50M Statistics 20K Subjects 800K Specimens 1200 Datasets Atlas (Labs) 30K Assay Runs Data Pipeline Primate EHR 13K Animals 1.2M Drug Doses Collaboration Specimens Security Data Variety Query Study Model Custom Reports API Auditing 29

User Requirements Data integration is the unifying theme Samples to Assays Assays to Subjects Subjects to Studies Translational Medicine involves integrating data from Molecules to Populations (Kuhn et al 2008) Data types to integrate are constantly evolving Both file and structured data types Can t rebuild new systems for new data types Tools grow as fast as data types Scientists need to share data Central lab doing work for distributed clients Distributed group of labs contributing Need to manage many users 30

Basic Architecture Java web application Runs on Apache Tomcat web server Compatible with Windows, Linux, Solaris, Mac, etc Incorporates open-source libraries Extensible via modules Relational database server PostgreSQL: open-source, all common operating systems Microsoft SQL Server: commercial product, Windows only Abstraction layer allows other database servers Network file storage Analysis pipeline: conversion, search, processing Secure WebDAV access to pipeline and data 31

LabKey Basic Architecture LK Web Server (Tomcat) File System SQL Database (PostgreSQL, MS SQL) LabKey Schemas

LabKey Data Connectivity LK Web Server (Tomcat) File System LabKey PostgreSQL/MS SQL Database File System 2 SAS Share Other PostgreSQL Database MS SQL Database My SQL LabKey Schemas More Schemas Data 1 Data 3

Site Admin Portal / Wiki CPAS (Proteomics) Flow Cytometry Text Assays Custom Assay 1 Custom Assay 2 Labkey Server Architecture Interaction Layer: Security, Web Views, API Studies Integrate Many Data Types Experiment Services Base Services (Security, Data Services, Full Text Search) Data Storage (Relational Database + File System) = Modules = Shared services

Major Feature Areas Data Model Security Extensibility File Handling Assays Data pipelines to get data in Reports & visualizations to get data out Auditing 35

Data Model Goals & Requirements Consistent User interface Reporting & Querying Seamless data integration Select columns across study, assay, specimen data Extensible Data types are dynamic & user defined Existing data stored in external systems Rich Column Types Concepts, Lookups, Formatting, URL, Validation Out of Range (e.g. < 30), SAS-style missing values Secure: See security slide 36

LabKey Data Model Data Exposed via Relational Model Virtual Schema Modules expose a User Schema User Schema maps tables/columns to queries on underlying database User Schema depends on user s permissions Users typically query by editing grid views Add columns from related tables Filter, Sort Full SQL Queries Available SQL translated to underlying relational database 37

Query Processing Custom Application (Perl, R, JavaScript) Using API SQL Query or Table + Column List Tabular Data (including additional metadata) API Layer LabKey Server LabKey Query Service Translated SQL Tabular Data Relational DB 38

Major Feature Areas Data Model Security Extensibility File Handling Assays Data pipelines to get data in Reports & visualizations to get data out Auditing 39

Security Model Requirements Pervasive Applies to data no matter how you get at it Modules, Search, Reporting, Queries, API, Files Easy Partitioning of Data by security context Easy Administration Uniform administration across applications Integrates with existing authentication systems Easy implementation by custom modules Extensible where necessary 40

Folders, Files and Data Folder 1 Folder 2 Files Proteomics Flow Data and files are visible in folders 41

Major Feature Areas Data Model Security Extensibility File Handling Assays Data pipelines to get data in Reports & visualizations to get data out Auditing 42

Developers, Admins & Users Users: Create custom views of their data using graphical tools Administrators Configure LabKey Server Set up projects and security. Customize schema via data sets, assays, etc. Create SQL queries and reports Application Module Developers Build Custom Modules solving particular problems Modules may be private or shared, open source or not File-based modules (HTML + Script + SQL + XML) Java Modules (usually built by LabKey) Platform Developers: Build the core LabKey Server and key modules Java code that is distributed in the LabKey Server distribution. LabKey Software employees with few exceptions 43

Reporting Needs to be available on all data in the system Need flexible reporting infrastructure LabKey cannot predict reports needed Integrate R Interactive reports are preferable Grids are interactive now R reports allow filtering/redisplaying data Interactive graphs under development 44

Report Examples 45

File Handling Requirements Files are the fundamental unit of data exchange Contain structured & unstructured data Sometimes huge Sometimes many little files Need to find the files Full text search + properties Need rapid & flexible file transfer Need to analyze & load structured data within files Extensible actions including load data 46

WebDAV Connectivity LK Web Server (Tomcat) File System 47

Auditing Audit service available to all modules Modules responsible for auditing actions Easy implementation for basic data auditing Would be nice if this were automatic Many modules have custom audit events (e.g. login, password reset etc) Each audit event type can have custom associated data E.g. list updates store old values in audit trail 48

Futures Better out of the box interface Data Integration Improvements Data Loading ID Management Smart Union of related datasets Expand External Tools & Services Integration More Interactive Visualizations 49

Summary Data Integration is a common need across scientific endeavors LabKey Server is a platform for addressing these challenges Secure Built for specialized challenges of scientific data Extensible Open We continue to work hard to advance the platform capabilities and usability 50

Acknowledgements Partners & Funders Martin McIntosh (FHCRC) NCI Canary Foundation Steve Self (SCHARP) CHAVI (NIH) HVTN (NIH) CAVD (Gates Foundation) David O Connor (Wisconsin) Primate EHR (ARRA) Genotyping Tools (NIAID) Parag Mallick (USC) Michael Katze (UW) Immune Tolerance Network LabKey Development Matthew Bellew Adam Rauch Britt Piehler Josh Eckels Kevin Krause Brendan MacLean (now UW) Nick Shulman (now UW) Karl Lum Nick Arnold Cory Nathe Ben Bimber Alan Veniza Documentation Elizabeth Nelson Steve Hanson Test Trey Chaddick Elizabeth Van Nostrand Management & Operations Britt Piehler Kristin Fitzimmons Peter Hussey Ren Lis 51