Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM lecrom@biologie.ens.fr





Various steps of a DNA microarray experiment
[Workflow diagram with two tracks, "Experimental steps" and "Data analysis". Boxes shown: experimental design set-up, catalogue chips, home-made chips, available databases, hybridisation, data mining, image analysis, raw data treatment (normalisation), statistical analysis, storage, data treatment, data representation, clustering.]

DNA microarray bioinformatic analysis: Data storage

Data flow management
Example of a data management structure
[Diagram. Flow shown: images obtained from the scanner, image analysis, raw data, normalisation, normalised data, published data. Storage and access elements shown: file server, web interface, web databases, public databases, intranet, Internet.]

Data management with microarrays
There are three management levels for microarrays:
1. Public data repository: built on the most flexible schema to ensure storage of heterogeneous data, such as data from studies of several organisms, different protocols and different data analysis processes.
2. Institutional database: built to help a group of users on a dedicated technical platform or to fit a dedicated project.
3. Locally installed database: built and installed for a small user group to answer very specific and precise questions.

DNA microarray bioinformatic analysis: LIMS (Laboratory Information Management System)

LIMS: the data management core
A local database to follow experiments. LIMS = Laboratory Information Management System
- Each experimental step of the protocol is stored in the database.
- It allows array and quality control follow-up.
- If flexible, it allows analyses with several DNA microarray slide types.
- It currently has a set of tables to help gene name determination and to create links with data mining tools.
- It allows various visualisation types.
- The LIMS characteristics can be customised and adjusted to user needs (supplementary tables, dedicated query module).
The LIMS must be designed as a set of modules.
Slide types shown: glass slide, GeneChip slide

How to build a LIMS database
The key steps to build a LIMS database:
1. Choose the database management system, taking into account its price, its final use and its capacity to expand.

               Scalability   Ease   Price
   Oracle      ***           *
   MS SQL      **            **
   PostgreSQL  **            *      0
   MySQL       **            *      0
   Access      *             ***
   FileMaker   *             ***

2. Take your time to design the schema: it will determine how the database can be used.
3. Think about security from the beginning (data loss, corruption of data integrity, etc.).
4. Always keep data in its raw format.

How to build a LIMS database
The key steps to build a LIMS database (continued):
5. Write filters to import and export data. Some languages are better suited to this type of data handling (PHP, Perl, Java, etc.).
6. Build links to external sources.
7. Use available standards: HTML, SQL; MIAME, MAGE-OM.
8. Trace each modification and each data treatment step.
9. Do not forget to BACKUP your data.
10. Try to build a database that can evolve in the future: it is impossible to solve all problems at once.
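Steps 4, 5 and 8 above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module only to stay self-contained; the table names, column names and helper functions are hypothetical and do not come from any existing LIMS. Raw values are stored once and never overwritten, treated values go to a separate table, and every import or treatment leaves a line in an audit table.

```python
import sqlite3

# Hypothetical minimal schema: all names are illustrative only.
SCHEMA = """
CREATE TABLE raw_data (
    id        INTEGER PRIMARY KEY,
    slide     TEXT NOT NULL,
    reporter  TEXT NOT NULL,
    intensity REAL NOT NULL            -- raw value, never updated (step 4)
);
CREATE TABLE treated_data (
    raw_id    INTEGER NOT NULL REFERENCES raw_data(id),
    treatment TEXT NOT NULL,
    value     REAL NOT NULL
);
CREATE TABLE audit_log (               -- trace of every operation (step 8)
    id          INTEGER PRIMARY KEY,
    happened_at TEXT DEFAULT CURRENT_TIMESTAMP,
    action      TEXT NOT NULL
);
"""


def open_lims(path=":memory:"):
    """Create the tables in a new database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def import_raw(conn, slide, rows):
    """Import filter (step 5): rows is an iterable of (reporter, intensity)."""
    rows = list(rows)
    with conn:  # one transaction: data and audit line are committed together
        conn.executemany(
            "INSERT INTO raw_data (slide, reporter, intensity) VALUES (?, ?, ?)",
            [(slide, reporter, intensity) for reporter, intensity in rows],
        )
        conn.execute(
            "INSERT INTO audit_log (action) VALUES (?)",
            (f"import {len(rows)} rows from {slide}",),
        )


def apply_treatment(conn, name, func):
    """Store func(raw value) for every raw row, leaving raw_data untouched."""
    with conn:
        raw = conn.execute("SELECT id, intensity FROM raw_data").fetchall()
        conn.executemany(
            "INSERT INTO treated_data (raw_id, treatment, value) VALUES (?, ?, ?)",
            [(raw_id, name, func(value)) for raw_id, value in raw],
        )
        conn.execute(
            "INSERT INTO audit_log (action) VALUES (?)", (f"treatment {name}",)
        )
```

Keeping treated values in their own table means a different normalisation can be applied later without touching the original measurements. A production system would more likely run on one of the database servers compared in step 1.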

Expression data databases
Open Source projects: BASE, MADAM, MaxdSQL
Categories shown: local installation, public repository, public querying
Other databases shown: SMD, GeneX, ArrayExpress, GEO, GXD, RAD, ChipDB

BASE - BioArray Software Environment
A database for local management of microarray data:
- Plug-in structure
- Storage of all important steps of a DNA microarray experiment
- MIAME and MAGE-ML compliant
- Open Source project (MySQL/Linux)
Website: http://base.thep.lu.se/

BASE Home Page
The array production part of BASE (Array LIMS) is an optional component. A list of reporters (the probes on the array) can be created or uploaded from an existing file at any time, which lets the user annotate them (identifiers, position on chromosome, etc.).

Sample management
BASE was designed to follow the natural workflow of microarray data. Samples are the starting point of all data analysis in BASE.

Create an extract from sample

Labeled extract
Several labeling steps and protocols can be applied to each extract. There is no management of amplified extracts.

Protocol follow-up in BASE
Each experimental step (sample, extract, labeled extract, hybridisation, etc.) must have an associated protocol in the database.
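The chain sample, extract, labeled extract, hybridisation and the rule that every step carries a protocol can be sketched as a small data model. This is not BASE's actual schema: the `Step` class, the `WORKFLOW` table and the `lineage` helper are hypothetical, and the sketch gives each step a single parent, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional

# Each kind of step and the kind of step it must be derived from.
WORKFLOW = {
    "sample": None,
    "extract": "sample",
    "labeled extract": "extract",
    "hybridisation": "labeled extract",
}


@dataclass(frozen=True)
class Step:
    kind: str
    name: str
    protocol: str
    parent: Optional["Step"] = None

    def __post_init__(self):
        if self.kind not in WORKFLOW:
            raise ValueError(f"unknown step kind: {self.kind!r}")
        # Rule from the slide: no step without an associated protocol.
        if not self.protocol.strip():
            raise ValueError(f"{self.kind} {self.name!r} has no protocol")
        # The parent must be the step that precedes this one in the workflow.
        parent_kind = self.parent.kind if self.parent else None
        if parent_kind != WORKFLOW[self.kind]:
            raise ValueError(
                f"{self.kind} must derive from {WORKFLOW[self.kind]!r}, "
                f"got {parent_kind!r}"
            )


def lineage(step):
    """Return the kinds of all steps from the original sample down to step."""
    chain = []
    while step is not None:
        chain.append(step.kind)
        step = step.parent
    return chain[::-1]
```

Rejecting a step at creation time, rather than checking later, keeps incomplete records out of the database in the first place.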

Hybridization management

Hybridization and scan

Creation of a Raw Data Set
Select the result file.

Experiment management
An experiment is a collection of Raw Data Sets associated with any analysis steps.

Experiment analysis steps

Plot visualisation system

Experiment explorer tool

DNA microarray bioinformatic analysis: Public data repository

Goal:
- To give access to raw data for validation of published data
- To enable comparison and exchange with other research groups
- To allow comparison of microarray designs
- To enable the development of new analysis methods
Examples:
- ArrayExpress - EBI
- Gene Expression Omnibus - NCBI
- Stanford Microarray Database - Stanford
- ExpressDB - Harvard
Expression data repository: http://www.ncbi.nlm.nih.gov/geo/

Data standardisation - MGED
Why do we need to define a standard?
- To specify the minimal information needed to characterise a DNA microarray experiment
- To allow data interpretation and verification by other laboratories
- To simplify the set-up of data repositories and the exchange of results between laboratories
http://www.mged.org

Microarray data standard - MIAME
MIAME - Minimum Information About a Microarray Experiment:
- The MIAME standard defines the minimal information that must be submitted with microarray data to allow their reuse, a different normalisation or a new interpretation.
- The MIAME standard is not designed as a questionnaire to be filled in, but only as an informal specification on which an annotation tool can be based.
- Although MIAME is conceptually independent of databases, the aim of establishing a microarray database should be kept in mind when reading MIAME.
- The standard consists of 6 parts:
1. Experimental design: contains all hybridisation information.
2. Array design: contains data on each microarray used and on all spotted reporters.
3. Samples: describes each sample used, with its preparation and labeling conditions.
4. Hybridisations: contains protocols and parameters.
5. Measurements: bundles all the data, images and quantification methods.
6. Controls: describes the different controls used.
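Since MIAME is a specification on which an annotation tool can be based, a tool might start by checking that all six parts listed above are present. The sketch below assumes an annotation stored as a plain dictionary. The key names are hypothetical labels for the six parts and are not an official MIAME or MAGE-ML schema.

```python
# Hypothetical keys, one per MIAME part listed on the slide.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridisations",
    "measurements",
    "controls",
)


def missing_miame_sections(annotation):
    """Return the MIAME parts that are absent or empty in an annotation."""
    return [name for name in MIAME_SECTIONS if not annotation.get(name)]


def is_miame_complete(annotation):
    """True when every one of the six parts has some content."""
    return not missing_miame_sections(annotation)
```

A submission form could call `missing_miame_sections` before accepting a dataset and report the missing parts to the user. Checking the content of each part is a separate and much larger task.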

Exchange data format - MAGE-ML
MAGE-ML - Microarray and Gene Expression Data Markup Language:
- Exchange format based on XML that allows the storage and transfer of organised microarray data.
- Format that bundles all the information needed to create the MIAME dataset.
http://www.mged.org

Functional gene annotation
Gene Ontology: an ontology is a specification that includes relations between concepts.
Ontologies are necessary to:
- eliminate ambiguities
- give semantic constraints
- create a shared language between humans and computers
- allow reliable comparisons
=> Creation of a standard vocabulary
http://www.geneontology.org/

GO: hierarchical modelling of concepts
Gene Ontology:
- Use biological terms to qualify microarray results
- Use microarray results to extend a biological knowledge database
- Exploit results for data mining
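The hierarchical modelling can be illustrated with a toy graph. Because a term may have several parents, a gene annotated with a specific term is implicitly annotated with all of its ancestors, which is what makes it possible to qualify a gene list at different levels of detail. The term names and the `PARENTS` table below are invented for the example and are not real Gene Ontology entries.

```python
from collections import Counter

# Toy ontology: each term maps to its direct parents (not real GO terms).
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "iron metabolism": ["metabolism"],
    "ion transport": ["transport"],
    "iron transport": ["ion transport", "iron metabolism"],  # two parents
}


def ancestors(term):
    """Return every term reachable by following parent links from term."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_counts(gene_terms):
    """Count genes per term, propagating each annotation to its ancestors.

    gene_terms maps a gene name to the list of terms it is annotated with.
    Each gene is counted at most once per term.
    """
    counts = Counter()
    for terms in gene_terms.values():
        covered = set()
        for term in terms:
            covered.add(term)
            covered |= ancestors(term)
        counts.update(covered)
    return counts
```

Applied to a list of genes selected from a microarray result, `term_counts` gives the number of genes under each term. That count is the starting point for the enrichment tests used by annotation tools.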

DNA microarray bioinformatic analysis: Mining expression data

Databases to help microarray data analysis
Many tools are available:
- YPD (Incyte) - http://www.proteome.com/ypdhome.html
- SGD (Stanford) - http://genome-www.stanford.edu/saccharomyces/
- Webminer (Walter Lab, UCSF) - http://webminer.ucsf.edu
- ExpressDB (Church Lab, Harvard) - http://arep.med.harvard.edu/cgi-bin/expressdbyeast/exdstart
But they need some improvement:
- They only allow queries for genes sharing a defined transcription profile
- They use few datasets
- They often lack ease of use and graphical analysis tools

Examples of expression data databases
Yeast Proteome Database (YPD) http://www.proteome.com/databases

Examples of expression data databases
Saccharomyces Genome Database (SGD) http://genome-www.stanford.edu/saccharomyces/

How to cross-reference expression data?
- Expression data mining for one gene
- Access to publication datasets
- Find common profiles between experiments
- Compare gene expression between experiments
- Search for coregulated genes
http://www.transcriptome.ens.fr/ymgv/
S. Le Crom et al. (2002) Nucleic Acids Research 30(1): 76-79
P. Marc et al. (2001) Nucleic Acids Research 29(13): E63-3

Allowing easy access to gene expression
- One profile per publication
- One histogram per condition (experiment)
Data mining: gene name, gene ontology, variations

Key information retrieval on each publication
- Experimental condition description
- Publication overview
- Gene expression data distribution

Find orthologous gene expression data
Schizosaccharomyces pombe OXA1 orthologous gene: similar expression of orthologous genes during the same biological process or stress exposure can give interesting hints about the underlying regulation. Keep in mind that this kind of relationship can occur by chance.

The PDR network example
Search for one gene (PDR1): Hughes et al. (2000) Cell 102: 109
Search for several genes (PDR1, PDR3): Sudarsanam et al. (2000) PNAS 97: 3364
=> Involvement of an ergosterol biosynthesis regulation pathway?
=> Chromatin factors modify PDR3 activity

Find correlations between experiments
- YFH1 deletion (YFH1: mitochondrial protein involved in iron binding and storage): Foury and Talibi (2001) J. Biol. Chem. 276: 7762
- Zinc deprivation: Lyons et al. (2000) PNAS 97: 7957
Search for genes induced more than 3-fold
Stress response proteins, metal transporters, etc.
=> Interconnection between zinc and iron regulation mechanisms.
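The search described above, genes induced more than 3-fold in two independent experiments, amounts to a set intersection. In the sketch below the function names are hypothetical, and each experiment is assumed to be a dictionary mapping a gene name to its expression ratio (treated over control). No data from the cited publications are reproduced here.

```python
def induced(ratios, threshold=3.0):
    """Return the genes whose expression ratio is above the threshold."""
    return {gene for gene, ratio in ratios.items() if ratio > threshold}


def shared_induced(experiment_a, experiment_b, threshold=3.0):
    """Return the genes induced above the threshold in both experiments."""
    return induced(experiment_a, threshold) & induced(experiment_b, threshold)
```

A gene measured in only one of the two experiments can never appear in the result. A short shared list therefore also depends on how much the two gene sets overlap.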

Find correlations between organisms

Look for co-regulated genes
Apply a distance calculation to expression profiles in a selected subset of experiments. Display the list of genes most closely related to the query within the selected publication set.
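The slide does not say which distance is used, so the sketch below assumes one common choice, the Pearson correlation distance (1 minus the correlation coefficient). Profiles are assumed to be dictionaries mapping an experiment name to a log ratio, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def closest_genes(profiles, query, experiments, top=5):
    """Rank genes by correlation distance to the query gene.

    profiles maps gene -> {experiment: value}; only the selected experiments
    are compared. Returns (gene, distance) pairs, closest first, where the
    distance runs from 0 (same shape) to 2 (opposite shape).
    """
    reference = [profiles[query][e] for e in experiments]
    scored = []
    for gene, profile in profiles.items():
        if gene == query:
            continue
        values = [profile[e] for e in experiments]
        scored.append((1 - pearson(reference, values), gene))
    scored.sort()
    return [(gene, distance) for distance, gene in scored[:top]]
```

A correlation-based distance groups genes by the shape of their profile, not by the amplitude of the change. A Euclidean distance would also take amplitude into account.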

ymgv statistics
Advantages:
- An intuitive and simple interface
- Quick answers
- Statistics to better understand the data
- Ready to add more organisms
Points to improve:
- The database only contains the final ratios after the filtering steps
- The data were not re-normalised
- Dataset retrieval is not always available

Further proteome and transcriptome analyses
http://yeast.cellzome.com
KEGG: Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/

ENS transcriptome bioinformatics
Stéphane LE CROM - Gaëlle LELANDAIS - Sophie LEMOINE - Laurent JOURDREN
http://transcriptome.ens.fr