Lecture 11 Data storage and LIMS solutions Stéphane LE CROM lecrom@biologie.ens.fr
Various steps of a DNA microarray experiment
[Workflow diagram; labels: Experimental steps, Data analysis, Experimental design set-up, Chips on catalog, Home-made chips, Available databases, Hybridisation, Data mining, Image analysis, Raw data treatment - Normalisation, Statistical analysis, Storage, Data treatment, Data representation, Clustering]
DNA microarray bioinformatic analysis Data storage
Data flow management
Example of a data management structure
[Data flow diagram; labels: Images obtained from the scanner, Image analysis, Raw data, Normalisation, Normalised data, Published data, File server, Web databases, Web interface, Intranet, Internet, Public databases]
Data management with microarrays
There are three management levels for microarrays:
1. Public data repository: built on the most flexible schema to ensure storage of heterogeneous data, such as data from studies of several organisms, different protocols and different data analysis processes.
2. Institutional database: built to help a group of users on a dedicated technical platform or to fit a dedicated project.
3. Locally installed database: built and installed for a small user group, to answer very specific and precise questions.
DNA microarray bioinformatic analysis LIMS: Laboratory Information Management System
LIMS: the data management core
A local database to follow experiments
LIMS = Laboratory Information Management System
- Each experimental step of the protocol is stored in the database.
- It allows array and quality control follow-up.
- If flexible, it allows analyses with several DNA microarray slide types.
- It commonly has a set of tables to help gene name determination and to create links with data mining tools.
- It allows various visualisation types.
- The LIMS characteristics can be customized and adjusted to the user's needs (supplementary tables, dedicated query module).
The LIMS must be designed as a set of modules.
[Figure labels: Glass slide, GeneChip slide]
How to build a LIMS database
The key steps to build a LIMS database:
1. How to choose the database management system
- Take into account its price, its intended use and its capacity to expand

                Scalability   Ease   Price
Oracle          ***           *
MS SQL          **            **
PostgreSQL      **            *      0
MySQL           **            *      0
Access          *             ***
FileMaker       *             ***

2. Take your time to design the schema: it will determine how the database can be used
3. Think about security from the beginning (data loss, data integrity, corruption)
4. Always keep data in its raw format
How to build a LIMS database
The key steps to build a LIMS database:
5. Make filters to import and export data
- Some languages are better suited to this type of data handling (PHP, Perl, Java)
6. Build links towards external sources
7. Use available standards
- HTML, SQL
- MIAME, MAGE-OM
8. Trace each modification and data treatment step
9. Do not forget to BACKUP your data
10. Try to build a database that can evolve in the future
- It is impossible to solve all problems at once
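Several of the steps above (schema design, data integrity, keeping raw data untouched, tracing modifications) can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names, column names and sample values are hypothetical, and SQLite is used for convenience rather than because it appears in the comparison table.

```python
import sqlite3

# Hypothetical minimal schema: every table and column name is illustrative.
SCHEMA = """
CREATE TABLE protocol (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE sample (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    protocol_id INTEGER NOT NULL REFERENCES protocol(id)
);
CREATE TABLE raw_data (
    id        INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    file_name TEXT NOT NULL,
    content   BLOB NOT NULL
);
CREATE TABLE audit_log (
    id         INTEGER PRIMARY KEY,
    table_name TEXT NOT NULL,
    row_id     INTEGER NOT NULL,
    action     TEXT NOT NULL,
    old_value  TEXT,
    new_value  TEXT,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Step 8: every rename of a sample is traced automatically.
CREATE TRIGGER sample_rename AFTER UPDATE OF name ON sample
BEGIN
    INSERT INTO audit_log (table_name, row_id, action, old_value, new_value)
    VALUES ('sample', OLD.id, 'UPDATE', OLD.name, NEW.name);
END;
-- Step 4: stored raw data can never be modified in place.
CREATE TRIGGER raw_data_readonly BEFORE UPDATE ON raw_data
BEGIN
    SELECT RAISE(ABORT, 'raw data are immutable');
END;
"""


def create_lims(path=":memory:"):
    """Open a database, enable foreign key checks and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = create_lims()
conn.execute("INSERT INTO protocol (name) VALUES ('RNA extraction')")
conn.execute("INSERT INTO sample (name, protocol_id) VALUES ('wild type', 1)")
conn.execute("UPDATE sample SET name = 'wild type, 30C' WHERE id = 1")

# The rename above is now recorded with its old and new values.
audit_rows = conn.execute(
    "SELECT table_name, action, old_value, new_value FROM audit_log"
).fetchall()
```

Putting the constraints in the schema rather than in application code means that every import filter or web interface built later inherits the same integrity rules.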
Expression data databases
[Diagram; labels: Open Source projects, BASE, MADAM, MaxdSQL, Local installation, SMD, GeneX, Public repository, ArrayExpress, GEO, GXD, RAD, ChipDB, Public querying]
BASE - BioArray Software Environment A database for local management of microarray data: Plug-in structure Storage of all important steps of a DNA microarray experiment MIAME and MAGE-ML compliant Open Source project (MySQL/Linux) Website: http://base.thep.lu.se/
BASE Home Page
The array production part of BASE (Array LIMS) is an optional component.
A list of reporters (the probes on the array) can be created, or uploaded from an existing file, at any time: this lets the user annotate them (identifiers, position on the chromosome...)
Sample management BASE was designed to follow a natural workflow of microarray data. Samples are the starting point of all data analysis in BASE.
Create an extract from sample
Labeled extract Several labeling steps and protocols can be applied on each extract There is no management of amplified extracts
Protocol follow-up in BASE
Each experimental step (sample, extract, labeled extract, hybridisation) must have an associated protocol in the database.
Hybridization management
Hybridization and scan
Creation of a Raw Data Set Select the result file.
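Loading a result file is where an import filter, as recommended in the LIMS design steps, comes into play. The sketch below is not BASE's own importer: it assumes a hypothetical tab-delimited file with columns named reporter, ch1_intensity and ch2_intensity, and shows one way to parse the values while keeping the original bytes and a checksum alongside them.

```python
import csv
import hashlib
import io


def import_result_file(raw_bytes):
    """Parse a tab-delimited result file without altering the original bytes."""
    # The checksum lets later steps verify that the stored file is unchanged.
    checksum = hashlib.sha256(raw_bytes).hexdigest()
    text = raw_bytes.decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    spots = []
    for row in reader:
        spots.append({
            "reporter": row["reporter"],          # hypothetical column name
            "ch1": float(row["ch1_intensity"]),   # hypothetical column name
            "ch2": float(row["ch2_intensity"]),   # hypothetical column name
        })
    return {"raw": raw_bytes, "sha256": checksum, "spots": spots}


# Invented two-spot file, for illustration only.
example = b"reporter\tch1_intensity\tch2_intensity\nR1\t1200\t300\nR2\t450\t900\n"
data_set = import_result_file(example)
```

Keeping the raw bytes next to the parsed values means that any later normalisation can be redone from the original scanner output.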
Experiment management An experiment is a collection of Raw Data Sets associated with any analysis steps.
Experiment analysis steps
Plot visualisation system
Experiment explorer tool
DNA microarray bioinformatic analysis Public data repository
Goals:
- To give access to raw data so that published results can be validated
- To enable comparison and exchange with other research groups
- To allow comparison of microarray designs
- To enable the development of new analysis methods
Examples:
- ArrayExpress - EBI
- Gene Expression Omnibus - NCBI
- Stanford Microarray Database - Stanford
- ExpressDB - Harvard
Expression data repository: http://www.ncbi.nlm.nih.gov/geo/
Data standardisation - MGED
Why do we need to define a standard?
- To specify the minimal information required to characterise a DNA microarray experiment
- To allow data interpretation and verification by other laboratories
- To simplify the set-up of data repositories and the exchange of results between laboratories
http://www.mged.org
Microarray data standard - MIAME
MIAME - Minimum Information About a Microarray Experiment:
- The MIAME standard defines the minimal information that must be submitted with microarray data so that they can be reused, normalised again or reinterpreted.
- The MIAME standard is not designed as a questionnaire to be filled in, but as an informal specification on which an annotation tool can be based.
- Although MIAME is conceptually independent of any database, the aim of establishing a microarray database should be kept in mind when reading MIAME.
- The standard is made up of 6 parts:
1. Experimental design: contains information on the whole set of hybridisations.
2. Array design: contains data on each microarray used and on all spotted reporters.
3. Samples: describes each sample used, with its preparation and labeling conditions.
4. Hybridisations: contains protocols and parameters.
5. Measurements: bundles all the data, images and quantification methods.
6. Controls: describes the different controls used.
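Since MIAME is a specification on which an annotation tool can be based, a tool could begin with a simple completeness check over the six parts. The sketch below is only an illustration: the dictionary keys are hypothetical names chosen for this example, not identifiers defined by MIAME or MAGE.

```python
# Hypothetical keys, one per MIAME part listed above.
MIAME_SECTIONS = (
    "experimental_design",
    "array_design",
    "samples",
    "hybridisations",
    "measurements",
    "controls",
)


def missing_miame_sections(annotation):
    """Return the MIAME parts that are absent or empty in an annotation dict."""
    return [section for section in MIAME_SECTIONS if not annotation.get(section)]


# Invented annotation: two parts are missing or empty.
annotation = {
    "experimental_design": "time course, 4 hybridisations",
    "array_design": "home-made glass slide",
    "samples": ["wild type", "mutant"],
    "hybridisations": [],
    "measurements": ["scan1.tif", "scan1.txt"],
}
missing = missing_miame_sections(annotation)
```

A submission form could refuse to export a data set until this list is empty.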
Exchange data format - MAGE-ML
MAGE-ML - Microarray and Gene Expression Markup Language:
- XML-based exchange format that allows the storage and transfer of organized microarray data.
- Format that bundles all the information needed to create the MIAME dataset.
http://www.mged.org
Functional gene annotation
Gene Ontology: An ontology is a specification that includes relations between concepts.
Ontologies are necessary to:
- eliminate ambiguities
- give semantic constraints
- create a shared language between humans and computers
- allow reliable comparisons
=> Standard vocabulary creation
http://www.geneontology.org/
GO: hierarchical modelling of concepts
Gene Ontology:
1. Use biological terms to qualify microarray results
2. Use microarray results to extend a biological knowledge database
3. Exploit results for data mining
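The hierarchical model means that a gene annotated with a specific term is implicitly annotated with all of that term's ancestors. The sketch below illustrates this propagation on a toy hierarchy: the term names, gene names and parent links are invented for the example and are not real Gene Ontology entries.

```python
# Toy hierarchy: each term maps to its direct parents (invented terms).
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "transport": ["process"],
    "lipid metabolism": ["metabolism"],
    "sterol metabolism": ["lipid metabolism"],
    "ion transport": ["transport"],
    "iron transport": ["ion transport"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upwards."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def genes_per_term(gene_terms, parents=PARENTS):
    """Count genes per term, propagating each annotation to all ancestors."""
    counts = {}
    for terms in gene_terms.values():
        covered = set()
        for term in terms:
            covered.add(term)
            covered |= ancestors(term, parents)
        for term in covered:
            counts[term] = counts.get(term, 0) + 1
    return counts


# Invented annotations for three hypothetical genes.
GENE_TERMS = {
    "geneA": ["sterol metabolism"],
    "geneB": ["iron transport"],
    "geneC": ["lipid metabolism", "iron transport"],
}
term_counts = genes_per_term(GENE_TERMS)
```

Counting at every level of the hierarchy is what lets a list of differentially expressed genes be summarised by broad categories as well as by specific terms.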
DNA microarray bioinformatic analysis Mining expression data
Databases to help microarray data analysis
Many tools are available:
- YPD (Incyte) - http://www.proteome.com/ypdhome.html
- SGD (Stanford) - http://genome-www.stanford.edu/saccharomyces/
- Webminer (Walter Lab, UCSF) - http://webminer.ucsf.edu
- ExpressDB (Church Lab, Harvard) - http://arep.med.harvard.edu/cgi-bin/expressdbyeast/exdstart
But they need some improvement:
- They only allow queries for genes sharing a defined transcription profile
- They use few datasets
- They often lack ease of use and graphical analysis tools
Examples of expression data databases: Yeast Proteome Database (YPD) http://www.proteome.com/databases
Examples of expression data databases: Saccharomyces Genome Database (SGD) http://genome-www.stanford.edu/saccharomyces/
How to cross-compare expression data?
- Expression data mining for one gene
- Access to publication datasets
- Finds common profiles between experiments
- Compares gene expression between experiments
- Searches for coregulated genes
http://www.transcriptome.ens.fr/ymgv/
S. Le Crom et al. (2002) Nucleic Acids Research 30(1): 76-79
P. Marc et al. (2001) Nucleic Acids Research 29(13): E63-3
Allowing easy access to gene expression
- One profile per publication
- One histogram per condition (experiment)
- Data mining: gene name, gene ontology, variations
Key information retrieval on each publication
- Experimental condition description
- Publication overview
- Gene expression data distribution
Find orthologous gene expression data
Schizosaccharomyces pombe OXA1 orthologous gene:
Similar expression of orthologous genes during the same biological process or stress exposure can give interesting hints about the underlying regulation.
Keep in mind that this kind of relationship can occur by chance.
The PDR network example
Search for one gene (PDR1): Hughes et al. (2000) Cell 102: 109
Search for several genes (PDR1, PDR3): Sudarsanam et al. (2000) PNAS 97: 3364
=> Involvement of an ergosterol biosynthesis regulation pathway?
=> Chromatin factors modify PDR3 activity
Find correlations between experiments
- YFH1 deletion (YFH1 encodes a mitochondrial protein involved in iron binding and storage): Foury and Talibi (2001) J. Biol. Chem. 276: 7762
- Zinc deprivation: Lyons et al. (2000) PNAS 97: 7957
Search for genes induced more than 3-fold: stress response proteins, metal transporters
=> Zinc and iron regulation mechanisms are interconnected.
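The search described here amounts to applying a fold-change threshold in each experiment and intersecting the resulting gene lists. The sketch below shows this under stated assumptions: the gene names and ratios are invented for illustration and are not values from the two cited publications.

```python
def induced_genes(ratios, threshold=3.0):
    """Return the genes whose expression ratio exceeds the threshold."""
    return {gene for gene, ratio in ratios.items() if ratio > threshold}


# Invented expression ratios (treated / reference) for four hypothetical genes.
yfh1_deletion = {"geneA": 4.2, "geneB": 1.1, "geneC": 3.5, "geneD": 0.4}
zinc_deprivation = {"geneA": 6.0, "geneB": 3.8, "geneC": 5.1, "geneD": 0.9}

# Genes induced more than 3-fold in both conditions.
shared = induced_genes(yfh1_deletion) & induced_genes(zinc_deprivation)
```

The same set operations extend to more than two experiments, or to repressed genes by testing the ratio against the inverse of the threshold.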
and between organisms
Look for co-regulated genes
Apply a distance calculation to the expression profiles in a selected subset of experiments.
Display the list of genes most closely related to the query within the selected publication set.
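A search of this kind can be sketched as follows. The slide does not say which distance is used, so the example assumes a Pearson correlation distance (1 - r); the gene names and profile values are invented for illustration.

```python
import math


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # a flat profile is treated as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)


def closest_genes(query, profiles, experiments, top=2):
    """Rank the other genes by distance to the query over chosen experiments."""
    def select(profile):
        return [profile[i] for i in experiments]

    ranked = sorted(
        (pearson_distance(select(profiles[query]), select(profile)), gene)
        for gene, profile in profiles.items()
        if gene != query
    )
    return [gene for _, gene in ranked[:top]]


# Invented log-ratio profiles over four experiments.
PROFILES = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 2.0, 2.0, 4.0],
}
neighbours = closest_genes("geneA", PROFILES, experiments=[0, 1, 2, 3])
```

A correlation-based distance groups genes by the shape of their profile rather than by amplitude; a Euclidean distance would be the choice if amplitude should matter as well.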
ymgv statistics
Advantages:
- An intuitive and simple interface
- Quick answers
- Statistics to better understand the data
- Ready to add more organisms
Improvements:
- The database only contains the final ratio after the filtering steps
- The data were not re-normalised
- Dataset retrieval is not always available
Further proteome and transcriptome analyses
http://yeast.cellzome.com
KEGG: Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/
ENS transcriptome bioinformatics Stéphane LE CROM - Gaëlle LELANDAIS - Sophie LEMOINE - Laurent JOURDREN http://transcriptome.ens.fr