GnpIS: an information system for plant breeding

Similar documents
-> Integration of MAPHiTS in Galaxy

Towards the construction of an integrated Wheat Information System

HETEROGENEOUS DATA INTEGRATION FOR CLINICAL DECISION SUPPORT SYSTEM. Aniket Bochare - aniketb1@umbc.edu. CMSC Presentation

Simplifying Data Interpretation with Nexus Copy Number

i2b2 Clinical Research Chart

i2b2 Clinical Research Chart

Data Grids. Lidan Wang April 5, 2007

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Processing Genome Data using Scalable Database Technology. My Background

An Introduction to Genomics and SAS Scientific Discovery Solutions

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

A Primer of Genome Science THIRD

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Marker-Assisted Backcrossing. Marker-Assisted Selection. 1. Select donor alleles at markers flanking target gene. Losing the target allele

Databases. DSIC. Academic Year

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Customer Bank Account Management System Technical Specification Document

INRA's Big Data perspectives and implementation challenges. Pascal Neveu UMR MISTEA INRA - Montpellier

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Bioinformatics Resources at a Glance

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Research Roadmap for the Future. National Grape and Wine Initiative March 2013

IMPLEMENTATION OF DATA WAREHOUSE SAP BW IN THE PRODUCTION COMPANY. Maria Kowal, Galina Setlak

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B

DevOps Ignition to reach Galaxy continuous integration

Step by Step Guide to Importing Genetic Data into JMP Genomics

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Using Web Services for Customised Data Entry

PLANT BREEDING: CAN METABOLOMICS HELP?

An EVIDENCE-ENHANCED HEALTHCARE ECOSYSTEM for Cancer: I/T perspectives

GOBII. Genomic & Open-source Breeding Informatics Initiative

CCR Biology - Chapter 9 Practice Test - Summer 2012

Mitochondrial DNA Analysis

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Clinical Research Infrastructure

Data Management for Biobanks

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

European Medicines Agency

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

CORBA and Life Sciences

Fast. Integrated Genome Browser & DAS. Easy. Flexible. Free. bioviz.org/igb

Worksheet - COMPARATIVE MAPPING 1

GRIN-Global Project. the global plant genebank information management system

Data deluge (and it s applications) Gianluigi Zanetti. Data deluge. (and its applications) Gianluigi Zanetti

GenBank, Entrez, & FASTA

Human Genome Organization: An Update. Genome Organization: An Update

HOW WILL BIG DATA AFFECT RADIOLOGY (RESEARCH / ANALYTICS)? Ronald Arenson, MD

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Intro to Bioinformatics

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

GENOMIC SELECTION: THE FUTURE OF MARKER ASSISTED SELECTION AND ANIMAL BREEDING

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

ISM 318: Database Systems. Objectives. Database. Dr. Hamid R. Nemati

An introduction to bioinformatic tools for population genomic and metagenetic data analysis, 2.5 higher education credits Third Cycle

Unifying the Global Data Space using DDS and SQL

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

The growing challenges of big data in the agricultural and ecological sciences

AgroPortal. a proposition for ontologybased services in the agronomic domain

Efficiently Identifying Inclusion Dependencies in RDBMS

GEOG 482/582 : GIS Data Management. Lesson 10: Enterprise GIS Data Management Strategies GEOG 482/582 / My Course / University of Washington

A Strategy for Plant Breeding Data Management in International Agricultural Research

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

The i2b2 Hive and the Clinical Research Chart

AN INTEGRATION APPROACH FOR THE STATISTICAL INFORMATION SYSTEM OF ISTAT USING SDMX STANDARDS

Module 1. Sequence Formats and Retrieval. Charles Steward

Introductory genetics for veterinary students

Delivering the power of the world s most successful genomics platform

Introduction course leader and module leaders

BUILDING OLAP TOOLS OVER LARGE DATABASES

Human Genome and Human Genome Project. Louxin Zhang

Analysis of ChIP-seq data in Galaxy

Analysis and Integration of Big Data from Next-Generation Genomics, Epigenomics, and Transcriptomics

Alison Yao, Ph.D. July 2014

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company

Database schema documentation for SNPdbe

Data Warehouse and Hive. Presented By: Shalva Gelenidze Supervisor: Nodar Momtselidze

Transcription:

GnpIS: an information system for plant breeding 21th october 2010 Thematic day on Integrative genomics - Nantes Hadi Quesneville The URGI unit A G R I C U L T U R E A L I M E N T A T I O N E N V I R O N N E M E N T

URGI: Unité de Recherche en Génomique-Info Research Unit INRA unit (French National Institute for Agricultural Research) Plant breeding and Genetics Department Strong connexions with other plant INRA departments Host a Bioinformatic platform IBISA Grade Member of the French National Network of Bioinformatic Platforms (ReNaBi) Research Data integration Functional and evolutional genome dynamics The URGI unit 18/11/2010 Hadi Quesneville 2

Databases Central role for data analysis Repository Navigation Link heterogeneous informations Experiments: Transcriptomes, EST, SNP, chip Genome annotations: genes, repeats Genetic informations: markers, recombination rates Lineages: lines, populations, species Performance issues Adapted schema for query 18/11/2010 Hadi Quesneville 3

Genome data integration The URGI unit 18/11/2010 Hadi Quesneville 4

Data integration Data integration involves combining data residing in different sources and providing users with a unified view of these data. (from wikipedia) The URGI unit 18/11/2010 Hadi Quesneville 5

Genome base data integration Natural way to integrate «omics» data Map data to genome sequence Compare genome coordinates Identifiy relationships Compute correlations Few generic operations needed Generic unary and binary operators Density, counts Statistics Visualisations 2 nd Crossomics meeting 18/11/2010 Hadi Quesneville 6

Unary operators (1 source) range merge 18/11/2010 Hadi Quesneville 7

Binary operators (2 sources) overlap add 18/11/2010 Hadi Quesneville 8

Binary operators (2 sources) diff merge 18/11/2010 Hadi Quesneville 9

Inspired from R-tree (Guttman 1984) Hierarchical set of bin Adjacent bin at a level have adjacent ID Bin from different level do not overlap Coordinate indexes Assign segment to smallest bin that contain it Bins simple_feature 1000.0 1Mb simple_feature_id int 100.0 100kb seq_region_id int 10.0 10.1 10kb bin float 1.000000 1. 000001 1. 000002 1. 000003 1.000004 1.000005 1. 000006 1. 000007 1. 000008 1. 000009 1. 000010 1. 000011 1. 000012 1. 000013 1. 000014 1. 000015 1kb seq_region_start int seq_region_end int Chromosome seq_region_strand analysis_id tinyint int score double 18/11/2010 Hadi Quesneville 10

Coordinate indexes 1000.0 100.0 10.0 10.1 1.000000 1. 000001 1. 000002 1. 000003 1.000004 1.000005 1. 000006 1. 000007 1. 000008 1. 000009 1. 000010 1. 000011 1. 000012 1. 000013 1. 000014 1. 000015 1Mb 100kb 10kb 1kb Bins Chromosome 12010000 12020000 simple_feature Select * from simple_feature where seq_region_id=1 AND ( bin = 1000.000012 OR bin = 100.000120 Bin level ( x10 l ) Bin number ( / 10 l ) OR bin between 10.001201 and 10.001202 OR bin between 1.012010 and 1.012020 ) simple_feature_id seq_region_id bin seq_region_start seq_region_end seq_region_strand analysis_id score int int float int int tinyint int double AND seq_region_start<= 12010000 AND seq_region_end >= 12020000 18/11/2010 Hadi Quesneville 11

S-MART The URGI unit Hadi Quesneville

Database data integration The URGI unit 18/11/2010 Hadi Quesneville 13

Architectures Data integration is defined as a triple <G,S,M> where: G is the global (or mediated) schema, S is the heterogeneous set of source schemas, M is the mapping that maps queries between the source and the global schemas. (from wikipedia) Two architectures Data Warehouse Virtual Database 2 nd Crossomics meeting 18/11/2010 Hadi Quesneville 14

GnpMap GnpIS data warehouse EST, mrna Maps DNA Polymorphismes Genome annotation GnpGenomeAster SIReGal Genetic collections Transcriptome Proteome GnpProt Phenotype evaluations (P=GxE) The URGI unit 18/11/2010 Hadi Quesneville 15

Data consistency GnpGenome foreign keys GnpIS Xref GnpSNP A foreign key is a relationship or link between two tables which ensures that the data stored in a database is consistent. GnpArray GnpProt Deleting a record that contains a value referred to by a foreign key in another table would break referential integrity. GnpMap SIReGal Some relational database management systems (RDBMS) can enforce referential integrity GnpSeq Aster Core module Ephesis by deleting the foreign key rows as well to maintain integrity, by returning an error and not performing the delete. Architecturally, this offers a tightly coupled approach because the data reside together in a single repository at query-time The URGI unit 18/11/2010 Hadi Quesneville 16

Interoperability A property referring to the ability of diverse systems to work together (inter-operate) capability of different programs to exchange data via a common set of exchange formats, to read and write the same file formats, and to use the same protocols. (loose) links between RDBMS Xrefs Web services Tools over the data warehouse Quick search Biomart Galaxy (from wikipedia) 2 nd Crossomics meeting 18/11/2010 Hadi Quesneville 17

GnpIS interoperability Submission Queries Pipelines DB Data Mart Web Interfaces submission Complex queries Excel Files Simple queries The URGI unit 18/11/2010 Hadi Quesneville 18

Quick search The URGI unit 18/11/2010 Hadi Quesneville 19

Quick search results The URGI unit 18/11/2010 Hadi Quesneville 20

Biomart: advanced search The URGI unit 18/11/2010 Hadi Quesneville 21

Get QTL by theme, trait, QTL name, markers. Hadi Quesneville

Attributes for results Hadi Quesneville

Results and links for details Delphine Steinbach Hadi Quesneville

Interoperability (QTL) with GnpMap GnpGenome Poplar GnpMap URGI Démo GnpIS Dijon Hadi Quesneville 20/05/10 25

( markers QTL mapped on GnpGenome (via URGI Démo GnpIS Dijon Hadi Quesneville 20/05/10 26

QTLs found in GnpMap URGI Démo GnpIS Dijon Hadi Quesneville 20/05/10 27

A data integration workbench The URGI unit 18/11/2010 Hadi Quesneville 28

Galaxy URGI Démo GnpIS Dijon Hadi Quesneville 20/05/10 29

Get data from Biomart URGI Démo GnpIS Dijon Hadi Quesneville 20/05/10 30

Text manipulation Exemple: URGI Démo GnpIS Dijon Hadi Quesneville 20/05/10 31

Other manipulations URGI Démo GnpIS Dijon Hadi Quesneville 20/05/10 32

Galaxy Workflow The URGI unit 18/11/2010 Hadi Quesneville 33

URGI - Data Integration Workbench GnpIS External DB Data Mart Data Mart Data banks User files Galaxy Developer Pipelines User The URGI unit 18/11/2010 Hadi Quesneville 34

Acknowlegments URGI M. Alaux, F. Alfama, J. Amselem, N. Choisne, S. Durand, O. Inizan, V. Jamilloux, A. Keliet, E. Kimmel, N. Lapalu, I. Luyten, N. Mohellibi, C. Pommier, H. Quesneville, S. Reboux, D. Steinbach, M. Zytnicki S. Arnoux (CDD), M. Bras (CDD), B. Brault (CDD), L. Brigitte (CDD), T. Flutre (PhD), C. Hoede (CDD), H. Mors (Master2) E. Permal (Post-doc), D. Verdelet (CDD), D.Valdenaire (CDD) left URGI B. Hilseberger (CDD) The URGI unit 09/11/09 Hadi Quesneville

Partners Wheat genomics: P. Leroy, C. Ravel, E. Paux, C. Feuillet Grape genomics A.F. Adam-Blondon, M. Moroldo SNP thematic D. Brunel, F. Granier, H. McKhann C. De Poittevin Genetic resources J.M. Prosperi, P. Roumet Fungi genomics M.H. Lebrun, T. Rouxel Maize genomics M. Falque, J. Joets, A. Charcosset Colot s Lab V. Colot, I. Ahmed, A. Sarazin Tree genomics C.Plomion, C. Poittevin The URGI unit 18/11/2010 Hadi Quesneville 36