Databases, data mining and analysis pipelines Part 5: BioMarts

1 Databases, data mining and analysis pipelines Part 5: BioMarts
  Amel GHOUILA, PhD, LTCII, Institut Pasteur de Tunis (IPT), May

2 Plan
  1. Introduction
  2. BioMarts
  3. BioMart online
  4. biomaRt package
  5. Data mining tools

4 Introduction: Biological Data Management
  - Biological data management is a challenging task.
  - Biological concepts are complex and not always well defined.

5 Introduction: Biological Web Databases [figure]

8 BioMarts: What is a BioMart?
  - A BioMart is a data mart: an indexing and extraction system.
  - A data mart is a subset of a data warehouse.
  - It is needed because the information in the underlying database is not organized in a way that makes it easy for organizations to find what they need.

9 BioMarts: The BioMart project
  - A joint project between:
      - European Bioinformatics Institute (EBI)
      - Cold Spring Harbor Laboratory (CSHL)
  - The BioMart project was initiated to address many challenges in biological data management.
  - The BioMart software is based on two fundamental concepts:
      - data-agnostic modelling
      - data federation

10 BioMarts: The BioMart project
  - Data-agnostic modelling simplifies the difficult and time-consuming task of data modelling by using a predefined, query-optimized relational schema that can represent any kind of data.

11 BioMarts: The BioMart project
  - Data federation is the organization of multiple disparate and distributed database systems into what appears to be a single integrated virtual database.
  - It makes it possible to access and cross-reference data from many data sources through a single user interface, without the need to collate the data in one location.
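
The data federation idea on this slide can be illustrated with a small R sketch. It is only a toy model under stated assumptions: the two "sources" are in-memory data frames standing in for separate remote databases, and the names gene_source, annotation_source, query_source and federated_query, as well as the identifiers G1 to G4, are hypothetical and do not come from BioMart.

```r
# Two independent sources. In a real federation these would be remote
# databases; here they are toy data frames with made-up identifiers.
gene_source <- data.frame(
  gene_id = c("G1", "G2", "G3"),
  symbol  = c("AGT", "CDH1", "NOX1"),
  stringsAsFactors = FALSE
)
annotation_source <- data.frame(
  gene_id    = c("G1", "G3", "G4"),
  annotation = c("annotation_A", "annotation_B", "annotation_C"),
  stringsAsFactors = FALSE
)

# Each source is asked only for the requested identifiers.
query_source <- function(source, ids) {
  source[source$gene_id %in% ids, , drop = FALSE]
}

# Single entry point: query every source, then cross-reference the partial
# results on the shared key. Nothing is copied into a central store first.
federated_query <- function(ids) {
  genes       <- query_source(gene_source, ids)
  annotations <- query_source(annotation_source, ids)
  merge(genes, annotations, by = "gene_id", all.x = TRUE)
}

result <- federated_query(c("G1", "G2"))
print(result)
```

The caller sees one table, although the rows were assembled from two sources; genes without an entry in the second source simply get a missing value.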

12 BioMarts: Advantages of the BioMart project
  - Data mining or advanced search
  - Adapts data warehousing ideas to create a universal system for biological data management, and gives biologists the ability to create complex, customized datasets through a web interface
  - A new way of creating large multi-database repositories that avoids the need to store all the data in a single location
  - Unified access to disparate, geographically distributed data sources
  - Shows that large-scale projects involving NGS data can be managed efficiently in a distributed environment
  - Interactive
  - Several levels of query optimization to manage large data sets efficiently

13 BioMarts: BioMart idea [figure]

14 BioMarts: Building BioMart databases [figure]

15 BioMarts: Fixed schema transformation [figure]

16 BioMarts: BioMart architecture [figure]

18 BioMart online: Example: Ensembl MartView
  - http://
  - Ensembl > Tools > BioMart: no programming required!

19 BioMart online: Example: Ensembl MartView
  Different queries:
  - All the genes of a given species
  - Only the genes in one specific region of a chromosome
  - InterPro domains associated with a chromosome or with one region of a chromosome
  - Gene Ontology and expression vocabulary terms
  - Multi-species: orthologs and upstream regions
  - etc.

20 BioMart online: Example: Ensembl MartView
  Steps:
  1. Choose the dataset
  2. Filters: define the set of genes
  3. Determine the output columns
  4. Export the results

21 BioMart online: Examples of other databases with BioMart interfaces
  Many databases have adopted BioMart:
  - dbSNP (via Ensembl)
  - HapMap
  - SequenceMart: Ensembl genome sequences
  - WormBase
  - Reactome

22 BioMart online: BioMart user interfaces
  - MartView: web-based interface that can query all databases hosted by EBI's public BioMart server
  - MartExplorer
  - biomaRt: R/Bioconductor package

24 biomaRt package: biomaRt package
  - R interface to BioMart databases
  - Developed by Steffen Durinck (started Feb 2005)
  - Well suited for batch querying
  - Main sets of functions:
      - Functions adapted to Ensembl, with shortcuts for frequent queries: getGene, getGO, getOMIM
      - Generic queries, modelled after MQL (Mart Query Language), which can be used with any BioMart dataset

25 biomaRt package: Advantages
  - Retrieves large amounts of data from various sources
  - Works in a uniform way, without the need to know the underlying database schemas
  - Avoids writing complex SQL queries

26 biomaRt package: Communication protocols
  - Direct MySQL queries to BioMart database servers
  - HTTP queries to BioMart web services
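
For the HTTP route, a BioMart web service receives the query as an XML document that names a dataset, its filters and the attributes to return. The sketch below only builds such a document as a string and performs no network access. It is a minimal illustration, assuming the XML layout that MartView displays for a query; build_mart_query is a hypothetical helper, the dataset, filter and attribute names are examples, and the attributes on the Query element may differ between servers.

```r
# Build a BioMart-style XML query string. No request is sent.
# Assumption: values contain no characters that need XML escaping.
build_mart_query <- function(dataset, attributes, filters = list()) {
  # One <Filter> element per named filter; several values are comma-separated.
  filter_xml <- vapply(names(filters), function(name) {
    sprintf('<Filter name="%s" value="%s"/>',
            name, paste(filters[[name]], collapse = ","))
  }, character(1))
  # One <Attribute> element per requested output column.
  attribute_xml <- sprintf('<Attribute name="%s"/>', attributes)
  paste0(
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<!DOCTYPE Query>',
    '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">',
    sprintf('<Dataset name="%s" interface="default">', dataset),
    paste(filter_xml, collapse = ""),
    paste(attribute_xml, collapse = ""),
    '</Dataset></Query>'
  )
}

query <- build_mart_query(
  dataset    = "hsapiens_gene_ensembl",
  attributes = c("ensembl_gene_id", "hgnc_symbol"),
  filters    = list(chromosome_name = "21")
)
cat(query, "\n")
```

The three parts of the document mirror the MartView steps: choose a dataset, restrict it with filters, and list the output columns.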

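27 biomaRt package: Getting started with biomaRt
  Install, from within an R session:
  > source("http://bioconductor.org/biocLite.R")
  > biocLite("biomaRt")
  Loading required package: XML
  On current Bioconductor releases, biocLite has been replaced by BiocManager:
  > install.packages("BiocManager")
  > BiocManager::install("biomaRt")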

28 biomaRt package: Getting started with biomaRt
  > library(biomaRt)
  > listMarts()    # lists all available BioMart databases

29 biomaRt package: Getting started with biomaRt
  Selection of a database:
  - useMart() selects a specific BioMart database to use
  - Use the Ensembl BioMart database:
    > ensembl = useMart("ensembl")
  - The database chosen must be a valid name returned by listMarts()
  - Check which datasets are available in the selected BioMart:
    > listDatasets(ensembl)

30 biomaRt package: Mining Ensembl data
  > ens <- useMart("ensembl")
  Then choose a dataset to use:
  > hsap <- useDataset("hsapiens_gene_ensembl", mart = ens)
  or, in a single step:
  > hsap = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

31 biomaRt package: Mining Ensembl data
  getGene function:
  - Queries the database for gene information
  - Accepts many forms of gene identifier: Entrez, HUGO, Affy transcript
  - Returns: gene symbol, description, chromosome name, band, start position, end position, BioMart ID
  Example:
  > getGene(id = 100, type = "entrezgene", mart = hsap)

32 biomaRt package: Mining Ensembl data
  getBM function:
  - More general than getGene
  - Takes a list of filters for selecting genes or SNPs, and the attributes to return from the database
  Syntax:
  > getBM(attributes, filters = "", values = "", mart, list.names = NULL, checkFilters = TRUE, uniqueRows = TRUE)

33 biomaRt package: Mining Ensembl data
  Main arguments of getBM:
  - attributes: the attributes you want to retrieve. The listAttributes function shows all attributes available for a given dataset: listAttributes(hsap)
  - filters: one or more filters that define a restriction on the query. The listFilters function shows all filters available for a given dataset: listFilters(hsap)
  - values: a vector of values for the filters
  - mart: the mart object, with a dataset selected (here hsap)

34 biomaRt package: Mining Ensembl data
  Example: retrieve the GO annotation for the following Illumina human WG6 v2 identifiers: ILMN…, ILMN…
  > illuminaIDs = c("ILMN…", "ILMN…")    # values for the filter
  > goAnnot = getBM(c("illumina_humanwg_6_v2", "go_id"), filters = "illumina_humanwg_6_v2", values = illuminaIDs, mart = hsap)

35 biomaRt package: Other functions
  - getGO: GO ID, GO term
  - getOMIM (Online Mendelian Inheritance in Man, a catalogue of human genes and genetic disorders): OMIM ID, disease, BioMart ID
  - getINTERPRO (InterPro is a meta-database gathering protein domain information): InterPro ID, description
  - getSequence: retrieves a sequence
  - getSNP
  - getHomolog

36 biomaRt package: Retrieve sequences
  Sequence types available in Ensembl:
  - Exon
  - Coding sequence
  - Protein sequences
  - 3' UTR, 5' UTR

37 biomaRt package: Retrieve sequences
  Arguments of the getSequence function:
  - id: identifier
  - type: type of identifier used: hgnc_symbol, affy_hg_u133_plus_2, etc.
  - seqType: type of sequence to retrieve
  Example:
  > agt <- getSequence(id = "AGT", type = "hgnc_symbol", seqType = "peptide", mart = hsap)
  Retrieve all exons of CDH1:
  > seq <- getSequence(id = "CDH1", type = "hgnc_symbol", seqType = "gene_exon", mart = hsap)

38 biomaRt package: Combination of marts and homology detection
  getLDS function:
  - Combines two data marts
  - Useful for detecting homologs

39 biomaRt package: Combination of marts and homology detection
  Example: find the mouse equivalents of a particular Affy transcript, or of the NOX1 gene
  > human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  > mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  > getLDS(attributes = c("hgnc_symbol", "chromosome_name", "start_position"), filters = "hgnc_symbol", values = "NOX1", mart = human, attributesL = c("chromosome_name", "start_position", "external_gene_id"), martL = mouse)

40 biomaRt package: Useful links
  Bioinformatics resources:
  - OBRC (Online Bioinformatics Resources Collection): http://
  - Biostar: a high-quality question-and-answer web site
  - SEQanswers: a discussion and information site for next-generation sequencing
  - http://omictools.com/: an informative directory for multi-omic data analysis
  - Rosalind (http://rosalind.info/): a platform for learning bioinformatics through problem solving
  - http://: Guide to Selected Bioinformatics Internet Resources

43 Data mining tools: Tools needed for analysis
  - Make sense of all the data generated
  - Huge amounts of biological data are available
  - Analyse the data to extract new knowledge: data mining
  - Visualization tools

44 Data mining tools: Data Mining
  What is data mining?
  - The extraction of knowledge from large amounts of data (Han and Kamber, 2006)
  - The automatic process of discovering patterns in data
  - The patterns discovered must be meaningful

45 Data mining tools: Data Mining [figure]

46 Data mining tools: Data mining techniques
  - Various machine learning techniques
  - A data selection and cleaning process
  - Used to deal with different biological data, in order to discover new knowledge that can be translated into clinical applications
  - Handling noisy and incomplete data and integrating various data sources are new challenges faced by biologists in the post-genome era

47 Data mining tools: Data mining techniques [figure]

48 Data mining tools: Supervised learning
  Definition:
  - Inferring a function from labeled training data
  - A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used to map new examples
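
The definition above can be made concrete with a very small classifier in base R. This is a sketch of one simple supervised method, a nearest-centroid rule, chosen for brevity rather than taken from the slides; the measurements, the labels "control" and "case", and the function name predict_label are made up for illustration.

```r
# Labeled training data: two hypothetical measurements per sample.
train <- data.frame(
  x1    = c(1.0, 1.2, 0.8, 4.0, 4.2, 3.8),
  x2    = c(1.1, 0.9, 1.0, 4.1, 3.9, 4.0),
  label = c("control", "control", "control", "case", "case", "case"),
  stringsAsFactors = FALSE
)

# "Training": infer one centroid (mean point) per class from the labels.
centroids <- aggregate(cbind(x1, x2) ~ label, data = train, FUN = mean)

# The inferred function: map a new example to the label of the
# closest centroid (Euclidean distance).
predict_label <- function(x1, x2) {
  d <- sqrt((centroids$x1 - x1)^2 + (centroids$x2 - x2)^2)
  as.character(centroids$label[which.min(d)])
}

print(predict_label(0.9, 1.0))
print(predict_label(4.1, 4.0))
```

The labels drive the whole procedure: without the label column there would be nothing to compute the centroids from.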

49 Data mining tools: Supervised learning [figure]

50 Data mining tools: Unsupervised learning
  Definition:
  - Trying to find hidden structure in unlabeled data
  - Many approaches to unsupervised learning: k-means, neural networks, hierarchical clustering, hidden Markov models, etc.
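
As an illustration of one of the approaches listed, the sketch below runs hierarchical clustering with the base R functions dist, hclust and cutree. The data points are made up, and the choice of average linkage and of two clusters is an assumption for the example only.

```r
# Unlabeled data: six hypothetical samples, two measurements each.
samples <- matrix(
  c(1.0, 1.1,
    1.2, 0.9,
    0.8, 1.0,
    4.0, 4.1,
    4.2, 3.9,
    3.8, 4.0),
  ncol = 2, byrow = TRUE
)

# Build the tree from pairwise Euclidean distances (average linkage).
tree <- hclust(dist(samples), method = "average")

# Cut the tree into two groups; no labels were supplied at any point.
clusters <- cutree(tree, k = 2)
print(clusters)
```

The first three samples end up in one group and the last three in the other, a structure the algorithm recovers from the distances alone.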

51 Data mining tools: Unsupervised learning [figure]

52 Data mining tools: Unsupervised learning vs Supervised learning [figure]

53 Data mining tools: Data mining tools
  - Weka
      - A collection of machine learning algorithms for data mining tasks
      - Contains tools for data pre-processing, classification, regression, clustering, association rules and visualization
      - http://
  - R packages
      - Various packages: decision trees, k-means, hierarchical clustering, visualization packages, etc.
      - Listing of R packages for data mining: http://

54 Data mining tools: Data mining tools [figure]

55 Data mining tools: You did it!
