Next Generation Sequencing Data Visualization




Next Generation Sequencing Data Visualization: GBrowse2 from GMOD
Andreas Gisel
Institute for Biomedical Technologies, CNR, Bari, Italy

GMOD is the Generic Model Organism Database project. It is a collection of interconnected applications and databases that biologists use as repositories and as tools. That connectivity is the key point: there is no lack of tools, but many of them are little used because the typical prospective user may not have the resources or expertise required to install a tool and connect it to the data in hand.
http://www.gmod.org/

The tutorial will show:
- how to display a reference sequence with features in GBrowse2
- how to display Next Generation Sequencing (NGS) mapping data

We will use Escherichia coli str. K-12 substr. DH10B, complete genome, NC_010473.1 (ref1.fa), and Illumina paired-end reads (reads1.fa and reads2.fa).

GBrowse2 http://localhost/cgi-bin/gb2/gbrowse/yeast_advanced/


GBrowse2 configuration directory: /etc/gbrowse2/

GBrowse2 database directory: /var/lib/gbrowse2/databases


GBrowse2: Set up the browser for E. coli data

We have:
- Reference sequence data in FASTA
- Annotation for the reference sequence in GFF

We need to:
- Create an E. coli configuration file
- Add the E. coli information to the general GBrowse.conf file
- Add the data to the corresponding databases

Annotation Data Formats
- EMBL annotation files
- GenBank annotation files
- GFF files

GenBank format

LOCUS       HQ336405              157790 bp    DNA     circular PLN 22-DEC-2010
DEFINITION  Prunus persica chloroplast, complete genome.
ACCESSION   HQ336405
VERSION     HQ336405.1  GI:309321413
KEYWORDS    .
SOURCE      chloroplast Prunus persica (peach)
  ORGANISM  Prunus persica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; fabids; Rosales; Rosaceae; Maloideae; Amygdaleae; Prunus.
REFERENCE   1  (bases 1 to 157790)
  AUTHORS   Jansen,R.K., Saski,C., Lee,S.B., Hansen,A.K. and Daniell,H.
  TITLE     Complete Plastid Genome Sequences of Three Rosids (Castanea,
            Prunus, Theobroma): Evidence for At Least Two Independent
            Transfers of rpl22 to the Nucleus
  JOURNAL   Mol. Biol. Evol. 28 (1), 835-847 (2011)
   PUBMED   20935065
REFERENCE   2  (bases 1 to 157790)
  AUTHORS   Jansen,R.K., Saski,C., Lee,S.-B., Hansen,A.K. and Daniell,H.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-SEP-2010) Integrative Biology, University of Texas
            at Austin, 1 University Station C0930, Austin, TX 78712, USA

GenBank format

FEATURES             Location/Qualifiers
     source          1..157790
                     /organism="Prunus persica"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:3760"
     gene            complement(join(99804..100597,71346..71459))
                     /gene="rps12"
                     /trans_splicing
     CDS             complement(join(99804..99829,100366..100597,71346..71459))
                     /gene="rps12"
                     /trans_splicing
                     /codon_start=1
                     /transl_table=11
                     /product="ribosomal protein S12"
                     /protein_id="ADO64999.1"
                     /db_xref="GI:309321458"
                     /translation="MPTIKQLIRNTRQPIRNVTKSPALGGCPQRRGTCTRVYTITPKK
                     PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
                     DAVGVKDRQQGRSKYGVKKPK"
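The location strings in the feature table can be unpacked programmatically. The following is a minimal Python sketch, not part of the original slides: `parse_location` is a hypothetical helper that assumes only the simple forms shown here (`start..end`, optionally wrapped in `join(...)` and `complement(...)`) and does not handle partial locations, single positions, or nested operators.

```python
import re


def parse_location(location):
    """Split a simple GenBank location string into (start, end, strand) segments."""
    strand = "+"
    text = location.strip()
    # complement(...) marks a feature on the reverse strand.
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    # join(...) lists the segments that together form one feature.
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = []
    for part in text.split(","):
        match = re.fullmatch(r"(\d+)\.\.(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append((int(match.group(1)), int(match.group(2)), strand))
    return segments


# The rps12 CDS from the feature table: three segments on the reverse strand.
rps12_cds = parse_location(
    "complement(join(99804..99829,100366..100597,71346..71459))"
)
source = parse_location("1..157790")
```

Each segment of the rps12 CDS corresponds to one CDS row in the GFF version of the same record.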

GenBank format

ORIGIN
        1 tgggcgaacg acgggaattg aacccgcgca tggtggattc acaatccact gccttgatcc
       61 acttggctac atccgcccct tatactatta caaatattta caccatttat cattacttgt
      121 aagataaaat acaacataaa ataaactgaa acttttaata ttttaattaa attttgtagt
      181 aaattaacta aaaaaaaata tagaacaaaa caatatagta aagttaagta gtaaataaaa
      241 aaaatactaa atagtaaagg agcaataaca aacctcttga tataacaaga aatttattat
      301 tgctccttta ctttcaagaa ctcctatata ctaagaccaa agtcttatcc atttatagat
      361 ggaacttcaa cagcagctag atctagaggg aaattatggg cattacgttc atgcataact
      421 tccataccaa ggttagcgcg gttaataata tcagcccaag tattaattac acgaccctga
      481 ctatcaacta cagattgatt gaaattaaaa ccatttaagt tgaaagccat agtgctgata
      541 cctaaagcgg taaaccagat acctactaca ggccaagcag ctaggaagaa atgtaaagaa
      601 cgagaattgt tgaaactagc atattggaag atcaatcggc caaaataacc atgagcggct
      661 acgatattat aggtttcttc ctcttgaccg aatctgtaac cttcattagc agattcattt
      721 tctgtggttt ccctgatcaa actagaggtt accaaggacc catgcatagc actgaatagg
      781 gagccgccga atacaccagc tacgcctaac atgtgaaatg ggtgcataag gatgttgtgc
      841 tcggcttgga atacaatcat gaagttgaaa gtaccggaga ttcctagggg cataccgtca
      901 gaaaagcttc cttgaccaat tggatatatc aagaaaacag cagtagcagc tgcaacagga
      961 gctgaatatg caacagcaat ccaagggcgc atacccagac ggaaactaag ttcccactca
     1021 cgacccatgt agcaagctac accaagtaag aagtgtagaa caattagttc ataaggacca
     1081 ccgttgtata accattcatc aacggaagcc gcttcccata tcgggtaaaa gtgcaaacct
     1141 atagctgcag aggtaggaat aatggcacca gaaataatat tgtttccata aagtaaagat
     1201 ccagaaacag gttcacgaat accatcaata tctactggag gtgcagcaat gaaagcaata

GFF format

##gff-version 3
# sequence-region HQ336405 1 157790
# conversion-by bp_genbank2gff3.pl
# organism Prunus persica
# date 22-DEC-2010
# Note Prunus persica chloroplast, complete genome.
# working on region:HQ336405, Prunus persica, 22-DEC-2010, Prunus persica chloroplast, complete genome.
# Possible gene unflattening error with HQ336405: consult STDERR
HQ336405	GenBank	region	1	157790	.	+	.	ID=HQ336405;Dbxref=taxon:3760;Note=Prunus persica chloroplast complete genome;date=22-dec-2010;mol_type=genomic DNA;organelle=plastid:chloroplast;organism=Prunus persica
HQ336405	GenBank	CDS	99804	99829	.	-	.	ID=rps12;Dbxref=GI:309321458;codon_start=1;gene=rps12;product=ribosomal protein S12;protein_id=ADO64999.1;trans_splicing=_no_value;transl_table=11;translation=length.123
HQ336405	GenBank	CDS	100366	100597	.	-	.	ID=rps12;Dbxref=GI:309321458;codon_start=1;gene=rps12;product=ribosomal protein S12;protein_id=ADO64999.1;trans_splicing=_no_value;transl_table=11;translation=length.123
HQ336405	GenBank	CDS	71346	71459	.	-	.	ID=rps12;Dbxref=GI:309321458;codon_start=1;gene=rps12;product=ribosomal protein S12;protein_id=ADO64999.1;trans_splicing=_no_value;transl_table=11;translation=length.123
HQ336405	GenBank	gene	99804	100597	.	-	.	ID=rps12;gene=rps12;trans_splicing=_no_value
HQ336405	GenBank	gene	71346	71459	.	-	.	ID=rps12;gene=rps12;trans_splicing=_no_value
HQ336405	GenBank	gene	3	77	.	-	.	ID=trnH-GUG;gene=trnH-GUG
HQ336405	GenBank	tRNA	3	77	.	-	.	ID=trnH-GUG.r01;Parent=trnH-GUG;Note=anticodon:GUG;gene=trnH-GUG;product=tRNA-His
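A GFF3 data line has nine tab-separated columns: sequence ID, source, type, start, end, score, strand, phase, and a semicolon-separated list of `key=value` attributes. As a rough illustration only (the names `GffFeature` and `parse_gff3_line` are hypothetical, and the sketch skips GFF3 details such as URL-escaped characters and multi-valued attributes), a line could be read like this in Python:

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqid source type start end score strand phase attributes",
)


def parse_gff3_line(line):
    """Return a GffFeature for a data line, or None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    columns = line.split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    # Column 9 holds key=value pairs separated by semicolons.
    attributes = {}
    for pair in columns[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return GffFeature(
        columns[0], columns[1], columns[2],
        int(columns[3]), int(columns[4]),
        columns[5], columns[6], columns[7],
        attributes,
    )


example = "HQ336405\tGenBank\tgene\t3\t77\t.\t-\t.\tID=trnH-GUG;gene=trnH-GUG"
feature = parse_gff3_line(example)
```

Here `feature.type` is `gene`, the coordinates are integers, and `feature.attributes["ID"]` gives the identifier that child features refer to through `Parent`.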

GBrowse2: Create and configure the tracks visualizing the E. coli data

[GENERAL]
description   = Escherichia coli str. K-12 substr. DH10B
database      = annotations

initial landmark = NC_010473.1:1..10000

# bring in the special Submitter plugin for the rubber-band select menu
plugins       = FastaDumper RestrictionAnnotator SequenceDumper TrackDumper Submitter S
autocomplete  = 1

default tracks = Genes ORFs trnas CDS Transp Centro:overview GC:region

# examples to show in the introduction
examples = NC_010473.1:3170000..3180000

# "automatic" classes to try when an unqualified identifier is given
automatic classes = Symbol Gene Clone

#################################
# database definitions
#################################

[scaffolds:database]
db_adaptor     = Bio::DB::SeqFeature::Store
db_args        = -adaptor memory
                 -dir /var/lib/gbrowse2/databases/ecoli_seq
search options = default +autocomplete

[annotations:database]
db_adaptor     = Bio::DB::SeqFeature::Store
db_args        = -adaptor memory
                 -dir /var/lib/gbrowse2/databases/ecoli_annotations
search options = default +autocomplete

# Default glyph settings
[TRACK DEFAULTS]
glyph         = generic
database      = annotations
height        = 8
bgcolor       = cyan
fgcolor       = black
label density = 25
bump density  = 100
show summary  = 99999  # go into summary mode when zoomed out to 100k
# default pop-up balloon
balloon hover = <b>$name</b> is a $type spanning $ref from $start to $end. Click for more details.

[CDS]
feature       = gene
glyph         = cds
description   = 0
height        = 26
sixframe      = 1
label         = sub {shift->name . " reading frame"}
key           = CDS
balloon click width = 500
balloon hover width = 350
balloon hover = <b>$name</b> is a $type spanning $ref from $start to $end. Click to search Google for $name.
balloon click = http://www.google.com/search?q=$name
citation      = This track shows CDS reading frames.

GBrowse2: Insert E. coli in the general GBrowse config file

##############################################################################
#
# DATASOURCE DEFINITIONS
# One stanza for each configured data source
#
##############################################################################

[yeast]
description = Yeast chromosomes 1+2 (basic)
path        = yeast_simple.conf

[yeast_advanced]
description = Yeast chromosomes 1+2 (advanced)
path        = yeast_chr1+2.conf

[ecoli]
description = Escherichia coli str. K-12 substr. DH10B
path        = ecoli.conf
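Each data source is a `[name]` stanza whose `path` points to its own configuration file, and the stanza name is what appears in the browser URL (for example `.../gbrowse/ecoli/`). A small Python sketch for listing the stanzas and spotting one without a `path` is shown below. `list_datasources` is a hypothetical helper, not a GBrowse tool, and it deliberately ignores continuation lines and other details of the real configuration syntax.

```python
def list_datasources(conf_text):
    """Map each [stanza] name to a dict of its key/value settings."""
    sources = {}
    current = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sources[current] = {}
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            sources[current][key.strip()] = value.strip()
    return sources


# Excerpt of the data source section shown in the tutorial.
conf = """
[yeast]
description = Yeast chromosomes 1+2 (basic)
path        = yeast_simple.conf

[ecoli]
description = Escherichia coli str. K-12 substr. DH10B
path        = ecoli.conf
"""

sources = list_datasources(conf)
missing_path = [name for name, settings in sources.items() if "path" not in settings]
```

An empty `missing_path` list means every stanza in the excerpt names a configuration file.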

GBrowse2: Add data to databases

#################################
# database definitions
#################################

[scaffolds:database]
db_adaptor     = Bio::DB::SeqFeature::Store
db_args        = -adaptor memory
                 -dir /var/lib/gbrowse2/databases/ecoli_seq
search options = default +autocomplete

[annotations:database]
db_adaptor     = Bio::DB::SeqFeature::Store
db_args        = -adaptor memory
                 -dir /var/lib/gbrowse2/databases/ecoli_annotations
search options = default +autocomplete

Create:
- /var/lib/gbrowse2/databases/ecoli_seq
- /var/lib/gbrowse2/databases/ecoli_annotations

Move to:

/var/lib/gbrowse2/databases/ecoli_seq
- ref1.fa
- chromosomes.gff3

/var/lib/gbrowse2/databases/ecoli_annotations
- NC_010473.gff
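The directory layout can also be created from a script. The following Python sketch is only an illustration: `stage_files` and `ECOLI_LAYOUT` are hypothetical names, the files are copied rather than moved, and the source files are assumed to be in the current working directory.

```python
import shutil
from pathlib import Path


def stage_files(database_root, layout):
    """Create each database directory and copy its files into it."""
    staged = []
    for directory, files in layout.items():
        target = Path(database_root) / directory
        target.mkdir(parents=True, exist_ok=True)
        for source in files:
            # shutil.copy returns the path of the copied file.
            staged.append(Path(shutil.copy(source, target)))
    return staged


# Layout from the tutorial: directory name -> files placed in it.
ECOLI_LAYOUT = {
    "ecoli_seq": ["ref1.fa", "chromosomes.gff3"],
    "ecoli_annotations": ["NC_010473.gff"],
}

# Example call (needs write access to the GBrowse database directory):
# stage_files("/var/lib/gbrowse2/databases", ECOLI_LAYOUT)
```

The call is left commented out because writing below /var/lib usually requires elevated privileges.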

GBrowse2 http://localhost/cgi-bin/gb2/gbrowse/ecoli/

Create mapping data

Map with bowtie:
- index: bowtie-build -f ref1.fa ref1
- map: bowtie -n 1 -l 30 -I 0 -X 400 --un unmapped -p 2 -S ref/ref1 -1 illumina/reads1.fq -2 illumina/reads2.fq > pair.sam

SAM to BAM:
- index reference: samtools faidx ref1.fa
- SAM to BAM: samtools import ref1.fa.fai pair.sam pair.bam
- sort BAM: samtools sort pair.bam pair_sorted
- index BAM: samtools index pair_sorted.bam
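The mapping steps depend on each other, so it can help to script them so that they run in order and stop at the first failure. The Python sketch below is one possible way to do that and rests on several assumptions: `mapping_commands` and `run_all` are hypothetical helpers, a single `index` name is used for both `bowtie-build` and `bowtie` (adjust it if the index files live in a subdirectory), and the `samtools import` and prefix-style `samtools sort` calls are taken as written in the tutorial, so they depend on the installed samtools version.

```python
import subprocess


def mapping_commands(reference="ref1.fa", index="ref1",
                     reads1="illumina/reads1.fq", reads2="illumina/reads2.fq",
                     prefix="pair"):
    """Return the tutorial's commands as (argv, stdout_file) pairs, in order."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_prefix = f"{prefix}_sorted"
    return [
        (["bowtie-build", "-f", reference, index], None),
        # bowtie writes SAM to standard output, which is redirected to a file.
        (["bowtie", "-n", "1", "-l", "30", "-I", "0", "-X", "400",
          "--un", "unmapped", "-p", "2", "-S", index,
          "-1", reads1, "-2", reads2], sam),
        (["samtools", "faidx", reference], None),
        (["samtools", "import", f"{reference}.fai", sam, bam], None),
        (["samtools", "sort", bam, sorted_prefix], None),
        (["samtools", "index", f"{sorted_prefix}.bam"], None),
    ]


def run_all(commands):
    """Run each command in order; check=True raises on the first failure."""
    for argv, stdout_file in commands:
        if stdout_file is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_file, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


commands = mapping_commands()
# run_all(commands)  # requires bowtie and samtools on the PATH
```

Building the command list separately from running it makes it easy to print and inspect the exact commands before anything is executed.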

Modify the ecoli.conf

Add database:

[ecolisam:database]
db_adaptor     = Bio::DB::Sam
db_args        = -fasta /var/lib/gbrowse2/databases/ecolisam/ref1.fa
                 -bam /var/lib/gbrowse2/databases/ecolisam/pair_sorted.bam
search options = none

Add tracks:

[CoverageXyplot]
feature        = coverage
glyph          = wiggle_xyplot
database       = ecolisam
height         = 50
fgcolor        = black
bicolor_pivot  = 20
pos_color      = blue
neg_color      = red
key            = Coverage (xyplot)
category       = Reads
label          = 0  # Labels on wiggle tracks are redundant.
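The coverage track plots how many reads overlap each position, and `bicolor_pivot` splits the plot into two colours around a threshold. The Python sketch below illustrates the idea only. It is not how GBrowse or Bio::DB::Sam computes coverage, the read intervals are made up, and treating a value equal to the pivot as "positive" is an assumption.

```python
def coverage_depth(intervals, region_start, region_end):
    """Per-base read depth over [region_start, region_end], 1-based inclusive."""
    length = region_end - region_start + 1
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (length + 1)
    for start, end in intervals:
        lo = max(start, region_start)
        hi = min(end, region_end)
        if lo > hi:
            continue  # read lies entirely outside the region
        delta[lo - region_start] += 1
        delta[hi - region_start + 1] -= 1
    depth = []
    running = 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth


def pivot_color(value, pivot=20, pos_color="blue", neg_color="red"):
    """Colour assumed for a value relative to the pivot (equal counts as positive)."""
    return pos_color if value >= pivot else neg_color


reads = [(1, 5), (3, 8), (4, 6)]  # hypothetical aligned read intervals
depth = coverage_depth(reads, 1, 10)
colors = [pivot_color(value, pivot=2) for value in depth]
```

With the settings from the stanza, positions covered by fewer than 20 reads would be drawn in red and the rest in blue, which makes poorly covered regions easy to spot.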

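Create mapping database

- Create the directory ecolisam in /var/lib/gbrowse2/databases.
- Copy pair_sorted.bam and pair_sorted.bam.bai to ecolisam. The [ecolisam:database] stanza also points to ref1.fa in this directory, so the reference sequence must be available there as well.
- Set the permissions on ecolisam so that the browser can access the files.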

E.coli GBrowse