Accessing the 1000 Genomes Data. Paul Flicek European BioinformaMcs InsMtute



Similar documents
How-To: SNP and INDEL detection

Text file One header line meta information lines One line : variant/position

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

LifeScope Genomic Analysis Software 2.5

Next generation sequencing (NGS)

Delivering the power of the world s most successful genomics platform

Analysis of NGS Data

Module 1. Sequence Formats and Retrieval. Charles Steward

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Version 5.0 Release Notes

European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute

Using Ensembl tools for browsing ENCODE data

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Introduction to NGS data analysis

Using Galaxy for NGS Analysis. Daniel Blankenberg Postdoctoral Research Associate The Galaxy Team

Next Generation Sequencing: Technology, Mapping, and Analysis

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

SAP HANA Enabling Genome Analysis

Human Genomes and Big Data Challenges QUANTITY, QUALITY AND QUANDRY Gerry Higgins, M.D., Ph.D. AssureRx Health, Inc.

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

Genome Viewing. Module 2. Using Genome Browsers to View Annotation of the Human Genome

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

DNA Sequencing Data Compression. Michael Chung

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

GeneProf and the new GeneProf Web Services

New solutions for Big Data Analysis and Visualization

UGENE Quick Start Guide

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Data formats and file conversions

Genomes and SNPs in Malaria and Sickle Cell Anemia

Step by Step Guide to Importing Genetic Data into JMP Genomics

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

SRA File Formats Guide

Simplifying Data Interpretation with Nexus Copy Number

Practical Guideline for Whole Genome Sequencing

Visualisation tools for next-generation sequencing

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Comparing Methods for Identifying Transcription Factor Target Genes

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences

BioInterchange 2.0 CODAMONO. Integrating and Scaling Genomics Data

Module 3. Genome Browsing. Using Web Browsers to View Genome Annota4on. Kers4n Howe Wellcome Trust Sanger Ins4tute zfish-

Analysis of ChIP-seq data in Galaxy

SNPbrowser Software v3.5

Integrated Rule-based Data Management System for Genome Sequencing Data

Searching Nucleotide Databases

Data Analysis for Ion Torrent Sequencing

Global Alliance. Ewan Birney Associate Director EMBL-EBI

Basic processing of next-generation sequencing (NGS) data

Databases and mapping BWA. Samtools

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations

Processing Genome Data using Scalable Database Technology. My Background

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

A Tutorial in Genetic Sequence Classification Tools and Techniques

Top 10 Things to Know about WRDS

BioHPC Web Computing Resources at CBSU

THE UNIVERSITY OF MANCHESTER Unit Specification

Package hoarder. June 30, 2015

OpenCB a next generation big data analytics and visualisation platform for the Omics revolution

Automated and Scalable Data Management System for Genome Sequencing Data

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

MiSeq: Imaging and Base Calling

NatureServe s Environmental Review Tool

NaviCell Data Visualization Python API

GWASrap User Manual v1.1

Hadoopizer : a cloud environment for bioinformatics data analysis

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

SEQUENCING INITIATIVE SUOMI (SISU) SYMPOSIUM SPEAKERS August 26, 2014

Bioinformatics Resources at a Glance

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

Visualization with Excel Tools and Microsoft Azure

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Table of Contents. Introduction: 2. Settings: 6. Archive 9. Search Browse Schedule Archiving: 18

Search and Information Retrieval

Workload Characteristics of DNA Sequence Analysis: from Storage Systems Perspective

Functional Requirements for Digital Asset Management Project version /30/2006

Frequently Asked Questions Next Generation Sequencing

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

PROGRAMMING FOR BIOLOGISTS. BIOL 6297 Monday, Wednesday 10 am -12 pm

Transcription:

Accessing the 1000 Genomes Data Paul Flicek European BioinformaMcs InsMtute

Data access General informamon File access 1000 Genomes Browser Tools Where to find help

www.1000genomes.org

www.1000genomes.org

1000 Genomes Project Resources L. Clarke, H. Zheng-Bradley, R. Smith, E Kuleshea, I Toneva, B. Vaughan, P. Flicek and 1000 Genomes Consortium European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK Introduction DATA TYPE FILE FORMAT SIZE sequence FASTQ alignment BAM 56 Tbytes of BAM files variants VCF 38.9M SNPs ~4.7M short indels http://browser.1000genomes.org The 1000 Genomes project utilizes the Ensembl Browser to display our variant calls. We provide rapid access to project variant calls through the browser before they become available via dbsnp and DGVa. Tracks of 1000 genomes variants by population can be viewed in the location page: Variant Effect Predictor The predictor takes a list of variant positions and alleles, and predicts the effects of each of these on any overlapping features (transcripts, regulatory features) annotated in Ensembl. An example output is shown below: Data Slicer Many of the 1000 Genomes files are large and cumbersome to handle. The Data Slicer allows users to get data for specific regions of the genome and to avoid having to download many gigabytes of data they don t needl samples/ populations you choose. Below is the Data Slicer input interface: Discoverability The ftp site has a index updated nightly. This index is searchable from our website. http://www.1000genome.org/ftpsearch A list of variants can be obtained for any given transcript. In addition to basic information about a variant, PolyPhen and SIFT annotation are displayed to indicate the clinic significance of the variant. The search allows users to specify which ftp site to get paths to, to get md5 checksums and also filter out high volume results like bam and Allele frequency for individual variants in different populations is displayed fastq files on the Population Genetics page. We also have various routes for users to discover new data. Variation Pattern Finder The Variation Pattern Finder (VPF) allows one to look for patterns of shared variation between individuals in the a VCF file. Within a vcf file different samples have different combination of variation genotypes. The VPF looks for distinct variation combinations within a user specifed region, shared by different individuals. The VPF only on variations that functional consequences for protein coding genes such as non-synonymous coding SNPs and splice site changes. Users can Attach remote files as custom tracks. In example below, the HG00120 track is 1000 Genomes bam file added to the browser. Website http://www.1000genomes.org/announcements Twitter @1000genomes RSS http://www.1000genomes.org/announcements/rss.xml Email 1000announce@1000genomes.org Laura Clarke EBI laura@ebi.ac.uk http://browser.1000genomes.org/tools.html The project provides several tools to help users access and interpret the data provided. 43 Tbases raw sequence Sequence, alignment and variant data is made available as quickly as possible through the project ftp site. (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ ftp://ftp-trace.ncbi.nih.gov/1000genomes/). With more than 200,000 files though discovering new data can be difficult. Accessibility Visualization The main goal of the 1000 genomes project is to establish a comprehensive and detailed catalogue of human genome variations; which in turn will empower association studies to identify disease-causing genes. The project now has data and variant genotypes for more than 1000 individuals in 14 populations. The ftp site contains more than 120Tbytes of data in 200,000 files. Acknowledgements We would like to thank the Ensembl variation team for all their help, particularly Will McLaren and Graham Ritchie. Funding: The Wellcome Trust EMBL- EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UK T +44 (0) 1223 494 444 F +44 (0) 1223 494 468 http://www.ebi.ac.uk

Data access General informamon File access 1000 Genomes Browser Tools Where to find help

>p://>p.1000genomes.ebi.ac.uk >p://>p trace.ncbi.nih.gov/1000genomes/>p Site documentamon Sequences & alignments by sample ID Data sets to accompany the pilot data publicamon. Current and archive data set releases Pre release data sets and project working materials

BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078 2079 doi:10.1093/bioinformatics/btp352 Data formats and key tools The Sequence Alignment/Map and SAMtools Sequence analysis Heng Li 1,, Bob Handsaker 2,, Alec Wysoker 2, Tim Fennell 2, Jue Ruan 3, Nils Homer 4, Gabor Marth 5, Goncalo Abecasis 6, Richard Durbin 1, and 1000 Genome Project Data Processing Subgroup 7 1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, 2 Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, 3 Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, 4 Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, 5 Department of Biology, Boston College, Chestnut Hill, MA 02467, 6 Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7 http://1000genomes.org Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009 Advance Access publication June 8, 2009 Associate Editor: Alfonso Valencia BAM alignment files ABSTRACT 2 METHODS Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference 2.1 The SAM format sequences, supporting short and long reads (up to 128 Mbp) 2.1.1 Overview of the SAM format The SAM format consists of one header section and one alignment section. The lines in the header section produced by different sequencing platforms. It is flexible in style, start with character @, and lines in the alignment section do not. All lines compact in size, efficient in random access and is the format in which are TAB delimited. An example is shown in Figure 1b. alignments from the 1000 Genomes Project are released. SAMtools In SAM, each alignment line has 11 mandatory fields and a variable implements various utilities for post-processing alignments in the number of optional fields. The mandatory fields are briefly described in SAM format, such as indexing, variant caller and alignment viewer, Table 1. They must be present but their value can be a * or a zero (depending and thus provides universal tools for processing read alignments. on the field) if the corresponding information is unavailable. The optional Vol. 27 no. 15 2011, pages 2156 2158 Availability: http://samtools.sourceforge.net BIOINFORMATICS fields are presented as key-value APPLICATIONS pairs in the format of TAG:TYPE:VALUE. NOTE doi:10.1093/bioinformatics/btr330 Contact: rd@sanger.ac.uk They store extra information from the platform or aligner. For example, the RG tag keeps the read group information for each read. In combination Sequence analysis with the @RG header lines, this tag allows each read to be labeled with Advance Access publication June 7, 2011 1 INTRODUCTION metadata about its origin, sequencing center and library. The SAM format The variant specification call format gives a detailedand description VCFtools of each field and the predefined With the advent of novel sequencing technologies such as TAGs. Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, Petr Danecek 2008), a, Adam Auton 2,, Goncalo Abecasis 3, Cornelis A. Albers 1, Eric Banks 4, VCF variant files variety of new alignment tools (Langmead et al., 2009; Li et al., Mark A. DePristo 2.1.2, Extended Robert CIGAR E. Handsaker The standard CIGAR, Gerton descriptionlunter of pairwise, Gabor T. Marth 5, 2008) have been designed to realize efficient read mapping against alignment defines three operations: M for match/mismatch, I for insertion large reference sequences, including the human genome. Stephen These tools T. Sherry 6 compared, Gilean with the reference McVean 2,7 and D for, Richard deletion. TheDurbin extended CIGAR and 1000 Genomes Project generate alignments in different formats, however, complicating proposed in SAM added four more operations: N for skipped bases on Vol. 27 no. 5 2011, pages 718 719 Analysis Group downstream processing. A common alignment format that supports the reference, S for soft clipping, H for hard clipping and P BIOINFORMATICS for padding. APPLICATIONS NOTE doi:10.1093/bioinformatics/btq671 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2 Wellcome Trust Centre all sequence types and aligners creates a well-defined interface These support splicing, clipping, multi-part and padded alignments. Figure 1 for Human Genetics, between alignment and downstream analyses, including variant shows University examplesof ofoxford, CIGAR strings Oxford for different OX3 7BN, types of UK, alignments. Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, detection, genotyping and assembly. Program Sequence in Medical analysis and Population Genetics, Broad Advance Access publication January 5, 2011 The Sequence Alignment/Map (SAM) format Institute is designed of MIT toand 2.1.3 Harvard, Binary Cambridge, Alignment/Map MA format 02141, To improve Department the performance, of Biology, we Boston College, MA 02467, 6 National achieve this goal. It supports single- and paired-end Institutes reads of Health and National designed a Center companion for format Biotechnology Binary Alignment/Map Information, (BAM), Tabix: MD which 20894, is the fast USA and retrieval 7 Department of Statistics, sequence features from generic combining reads of different types, including color University space reads of from Oxford, binary Oxford representation OX1 3TG, of SAM UK and keeps exactly the same information as AB/SOLiD. It is designed to scale to alignment sets Associate of 10 11 Editor: SAM. BAM is compressed by the BGZF library, a generic library developed more John Quackenbush TAB-delimited files by us to achieve fast random access in a zlib-compatible compressed file. base pairs, which is typical for the deep resequencing of one human An example alignment of 112Gbp of Illumina GA data requires Heng 116GBLi of individual. ABSTRACT disk space (1.0 byte per input base), including sequences, Although base qualities generic and feature format (GFF) has recently been extended In this article, we present an overview of the SAM format and Summary: The variantall call theformat meta information (VCF) is agenerated genericbyformat MAQ. for Most ofto thisstandardize space Program usedstorage in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA of variant information in genome variant briefly introduce the companion SAMtools software package. A storing DNA polymorphism store data the base such qualities. as SNPs, insertions, deletions format Associate (GVF) (Reese Editor: et Dmitrij al., 2010), Frishman this is not tailored for storing detailed format specification and the complete documentation of and structural variants, together with rich annotations. VCF is usually information across many samples. We have designed the VCF SAMtools are available at the SAMtools web site. stored a compressed2.1.4 manner Sorting and and canindexing be ASAM/BAM for fast data file can beformat unsorted, tobut besorting scalable so as to encompass millions of sites with All indexed for fast retrieval by coordinate is used to streamline data processing and to avoid ABSTRACT loading 2 METHODS retrieval of variants from a range of positions on the reference genotype data and annotations from thousands of samples. We have To whom correspondence should be addressed. extra alignments into memory. A position-sorted BAM file cansummary: be indexed. Tabix is the first generic tool that indexes position sorted genome. The format was developed for the 1000 Genomes Project, adopted a textual encoding, with complementary indexing, to allow Tabix indexing is a generalization of BAM indexing for generic TABfiles andin simple TAB-delimited formats such as GFF, BED, PSL, SAM and delimited files. It inherits all the advantages of BAM indexing, including The authors wish it to be known that, in their opinion, the first two authors We combine the UCSC binning scheme (Kent et al., 2002) should be regarded as Joint First Authors. and has also been adopted by other projects such as UK10K, easy generation of the files while maintaining fast data access. linear indexing to achieve fast random retrieval of alignments overlapping SQL export, a and quickly retrieves features overlapping specified data compression and efficient random access in terms of few seek function dbsnp and the NHLBI Exome Project. VCFtools is a software suite In this article, we present an overview of the VCF and briefly regions. Tabix features include few seek function calls per query, data calls per query. that implements various utilities for processing VCF files, including introduce the companion VCFtools software package. A detailed compression with gzip compatibility and direct FTP/HTTP access. validation, merging, comparing and also provides a general Perl API. format specification and the complete documentation of VCFtools 2009 The Author(s) Tabix is implemented as a free command-line tool as well as a library This is an Open Access article distributed under the terms Availability: of the Creative http://vcftools.sourceforge.net Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ are available at the VCFtools web site. 2.1 Sorting and BGZF compression in C, Java, Perl and Python. It is particularly useful for manually by-nc/2.0/uk/) which permits unrestricted non-commercial Contact: use, distribution, rd@sanger.ac.uk and reproduction in any medium, provided the original work is properly cited. Before being indexed, the data file needs to be sorted first by sequence name examining local genomic features on the command line and enables 2 METHODS and then by leftmost coordinate, which can be done with the standard Unix Received on October 28, 2010; revised on May 4, 2011; accepted genome viewers to support huge data files and remote custom tracks sort. The sorted file should then be compressed with the bgzip program Downloaded from bioinformatics.oxfordjournals.org at Genome Research Ltd on October 13, 2011 Downloaded from bioinformatics.oxfordjournals.org at Genome Rese

1000 Genomes is in the Amazon cloud 1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com You can see the XML at hop://1000genomes.s3.amazonaws.com

Data access General informamon File access 1000 Genomes Browser Tools Where to find help

hop://browser.1000genomes.org

Gene variamon zoom

PopulaMon

SIFT PolyPhen

File upload to view with 1000 Genomes data Supports popular file types: BAM, BED, bedgraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF*, WIG * VCF must be indexed

Uploaded VCF Example: Comparison of August calls and /technical/working/20110502_vqsr_phase1_wgs_snps/all.wgs.phase1.projectconsensus.snps.sites.vcf.gz

1000 Genomes Browser For further informamon on the capabilimes of the browser and its use, aoend the Ensembl New Users Workshop on Saturday at 12:30

hop://pilotbrowser.1000genomes.org

Data access General informamon File access 1000 Genomes Browser Tools Where to find help

hop://browser.1000genomes.org

Tools page

Ensembl Variant Effector Predictor (VEP) Takes list of variamon and annotates with respect to Ensembl features Returns whether the SNP has been seen in the 1000 Genomes and if it has an rs number (if one has been assigned) Returns SIFT, PolyPhen and Condel scores Extensive filtering opmons by MAF and populamons Web and command line versions

Data slicer for subsets of the data

hop://trace.ncbi.nlm.nih.gov/traces/1kg_slicer/ Sliced BAM to files

VariaMon PaOern Finder hop://browser.1000genomes.org/ Homo_sapiens/UserData/VariaMonsMapVCF VCF input Discovers paoerns of Shared Inheritance Variants with funcmonal consequences considered Web output with csv and excel downloads

Access to backend Ensembl databases Public MySQL database at mysql db.1000genomes.org port 4272 Full programmamc access with Ensembl API More informamon on the use of the Ensembl API at the Ensembl Advanced Users Workshop tomorrow

Data access General informamon File access 1000 Genomes Browser Tools Where to find help

Credits & Contact Eugene Kulesha, Iliana Toneva, Bren Vaughan Will McLaren, Graham Ritchie, Fiona Cunningham Laura Clarke, Holly Zheng-Bradley, Rick Smith Steve Sherry, Chunlin Xiao For more information contact info@1000genomes.org