Accessing the 1000 Genomes Data Paul Flicek European BioinformaMcs InsMtute
Data access General informamon File access 1000 Genomes Browser Tools Where to find help
www.1000genomes.org
www.1000genomes.org
1000 Genomes Project Resources L. Clarke, H. Zheng-Bradley, R. Smith, E Kuleshea, I Toneva, B. Vaughan, P. Flicek and 1000 Genomes Consortium European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK Introduction DATA TYPE FILE FORMAT SIZE sequence FASTQ alignment BAM 56 Tbytes of BAM files variants VCF 38.9M SNPs ~4.7M short indels http://browser.1000genomes.org The 1000 Genomes project utilizes the Ensembl Browser to display our variant calls. We provide rapid access to project variant calls through the browser before they become available via dbsnp and DGVa. Tracks of 1000 genomes variants by population can be viewed in the location page: Variant Effect Predictor The predictor takes a list of variant positions and alleles, and predicts the effects of each of these on any overlapping features (transcripts, regulatory features) annotated in Ensembl. An example output is shown below: Data Slicer Many of the 1000 Genomes files are large and cumbersome to handle. The Data Slicer allows users to get data for specific regions of the genome and to avoid having to download many gigabytes of data they don t needl samples/ populations you choose. Below is the Data Slicer input interface: Discoverability The ftp site has a index updated nightly. This index is searchable from our website. http://www.1000genome.org/ftpsearch A list of variants can be obtained for any given transcript. In addition to basic information about a variant, PolyPhen and SIFT annotation are displayed to indicate the clinic significance of the variant. The search allows users to specify which ftp site to get paths to, to get md5 checksums and also filter out high volume results like bam and Allele frequency for individual variants in different populations is displayed fastq files on the Population Genetics page. We also have various routes for users to discover new data. Variation Pattern Finder The Variation Pattern Finder (VPF) allows one to look for patterns of shared variation between individuals in the a VCF file. Within a vcf file different samples have different combination of variation genotypes. The VPF looks for distinct variation combinations within a user specifed region, shared by different individuals. The VPF only on variations that functional consequences for protein coding genes such as non-synonymous coding SNPs and splice site changes. Users can Attach remote files as custom tracks. In example below, the HG00120 track is 1000 Genomes bam file added to the browser. Website http://www.1000genomes.org/announcements Twitter @1000genomes RSS http://www.1000genomes.org/announcements/rss.xml Email 1000announce@1000genomes.org Laura Clarke EBI laura@ebi.ac.uk http://browser.1000genomes.org/tools.html The project provides several tools to help users access and interpret the data provided. 43 Tbases raw sequence Sequence, alignment and variant data is made available as quickly as possible through the project ftp site. (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ ftp://ftp-trace.ncbi.nih.gov/1000genomes/). With more than 200,000 files though discovering new data can be difficult. Accessibility Visualization The main goal of the 1000 genomes project is to establish a comprehensive and detailed catalogue of human genome variations; which in turn will empower association studies to identify disease-causing genes. The project now has data and variant genotypes for more than 1000 individuals in 14 populations. The ftp site contains more than 120Tbytes of data in 200,000 files. Acknowledgements We would like to thank the Ensembl variation team for all their help, particularly Will McLaren and Graham Ritchie. Funding: The Wellcome Trust EMBL- EBI Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD UK T +44 (0) 1223 494 444 F +44 (0) 1223 494 468 http://www.ebi.ac.uk
Data access General informamon File access 1000 Genomes Browser Tools Where to find help
>p://>p.1000genomes.ebi.ac.uk >p://>p trace.ncbi.nih.gov/1000genomes/>p Site documentamon Sequences & alignments by sample ID Data sets to accompany the pilot data publicamon. Current and archive data set releases Pre release data sets and project working materials
BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078 2079 doi:10.1093/bioinformatics/btp352 Data formats and key tools The Sequence Alignment/Map and SAMtools Sequence analysis Heng Li 1,, Bob Handsaker 2,, Alec Wysoker 2, Tim Fennell 2, Jue Ruan 3, Nils Homer 4, Gabor Marth 5, Goncalo Abecasis 6, Richard Durbin 1, and 1000 Genome Project Data Processing Subgroup 7 1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK, 2 Broad Institute of MIT and Harvard, Cambridge, MA 02141, USA, 3 Beijing Institute of Genomics, Chinese Academy of Science, Beijing 100029, China, 4 Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, 5 Department of Biology, Boston College, Chestnut Hill, MA 02467, 6 Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA and 7 http://1000genomes.org Received on April 28, 2009; revised on May 28, 2009; accepted on May 30, 2009 Advance Access publication June 8, 2009 Associate Editor: Alfonso Valencia BAM alignment files ABSTRACT 2 METHODS Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference 2.1 The SAM format sequences, supporting short and long reads (up to 128 Mbp) 2.1.1 Overview of the SAM format The SAM format consists of one header section and one alignment section. The lines in the header section produced by different sequencing platforms. It is flexible in style, start with character @, and lines in the alignment section do not. All lines compact in size, efficient in random access and is the format in which are TAB delimited. An example is shown in Figure 1b. alignments from the 1000 Genomes Project are released. SAMtools In SAM, each alignment line has 11 mandatory fields and a variable implements various utilities for post-processing alignments in the number of optional fields. The mandatory fields are briefly described in SAM format, such as indexing, variant caller and alignment viewer, Table 1. They must be present but their value can be a * or a zero (depending and thus provides universal tools for processing read alignments. on the field) if the corresponding information is unavailable. The optional Vol. 27 no. 15 2011, pages 2156 2158 Availability: http://samtools.sourceforge.net BIOINFORMATICS fields are presented as key-value APPLICATIONS pairs in the format of TAG:TYPE:VALUE. NOTE doi:10.1093/bioinformatics/btr330 Contact: rd@sanger.ac.uk They store extra information from the platform or aligner. For example, the RG tag keeps the read group information for each read. In combination Sequence analysis with the @RG header lines, this tag allows each read to be labeled with Advance Access publication June 7, 2011 1 INTRODUCTION metadata about its origin, sequencing center and library. The SAM format The variant specification call format gives a detailedand description VCFtools of each field and the predefined With the advent of novel sequencing technologies such as TAGs. Illumina/Solexa, AB/SOLiD and Roche/454 (Mardis, Petr Danecek 2008), a, Adam Auton 2,, Goncalo Abecasis 3, Cornelis A. Albers 1, Eric Banks 4, VCF variant files variety of new alignment tools (Langmead et al., 2009; Li et al., Mark A. DePristo 2.1.2, Extended Robert CIGAR E. Handsaker The standard CIGAR, Gerton descriptionlunter of pairwise, Gabor T. Marth 5, 2008) have been designed to realize efficient read mapping against alignment defines three operations: M for match/mismatch, I for insertion large reference sequences, including the human genome. Stephen These tools T. Sherry 6 compared, Gilean with the reference McVean 2,7 and D for, Richard deletion. TheDurbin extended CIGAR and 1000 Genomes Project generate alignments in different formats, however, complicating proposed in SAM added four more operations: N for skipped bases on Vol. 27 no. 5 2011, pages 718 719 Analysis Group downstream processing. A common alignment format that supports the reference, S for soft clipping, H for hard clipping and P BIOINFORMATICS for padding. APPLICATIONS NOTE doi:10.1093/bioinformatics/btq671 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, 2 Wellcome Trust Centre all sequence types and aligners creates a well-defined interface These support splicing, clipping, multi-part and padded alignments. Figure 1 for Human Genetics, between alignment and downstream analyses, including variant shows University examplesof ofoxford, CIGAR strings Oxford for different OX3 7BN, types of UK, alignments. Center for Statistical Genetics, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, detection, genotyping and assembly. Program Sequence in Medical analysis and Population Genetics, Broad Advance Access publication January 5, 2011 The Sequence Alignment/Map (SAM) format Institute is designed of MIT toand 2.1.3 Harvard, Binary Cambridge, Alignment/Map MA format 02141, To improve Department the performance, of Biology, we Boston College, MA 02467, 6 National achieve this goal. It supports single- and paired-end Institutes reads of Health and National designed a Center companion for format Biotechnology Binary Alignment/Map Information, (BAM), Tabix: MD which 20894, is the fast USA and retrieval 7 Department of Statistics, sequence features from generic combining reads of different types, including color University space reads of from Oxford, binary Oxford representation OX1 3TG, of SAM UK and keeps exactly the same information as AB/SOLiD. It is designed to scale to alignment sets Associate of 10 11 Editor: SAM. BAM is compressed by the BGZF library, a generic library developed more John Quackenbush TAB-delimited files by us to achieve fast random access in a zlib-compatible compressed file. base pairs, which is typical for the deep resequencing of one human An example alignment of 112Gbp of Illumina GA data requires Heng 116GBLi of individual. ABSTRACT disk space (1.0 byte per input base), including sequences, Although base qualities generic and feature format (GFF) has recently been extended In this article, we present an overview of the SAM format and Summary: The variantall call theformat meta information (VCF) is agenerated genericbyformat MAQ. for Most ofto thisstandardize space Program usedstorage in Medical Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA of variant information in genome variant briefly introduce the companion SAMtools software package. A storing DNA polymorphism store data the base such qualities. as SNPs, insertions, deletions format Associate (GVF) (Reese Editor: et Dmitrij al., 2010), Frishman this is not tailored for storing detailed format specification and the complete documentation of and structural variants, together with rich annotations. VCF is usually information across many samples. We have designed the VCF SAMtools are available at the SAMtools web site. stored a compressed2.1.4 manner Sorting and and canindexing be ASAM/BAM for fast data file can beformat unsorted, tobut besorting scalable so as to encompass millions of sites with All indexed for fast retrieval by coordinate is used to streamline data processing and to avoid ABSTRACT loading 2 METHODS retrieval of variants from a range of positions on the reference genotype data and annotations from thousands of samples. We have To whom correspondence should be addressed. extra alignments into memory. A position-sorted BAM file cansummary: be indexed. Tabix is the first generic tool that indexes position sorted genome. The format was developed for the 1000 Genomes Project, adopted a textual encoding, with complementary indexing, to allow Tabix indexing is a generalization of BAM indexing for generic TABfiles andin simple TAB-delimited formats such as GFF, BED, PSL, SAM and delimited files. It inherits all the advantages of BAM indexing, including The authors wish it to be known that, in their opinion, the first two authors We combine the UCSC binning scheme (Kent et al., 2002) should be regarded as Joint First Authors. and has also been adopted by other projects such as UK10K, easy generation of the files while maintaining fast data access. linear indexing to achieve fast random retrieval of alignments overlapping SQL export, a and quickly retrieves features overlapping specified data compression and efficient random access in terms of few seek function dbsnp and the NHLBI Exome Project. VCFtools is a software suite In this article, we present an overview of the VCF and briefly regions. Tabix features include few seek function calls per query, data calls per query. that implements various utilities for processing VCF files, including introduce the companion VCFtools software package. A detailed compression with gzip compatibility and direct FTP/HTTP access. validation, merging, comparing and also provides a general Perl API. format specification and the complete documentation of VCFtools 2009 The Author(s) Tabix is implemented as a free command-line tool as well as a library This is an Open Access article distributed under the terms Availability: of the Creative http://vcftools.sourceforge.net Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ are available at the VCFtools web site. 2.1 Sorting and BGZF compression in C, Java, Perl and Python. It is particularly useful for manually by-nc/2.0/uk/) which permits unrestricted non-commercial Contact: use, distribution, rd@sanger.ac.uk and reproduction in any medium, provided the original work is properly cited. Before being indexed, the data file needs to be sorted first by sequence name examining local genomic features on the command line and enables 2 METHODS and then by leftmost coordinate, which can be done with the standard Unix Received on October 28, 2010; revised on May 4, 2011; accepted genome viewers to support huge data files and remote custom tracks sort. The sorted file should then be compressed with the bgzip program Downloaded from bioinformatics.oxfordjournals.org at Genome Research Ltd on October 13, 2011 Downloaded from bioinformatics.oxfordjournals.org at Genome Rese
1000 Genomes is in the Amazon cloud 1KG pilot content (BAM) is available at s3://1000genomes.s3.amazonaws.com You can see the XML at hop://1000genomes.s3.amazonaws.com
Data access General informamon File access 1000 Genomes Browser Tools Where to find help
hop://browser.1000genomes.org
Gene variamon zoom
PopulaMon
SIFT PolyPhen
File upload to view with 1000 Genomes data Supports popular file types: BAM, BED, bedgraph, BigWig, GBrowse, Generic, GFF, GTF, PSL, VCF*, WIG * VCF must be indexed
Uploaded VCF Example: Comparison of August calls and /technical/working/20110502_vqsr_phase1_wgs_snps/all.wgs.phase1.projectconsensus.snps.sites.vcf.gz
1000 Genomes Browser For further informamon on the capabilimes of the browser and its use, aoend the Ensembl New Users Workshop on Saturday at 12:30
hop://pilotbrowser.1000genomes.org
Data access General informamon File access 1000 Genomes Browser Tools Where to find help
hop://browser.1000genomes.org
Tools page
Ensembl Variant Effector Predictor (VEP) Takes list of variamon and annotates with respect to Ensembl features Returns whether the SNP has been seen in the 1000 Genomes and if it has an rs number (if one has been assigned) Returns SIFT, PolyPhen and Condel scores Extensive filtering opmons by MAF and populamons Web and command line versions
Data slicer for subsets of the data
hop://trace.ncbi.nlm.nih.gov/traces/1kg_slicer/ Sliced BAM to files
VariaMon PaOern Finder hop://browser.1000genomes.org/ Homo_sapiens/UserData/VariaMonsMapVCF VCF input Discovers paoerns of Shared Inheritance Variants with funcmonal consequences considered Web output with csv and excel downloads
Access to backend Ensembl databases Public MySQL database at mysql db.1000genomes.org port 4272 Full programmamc access with Ensembl API More informamon on the use of the Ensembl API at the Ensembl Advanced Users Workshop tomorrow
Data access General informamon File access 1000 Genomes Browser Tools Where to find help
Credits & Contact Eugene Kulesha, Iliana Toneva, Bren Vaughan Will McLaren, Graham Ritchie, Fiona Cunningham Laura Clarke, Holly Zheng-Bradley, Rick Smith Steve Sherry, Chunlin Xiao For more information contact info@1000genomes.org