DGVa Submission Template Accompanying Notes




Submission data is grouped into the following sections (worksheets):

1) CONTACTS
Contact information for the submitter and study authors.

2) STUDY
General information about the study.

3) SAMPLESETS
Describes the grouping of subjects and samples within the study.

4) SAMPLES
Describes the subjects and samples used in the study.

5) EXPERIMENTS
Describes the experiments employed to call and validate variants. The preceding Info-EXPERIMENTS worksheet provides information and instructions for completing the EXPERIMENTS worksheet.

6) VARIANT_CALLS
Describes the individual variation in subjects / samples. The preceding Info-VARIANT CALLS worksheet provides information and instructions for completing the VARIANT_CALLS worksheet.

7) VARIANT_REGIONS
Describes genomic regions containing structural variation based on variation observed in samples (variant calls). The preceding Info-VARIANT REGIONS worksheet provides information and instructions for completing the VARIANT_REGIONS worksheet. Use this sheet only if you wish to link a number of calls to a single region on the genome. If you do not provide VARIANT_REGIONS information, DGVa will create a variant region for each variant call that is supplied in the VARIANT_CALLS worksheet.

Each worksheet contains a number of fields:

Submission of information for fields highlighted in grey is required. A lighter shade of grey indicates a field that is required only in certain circumstances. Completion of the remaining fields is optional; however, please provide as much information as you can. Please avoid the use of non-ASCII characters.

The file also contains 2 other worksheets, Info-CV and Info-Links. Info-CV contains the controlled vocabulary required for some fields, and Info-Links describes the format for fields referencing other resources.

TAB-DELIMITED TEXT FILES
The spreadsheet may be replaced by tab-delimited text files. The name of each file should be identical to that of the corresponding worksheet. Use the worksheet's field names as the header row, and include only columns for which you provide data.

GENOTYPES
If you wish to submit genotypes (either as part of a discovery study or for variant regions that have previously been accessioned), please contact the DGVa helpdesk at dgva-helpdesk@ebi.ac.uk.

Email your completed submission spreadsheet and/or files to dgva-admin@ebi.ac.uk.
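The tab-delimited option could be produced along the following lines. This is only a minimal sketch: the function name write_worksheet_tsv, the example contact and the in-memory buffer are illustrative assumptions, not part of the DGVa template.

```python
import csv
import io

def write_worksheet_tsv(rows, field_order, out):
    """Write rows (a list of dicts) as tab-delimited text.

    The header row uses the worksheet's field names, and only columns
    that hold data in at least one row are written.
    """
    used = [f for f in field_order
            if any(str(r.get(f, "")).strip() for r in rows)]
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(used)
    for r in rows:
        writer.writerow([r.get(f, "") for f in used])
    return used

# Hypothetical CONTACTS record; the person and address are made up.
contacts = [
    {"contact_role": "Submitter", "first_name": "Jane",
     "last_name": "Smith", "contact_email": "jane.smith@example.org"},
]
fields = ["contact_role", "first_name", "last_name", "contact_email",
          "contact_phone", "affiliation_name"]

buf = io.StringIO()
write_worksheet_tsv(contacts, fields, buf)
print(buf.getvalue())
```

The empty contact_phone and affiliation_name columns are dropped, so the output contains only the four fields that were filled in.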

CONTACTS

Provide information on the submitter, first author, and senior author. Information for additional contacts is optional.

FIELD | NOTES
contact_role | The role of the contact in the study. Choose a term from the contact_role CV.
first_name | The first / given name of the contact.
last_name | The last / family name of the contact.
contact_email | The email address of the contact.
contact_phone | The telephone number of the contact.
affiliation_name | The name of the contact's affiliation.
affiliation_address | The postal address of the contact's affiliation.
affiliation_url | The URL of the affiliation's website.

Controlled vocabulary for CONTACTS fields:

contact_role: Submitter, First author, Senior author, Curator, Other

STUDY

FIELD | NOTES
study_id | A study identifier in the format AuthorYear, where Author is the last name of the first author and Year is the year (or anticipated year) of publication, e.g. Smith2012.
study_description | A short description (<700 characters) of the study. This will be shown on DGVa's study display page.
study_alias | A comma-delimited list of alternative names by which the study will be indexed, e.g. 1000 Genomes, Thousand Genomes.
study_type | Choose a term from the study_type CV. Case-Control = comparison between healthy and diseased samples, Case-Set = disease-associated variation, Collection = an arbitrary collection of variations, Control Set = variation within normal samples, Curated = compiled from the literature or online resources, Somatic = variation in somatic cells only, Tumour vs. matched normal = variation between tumour and healthy samples from the same individual.
taxonomy_id | The NCBI taxonomy identifier(s), as a comma-delimited list, for the species studied. To find a taxonomy ID, go to http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy, click the link for the species and look for "Taxonomy ID=".
project_id | The BioProject identifier.
pubmed_id | ** REQUIRED if the study has been published. The NCBI PubMed ID(s), as a comma-delimited list.
study_url | If the study has a website, provide the URL.
EGA_id | If the study has been submitted to the European Genome-phenome Archive (EGA), provide the EGA study accession number.
hold_date | If you wish to delay public release of the study data, enter a date for public release, in the format YYYY-MM-DD.

Controlled vocabulary for STUDY fields:

study_type: Case-Control, Case-Set, Collection, Control Set, Somatic, Tumor vs. Matched-Normal
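The two formatted STUDY fields, study_id (AuthorYear) and hold_date (YYYY-MM-DD), can be checked before submission. The sketch below is an illustration only: the function names are hypothetical, and the set of characters allowed in the author part (ASCII letters, hyphen, apostrophe) is an assumption, since the notes give only the example Smith2012.

```python
import re
from datetime import datetime

# Assumption: author part is ASCII letters, hyphens or apostrophes,
# followed by a four-digit year (e.g. Smith2012).
STUDY_ID_RE = re.compile(r"^[A-Za-z][A-Za-z'\-]*\d{4}$")

def check_study_id(study_id):
    """Return True if study_id looks like AuthorYear."""
    return bool(STUDY_ID_RE.match(study_id))

def check_hold_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates unpadded months/days, so also require full width.
    return len(value) == 10

print(check_study_id("Smith2012"))
print(check_hold_date("2013-06-30"))
```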

SAMPLESETS

FIELD | NOTES
sampleset_id | A unique numeric identifier for the sampleset.
sampleset_type | ** REQUIRED if study_type (on the STUDY worksheet) is Case-Control or Tumor vs. Matched-Normal. Specify whether this sampleset is a Case or Control set. Choose a term from the sampleset_type CV. There must be at least one Case sampleset and at least one Control sampleset.
sampleset_name | The name of the sampleset.
sampleset_size | The total number of samples in the sampleset.
sampleset_description | A brief description (<100 words) of the sampleset.
sampleset_phenotype | The phenotype(s) of the sampleset, using one or more identifiers from a recognised vocabulary. The required format is described in the Phenotype section of the Info-Links worksheet. Use a comma-delimited list for multiple entries. Text descriptions are not allowed.
sampleset_population | Indicate the sampleset's ethnicity or population (if human), or strain or breed (if non-human).
sampleset_sex | Choose a term from the sampleset_sex CV.
sampleset_taxonomy_id | The NCBI taxonomy identifier of the species in this sampleset. To find a taxonomy ID, go to http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy, click the link for the species and look for "Taxonomy ID=".

Controlled vocabulary for SAMPLESETS fields:

sampleset_sex: Female, Male, Unknown
sampleset_type: Case, Control
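The conditional requirement on sampleset_type can be expressed as a small check. This is a sketch under stated assumptions: the function check_samplesets and the dict-per-row representation are hypothetical, and the study_type spellings are taken from the STUDY controlled vocabulary.

```python
# study_type values for which sampleset_type is required (STUDY CV spelling).
TYPED_STUDIES = {"Case-Control", "Tumor vs. Matched-Normal"}
SAMPLESET_TYPES = ("Case", "Control")

def check_samplesets(study_type, samplesets):
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    if study_type not in TYPED_STUDIES:
        return problems  # sampleset_type is optional for other study types
    seen = set()
    for s in samplesets:
        t = s.get("sampleset_type", "")
        if t in SAMPLESET_TYPES:
            seen.add(t)
        else:
            problems.append(
                f"sampleset {s.get('sampleset_id')}: "
                "sampleset_type must be Case or Control")
    for required in SAMPLESET_TYPES:
        if required not in seen:
            problems.append(f"no {required} sampleset")
    return problems

rows = [{"sampleset_id": 1, "sampleset_type": "Case"},
        {"sampleset_id": 2, "sampleset_type": "Control"}]
print(check_samplesets("Case-Control", rows))
```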

SAMPLES

A subject can provide more than 1 sample. For example, X is the subject from whom 2 samples were investigated: the normal liver cells and the cancer liver cells. The subject:sample ratio is therefore 1:2.

FIELD | NOTES
subject_id | A unique identifier for the subject. Use the commonly used identifier if the subject is known within the public domain, e.g. GM12878.
subject_collection | If the subject is included in any of these public collections, choose a term from the subject_collection CV.
subject_population | The subject's ethnicity or population (if human), or strain or breed (if non-human).
subject_taxonomy_id | The NCBI taxonomy identifier of the subject. To find a taxonomy ID, go to http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy, click the link for the species and look for "Taxonomy ID=".
subject_sex | Choose a term from the subject_sex CV.
subject_age | The current age of the subject. Use a numerical value only, e.g. 40 (the unit is specified in the subject_age_unit field below).
subject_age_unit | ** REQUIRED if subject_age is provided. The unit (singular) of the subject's age. Choose a term from the subject_age_unit CV. Units prefixed with 'G' are gestational age units.
subject_phenotype | The phenotype(s) of the subject, using one or more identifiers from a recognised vocabulary. The required format is described in the Phenotype section of the Info-Links worksheet. Use a comma-delimited list for multiple entries. Text descriptions are not allowed.
subject_karyotype | The subject's karyotype in ISCN notation, e.g. 47,XXY or 46,XY,t(12;16)(q13;p11) (see http://www.currentprotocols.com/protocol/hga04c).
subject_maternal_id | The subject_id of the subject's mother.
subject_paternal_id | The subject_id of the subject's father.
sample_id | A unique identifier for the sample or the commonly used public identifier, e.g. NA12878.
sampleset_id | The identifier(s) of the sampleset(s) (sampleset_id from the SAMPLESETS worksheet) to which the sample belongs, as a comma-delimited list.
sample_resource | If sample information has been deposited to a samples database (e.g. BioSD), or is commercially available (e.g. from the Jackson Laboratory), provide the details. The required format is described in the Sample Resource section of the Info-Links worksheet.
sample_type | A brief description of the cell and/or tissue from which the sample was derived, e.g. Whole blood, B-Lymphocyte, BAC library, Published sequence, Brain, etc.
sample_cancer | Enter 'Yes' if the sample was derived from cancerous cells or tissue.
sample_karyotype | The sample's karyotype in ISCN notation, e.g. 47,XXY or 46,XY,t(12;16)(q13;p11) (see http://www.currentprotocols.com/protocol/hga04c).
sample_attributes | Other relevant attributes of the sample, e.g. histological data or the results of biological assays. Identifiers (as for subject_phenotype) can also be used here. Separate all terms and text descriptions with a comma.

Controlled vocabulary for SAMPLES fields:

subject_collection: Autism, CEPH, HapMap, HGDP, NINDS, PDR, 1000 Genomes
subject_sex: Female, Male, Unknown
subject_age_unit: Day, Week, Month, Year, GDay, GWeek, GMonth

EXPERIMENTS

Describe each experimental procedure that generated and/or validated variant calls. Provide an EXPERIMENTS record for each procedure carried out, even if the experiment was a replicate experiment. Experiments carried out at different sites can be recorded using the 'site' field. Experiments can be merged, for example if the calls were made by merging the results of 2 different calling algorithms.

FIELD | NOTES
experiment_id | A unique numeric identifier for each experiment (e.g. 1, 2, 3 etc.).
experiment_resolution | An estimate (in KB) of the resolution (do not append 'KB'). Acceptable qualifiers include >, <, =, and ~. ≥ should be expressed as >=, and ≤ should be expressed as <=. Basepair resolution should be reported as BP.
experiment_alpha | The significance level relating to confidence intervals surrounding genomic co-ordinates of variants identified by this experiment, e.g. use 0.05 to indicate a 95% confidence interval.
merged_experiment_ids | ** REQUIRED if this experiment merges two or more experiments. The experiment identification numbers (experiment_id above), as a comma-delimited list, of all experiments that were merged into this experiment. Also ensure that method_type, method_platform and analysis_type are all set to Merging.
method_type | The method type. Choose a term from the method_type CV.
method_platform | If the platform is entered in a public database (e.g. ArrayExpress), provide the details. The required format is described in the Archive section of the Info-Links worksheet. Otherwise, provide the name, e.g. Illumina HiSeq 2000, Roche NimbleGen 42M.
method_description | A brief description of the method.
analysis_type | The type of analysis performed on data generated by the method. Choose a term from the analysis_type CV.
analysis_description | ** REQUIRED if analysis_type = Other. A brief description of the analysis applied to data generated by the method.
reference_type | Indicate the type of reference data used. Choose a term from the reference_type CV. Use 'Other' if none of the other terms applies (e.g. Affymetrix's baseline data) or if method_type = Merging.
reference_value | Additional information about the reference_type. For Genome, provide the assembly name; for Sample or Sampleset, provide the sample_id(s) or sampleset_id(s) as a comma-delimited list. For Other, provide a brief description, e.g. Affymetrix's baseline data, or Merged experiment.
detection_method | The name(s) of the variant detection software (including parameters), e.g. PEMer, as a comma-delimited list.
detection_description | Additional details of the detection method used.
site | For use in multi-centre studies. The site where this experimental procedure was performed.

external_links | If raw data generated by this experiment has been deposited elsewhere (e.g. ENA), provide details. The required format is described in the Archive section of the Info-Links worksheet. Use a comma-delimited list for more than one.
library_name | If this experiment sequenced a library from CloneDB, please enter the library name(s), as a comma-delimited list (see http://www.ncbi.nlm.nih.gov/clone/library/genomic/). If the library is not listed, please contact dbvar@ncbi.nlm.nih.gov.
library_url | If sequences are available via a Trace search, please enter a TRACE:search string, e.g. TRACE:SPECIES_CODE='HOMO SAPIENS' AND CENTER_PROJECT='ABC7'. Otherwise please enter a URL from which the library sequences can be downloaded. Use a comma-delimited list if necessary.

Controlled vocabulary for EXPERIMENTS fields:

method_type: BAC aCGH, Curated, Digital array, FISH, Gene expression array, Karyotyping, MAPH, MassSpec, Merging, Microsatellite genotyping, MLPA, Multiple complete digestion, Oligo aCGH, Optical mapping, PCR, qPCR, ROMA, RT-PCR, Sequencing, SNP array, Southern, Western

analysis_type: Curated, BAC assembly, de novo and local sequence assembly, de novo sequence assembly, Genotyping, Local sequence assembly, Manual observation, Merging, MCD analysis, One end anchored assembly, Optical mapping, Other, Paired-end mapping, Probe signal intensity, Read depth, Read depth and paired-end mapping, Sequence alignment, SNP genotyping analysis, Split read and paired-end mapping, Split read mapping

reference_type: Assembly, Control tissue, Other, Sampleset, Sample

VARIANT_CALLS

FIELD | NOTES
variant_call_id | A unique identifier for the variant call.
experiment_id | The identifier of the experiment (experiment_id from the EXPERIMENTS worksheet) that generated this variant call.
sample_id | ** REQUIRED if the call was made in an individual sample. The sample identifier (sample_id from the SAMPLES worksheet) in which the call was made.
sampleset_id | ** REQUIRED if the call was made in a sampleset. The sampleset identifier (sampleset_id from the SAMPLESETS worksheet) in which the call was made.
variant_call_type | Choose a term from the variant_call_type CV. If the type is unknown, use Sequence alteration.
validation | Indicate the procedure (as described in the EXPERIMENTS worksheet) and result of experimental validation for the call, in the format experiment_id:result. Use a comma-delimited list for more than one experimental validation, e.g. 2:Pass,3:Fail,4:Inconclusive
description | Additional information about the variant call.
sequence | The sequence of the variant if less than 100 base pairs. Please use valid IUPAC codes only.
hgvs_name | The Human Genome Variation Society name of the variant.
zygosity | The zygosity of the variant, from the zygosity CV.
origin | The origin of the variant, from the origin CV.
copy_number | The absolute (diploid or polyploid) copy number of the variant.
reference_copy_number | The absolute (diploid or polyploid) copy number of the reference or wild-type.
external_links | ** REQUIRED if the variant is an insertion of 'novel' sequence (i.e. does not map to the reference sequence). If data regarding this variant call has been deposited elsewhere, provide details. The required format is described in the Archive section of the Info-Links worksheet. Use a comma-delimited list for multiple entries.
loss_of_heterozygosity | Indicate if loss of heterozygosity occurs with this variant. Choose a term from the list: No, Yes.
clinical_significance | Choose a term from the clinical_significance CV.
phenotype | ** REQUIRED if clinical_significance = pathogenic or probably pathogenic. Provide a phenotype using one or more identifiers from a recognised vocabulary. The required format is described in the Phenotype section of the Info-Links worksheet. Use a comma-delimited list for multiple entries. Text descriptions are not allowed.
mode_of_inheritence | Choose a term from the mode_of_inheritence CV.
variant_call_length | An approximation of the length of the variant, if it cannot be determined from placement information.
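The validation field's experiment_id:result format can be parsed as in the sketch below. The function name parse_validation is hypothetical, and the set of accepted results (Pass, Fail, Inconclusive) is an assumption drawn only from the example in the notes; the template may permit other values.

```python
# Assumption: only the three results shown in the example are accepted.
VALIDATION_RESULTS = {"Pass", "Fail", "Inconclusive"}

def parse_validation(value, known_experiment_ids):
    """Parse e.g. '2:Pass,3:Fail' into [(2, 'Pass'), (3, 'Fail')].

    known_experiment_ids holds the experiment_id values defined on the
    EXPERIMENTS worksheet; any other id is rejected.
    """
    results = []
    for item in value.split(","):
        exp_id, sep, result = item.strip().partition(":")
        if not sep or not exp_id.isdigit():
            raise ValueError(f"expected experiment_id:result, got {item!r}")
        if int(exp_id) not in known_experiment_ids:
            raise ValueError(f"unknown experiment_id {exp_id}")
        if result not in VALIDATION_RESULTS:
            raise ValueError(f"unknown validation result {result!r}")
        results.append((int(exp_id), result))
    return results

print(parse_validation("2:Pass,3:Fail,4:Inconclusive", {1, 2, 3, 4}))
```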

Placement Information

Provide this in one of three formats, depending upon the level of information known about the location of the variant:
1) Base Pair resolution, as determined by capillary or next generation sequencing techniques for instance.
2) Min/Max interval, if the position of the variant is known relative only to two or more markers with known position, as determined by probe based techniques for instance.
3) Cytogenetic, if genomic co-ordinates are unavailable.

FIELD | NOTES
assembly | ** REQUIRED if known. The assembly to which the variant placement relates. This must be provided for placement information provided in Base Pair resolution or Min/Max interval format.

Base Pair resolution. Chromosome must be a number, X, Y, or MT (mitochondrial). If the chromosome is unknown, use a valid contig accession number instead. An optional confidence interval can be provided in parentheses (values are modifiers of the start or stop position, not absolute positions). The forward (+) strand will be assumed if not provided.
start | The start position of the variant in the format chromosome:position(-value,value)strand, e.g. 5:4000(-100,150)+
stop | The end position of the variant in the format chromosome:position(-value,value)strand, e.g. 5:6000(-90,0)+

Min/Max interval. Define the minimum interval in which the variant is likely to exist; this may represent the positions of the first and last affected markers for instance. Define the maximum interval in which the variant is likely to exist; this may represent the positions of the first and last unaffected markers for instance. You may define only the minimum interval, only the maximum interval, or both. Chromosome must be a number or MT (mitochondrial). If the chromosome is unknown, use a valid contig accession number instead. The forward (+) strand will be assumed if not provided.
inner_start | The start position of the minimum affected interval, in the format chromosome:positionstrand, e.g. 6:15000+
inner_stop | The end position of the minimum affected interval, in the format chromosome:positionstrand, e.g. 6:20000+
outer_start | The start position of the maximum affected interval, in the format chromosome:positionstrand, e.g. 6:10000+
outer_stop | The stop position of the maximum affected interval, in the format chromosome:positionstrand, e.g. 6:25000+

Cytogenetic. A taxonomy ID must also be provided.
cytoband | The cytoband to which the variant maps.
taxonomy_id | The NCBI taxonomy identifier of the sample in which the variant was called. To find a taxonomy ID, go to http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy, click the link for the species and look for "Taxonomy ID=".
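The placement string format chromosome:position(-value,value)strand can be read with a regular expression. The following is a rough sketch, not DGVa's own parser: the names PLACEMENT_RE and parse_placement are hypothetical, and the pattern accepts any colon-free token as the chromosome so that contig accessions also pass. The confidence interval is treated as offsets from the position, and a missing strand defaults to '+', as the notes describe.

```python
import re

PLACEMENT_RE = re.compile(
    r"^(?P<chrom>[^:\s]+):(?P<pos>\d+)"      # chromosome or contig, position
    r"(?:\((?P<lo>-?\d+),(?P<hi>\d+)\))?"    # optional (-value,value) offsets
    r"(?P<strand>[+-])?$"                    # optional strand
)

def parse_placement(text):
    """Parse e.g. '5:4000(-100,150)+' into its components."""
    m = PLACEMENT_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a valid placement: {text!r}")
    pos = int(m.group("pos"))
    lo = int(m.group("lo")) if m.group("lo") is not None else 0
    hi = int(m.group("hi")) if m.group("hi") is not None else 0
    return {
        "chromosome": m.group("chrom"),
        "position": pos,
        # Offsets modify the position; they are not absolute coordinates.
        "ci": (pos + lo, pos + hi),
        "strand": m.group("strand") or "+",
    }

print(parse_placement("5:4000(-100,150)+"))
```

For the example 5:4000(-100,150)+ the confidence interval resolves to positions 3900 to 4150 on chromosome 5, forward strand.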

Controlled vocabulary for VARIANT_CALLS fields:

variant_call_type: complex, copy number gain, copy number loss, deletion, duplication, indel, insertion, interchromosomal breakpoint, intrachromosomal breakpoint, inversion, mobile element insertion, novel sequence insertion, sequence alteration, tandem duplication, translocation

zygosity: Hemizygous, Heterozygous, Homozygous

origin: de novo, Germline, Maternal, Not tested, Paternal, Present in both parents, Somatic, Tested - inconclusive

loss_of_heterozygosity: No, Yes

clinical_significance: association, confers sensitivity, drug response, no known pathogenicity, not provided, other, pathogenic, probably not pathogenic, probably pathogenic, protective, risk factor, variant of unknown significance

mode_of_inheritence: Autosomal dominant, Autosomal recessive, Mitochondrial, Multifactorial, X-linked dominant, X-linked recessive, Y-linked

strand: +, -

VARIANT_REGIONS

A variant region is a region asserted to contain structural variation based upon the evidence of one or more variant calls. Regions can also be based upon one or more other regions.

FIELD | NOTES
variant_region_id | A unique identifier for the variant region.
assertion_method | The assertion method defines how the variant calls (or regions) have been combined to define this variant region. This could be a computer algorithm, in which case provide the name and parameters; otherwise provide details, e.g. manual observation - calls that overlap by at least 70% support this region.
linked_variant_call_id | ** REQUIRED if the region is based upon variant calls. A comma-delimited list of variant call identifiers (variant_call_id from the VARIANT_CALLS worksheet) used in the assertion method, and on which this region is based.
linked_variant_region_id | ** REQUIRED if the region is based upon variant regions. A comma-delimited list of variant region identifiers (variant_region_id from this worksheet) used in the assertion method, and on which this region is based.
variant_region_type | The type of variant region, from the variant_region_type CV. If the linked variant calls are of type 'Copy number gain/duplication' and/or 'Copy number loss/deletion', enter 'Copy number variation'. For any other variant call types, if the linked variant calls are all of the same type, enter the same type as the variant calls. If the linked variant calls are of more than one type, enter Complex.
description | Additional information about the region.

Placement Information

Provide this in one of three formats, depending upon the level of information known about the location of the region:
1) Base Pair resolution, as determined by capillary or next generation sequencing techniques for instance.
2) Min/Max interval, if the position of the region is known relative only to two or more markers with known position, as determined by probe based techniques for instance.
3) Cytogenetic, as determined by FISH for instance.

FIELD | NOTES
assembly | ** REQUIRED if known. The assembly to which the region placement relates. This must be provided for placement information provided in Base Pair resolution or Min/Max interval format.

Base Pair resolution. Chromosome must be a number, X, Y or MT (mitochondrial). If the chromosome is unknown, use a valid contig accession number instead. An optional confidence interval can be provided in parentheses (values are modifiers of the start or stop position, not absolute positions). The forward (+) strand will be assumed if not provided.
start | The start position of the region in the format chromosome:position(-value,value)strand, e.g. 5:4000(-100,150)+
stop | The end position of the region in the format chromosome:position(-value,value)strand, e.g. 5:6000(-90,0)+

Min/Max interval. Define the minimum interval in which the region is likely to exist; this may represent the positions of the first and last affected markers for instance. Define the maximum interval in which the region is likely to exist; this may represent the positions of the first and last unaffected markers for instance. You may define only the minimum interval, only the maximum interval, or both. Chromosome must be a number or MT (mitochondrial). If the chromosome is unknown, use a valid contig accession number instead. The forward (+) strand will be assumed if not provided.

inner_start | The start position of the minimum affected region, in the format chromosome:positionstrand, e.g. 6:15000+
inner_stop | The end position of the minimum affected region, in the format chromosome:positionstrand, e.g. 6:20000+
outer_start | The start position of the maximum affected region, in the format chromosome:positionstrand, e.g. 6:10000+
outer_stop | The stop position of the maximum affected region, in the format chromosome:positionstrand, e.g. 6:25000+

Cytogenetic. A taxonomy ID must also be provided.
cytoband | The cytoband to which the region maps.
taxonomy_id | The NCBI taxonomy identifier of the sample in which the region exists. To find a taxonomy ID, go to http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy, click the link for the species and look for "Taxonomy ID=".

Controlled vocabulary for VARIANT_REGIONS fields:

variant_region_type: complex, copy number variation, indel, insertion, interchromosomal breakpoint, intrachromosomal breakpoint, inversion, mobile element insertion, novel sequence insertion, sequence alteration, tandem duplication, translocation

strand: +, -
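The rule for choosing variant_region_type from the linked variant calls can be written as a short function. This is one possible reading, sketched under assumptions: region_type_for is a hypothetical name, and 'Copy number gain/duplication' and 'Copy number loss/deletion' are interpreted as the four variant_call_type terms copy number gain, duplication, copy number loss and deletion.

```python
# Assumed reading of "Copy number gain/duplication" and
# "Copy number loss/deletion" as four separate variant_call_type terms.
CNV_CALL_TYPES = {"copy number gain", "duplication",
                  "copy number loss", "deletion"}

def region_type_for(call_types):
    """Derive variant_region_type from the types of the linked calls."""
    types = {t.strip().lower() for t in call_types}
    if not types:
        raise ValueError("a region needs at least one linked variant call")
    if types <= CNV_CALL_TYPES:
        return "copy number variation"
    if len(types) == 1:
        return next(iter(types))  # all calls share one non-CNV type
    return "complex"

print(region_type_for(["deletion", "copy number gain"]))
print(region_type_for(["inversion", "inversion"]))
print(region_type_for(["inversion", "deletion"]))
```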

Info Links

Database | Short name | URL | Example input format (short name:accession number)

Archive
ArrayExpress | AE | http://www.ebi.ac.uk/arrayexpress/ | AE:A-MEXP-1162
Catalogue of Somatic Mutations in Cancer | COSMIC | http://www.sanger.ac.uk/perl/genetics/cgp/directive?id= | COSMIC:COST16850
Database of Genotypes and Phenotypes | dbgap | http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap | dbgap:phs000385
dbSNP (Short Genetic Variations) | dbsnp | http://www.ncbi.nlm.nih.gov/projects/snp/ | dbsnp:ss49857710
DNA Database of Japan | DDBJ | http://www.ddbj.nig.ac.jp/ | DDBJ:EF472314
dbVar | dbvar | http://www.ncbi.nlm.nih.gov/dbvar/ | dbvar:esv10000
European Genome-phenome Archive | EGA | http://www.ebi.ac.uk/ega/ | EGA:EGAS10101234567
European Nucleotide Archive | ENA | http://www.ebi.ac.uk/ena/ | ENA:SRX073510
GenBank | GENBANK | http://www.ncbi.nlm.nih.gov/nuccore | GENBANK:AC123456
Gene Expression Omnibus | GEO | http://www.ncbi.nlm.nih.gov/geo/ | GEO:GPL4010
Probe | PROBE | http://www.ncbi.nlm.nih.gov/genome/probe/ | PROBE:Pr153103
Sequence Read Archive | SRA | http://www.ncbi.nlm.nih.gov/sra | SRA:SRX073508
Trace Archive | TRACE | http://www.ncbi.nlm.nih.gov/traces/home/ | TRACE:FJ043059

Phenotype
ClinVar | ClinVar | http://www.ncbi.nlm.nih.gov/clinvar/ | ClinVar:RCV000010316
Human Phenotype Ontology | HPO | http://www.human-phenotype-ontology.org/index.php/hpo_home.html | HPO:0000572
GENE | GENE | http://www.ncbi.nlm.nih.gov/gene/ | GENE:CBX3
GeneReviews | GeneReviews | http://www.ncbi.nlm.nih.gov/books/ | GeneReviews:NBK50780
MedGen | MedGen | http://www.ncbi.nlm.nih.gov/medgen/ | MedGen:8083
Medical Subject Headings | MeSH | http://www.nlm.nih.gov/mesh/ | MeSH:D002658
Online Mendelian Inheritance in Man | OMIM | http://www.ncbi.nlm.nih.gov/omim | OMIM:105830
Systematized Nomenclature of Medicine | SNOMED | http://www.nlm.nih.gov/research/umls/snomed/ | SNOMED:277894008
Unified Medical Language System | UMLS | http://www.nlm.nih.gov/research/umls/ | UMLS:C0019202

Sample resource
BioSample Database | BioSD | http://www.ebi.ac.uk/biosamples/ | BioSD:SAM123456
Clone Registry | CLONE | http://www.ncbi.nlm.nih.gov/clone | CLONE:ABC8
Coriell Institute | CORIELL | http://ccr.coriell.org/ | CORIELL:ND22263
The Jackson Laboratory | JAX | http://jaxmice.jax.org/ | JAX:012453

Service
PubMed | PubMed | http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed | PubMed:10025899
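Fields that reference other resources take a comma-delimited list of short name:accession number entries. The sketch below shows one way to split and check such a value; the name parse_links is hypothetical, the short names are copied from the table above, and matching them case-insensitively is an assumption rather than a documented DGVa rule.

```python
# Short names copied from the Info Links table; matched case-insensitively
# as an assumption, because the exact casing DGVa expects is not certain.
KNOWN_SHORT_NAMES = {name.lower() for name in [
    "AE", "COSMIC", "dbgap", "dbsnp", "DDBJ", "dbvar", "EGA", "ENA",
    "GENBANK", "GEO", "PROBE", "SRA", "TRACE", "ClinVar", "HPO", "GENE",
    "GeneReviews", "MedGen", "MeSH", "OMIM", "SNOMED", "UMLS", "BioSD",
    "CLONE", "CORIELL", "JAX", "PubMed",
]}

def parse_links(value):
    """Split 'HPO:0000572,OMIM:105830' into (short name, accession) pairs."""
    links = []
    for item in value.split(","):
        short, sep, accession = item.strip().partition(":")
        if not sep or not accession:
            raise ValueError(f"expected short name:accession, got {item!r}")
        if short.lower() not in KNOWN_SHORT_NAMES:
            raise ValueError(f"unknown resource short name {short!r}")
        links.append((short, accession))
    return links

print(parse_links("HPO:0000572,OMIM:105830"))
```

Accessions are kept as strings so that leading zeros, as in HPO:0000572, are preserved.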