The FASTG Format Specification (v1.00)
|
|
|
- Nickolas Bell
- 10 years ago
- Views:
Transcription
1 The FASTG Format Specification (v1.00) An expressive representation for genome assemblies The FASTG Format Specification Working Group December 12, Introduction This document introduces a prototype format for faithfully representing genome assemblies in the face of allelic polymorphism and assembly uncertainty. It is called FASTG, like FASTA, but the G stands for graph. 1 As motivation for this new representation, consider the traditional linear representation of genome assemblies, which exhibits each scaffold as a FASTA record. Key advantages of this representation are that its linearity matches the linearity of the genome, and that it naturally provides coordinates, thus facilitating computation and display. It is also very easy to parse, easy for end- users to read and interpret, permits flexible annotations with custom fields, and perhaps most importantly has a universe of tools built around it. This traditional linear representation, however, has one deep flaw it has very limited capacity to express branching that can arise from allelic polymorphism, intrinsic limitations of the data, or even limitations of the algorithms used to assemble them. This inability to represent branching and retain intrinsic ambiguities has many negative consequences, including the forced introduction of errors (where instead uncertainty could be shown), and typically, the failure to express information about polymorphism that is contained in the raw data and typically available internal to assembly algorithms. In short, an honest assembly would tell you if it didn t know the answer, or if more than one answer was possible, and to do this a representation capable of branching would be required. This is not a new concept, and indeed the traditional assembly representation had two mechanisms for representing branching and uncertainty: (1) ambiguous base codes, according to the IUPAC scheme, and (2) base quality scores (via an ancillary quality score file, in one- to- one correspondence with the FASTA file). Both are limited in their expressive capability, and while (1) is exceptionally straightforward and easy to understand, the interpretation of (2) is less clear, particularly in regard to indels. Finding an appropriate replacement will involve a balance between expressive capability and complexity. A very general solution would be to represent assemblies as large, all- inclusive graphs that capture the information that the assembly algorithm has extracted from an inaccurate and redundant raw dataset. This solution, however, would lose the key advantages of the traditional linear representation, as the 1 FASTA originally stood for a fast algorithm that could align any letters (not just nucleotides or protein symbols), but came to represent the file format inputted to that algorithm [1], [2], 1
2 resulting all- inclusive graphs can be complex and lose many of the advantages in the traditional linear representation that we wish to retain. Here we propose instead a hybrid approach that preserves local linearity (and its associated benefits) where possible but also captures additional nonlinear features of the assembly. In essence, the FASTG format offers FASTA- plus. Any assembly in this format can be canonically converted into FASTA, albeit with loss of information, and this FASTA could have a coordinate- based markup that precisely describes the lost information. Such a canonical conversion to conventional FASTA format seems to be a prerequisite for any format to be widely adopted and generally useful. Graphical features that are needed may be roughly characterized as local or global. For example, if we know that a particular base is a SNP, then that s a local feature. The most common local features look like one of the following: and so we ve designed the format to handle these types of events easily. Importantly, the first three all preserve linearity relative to the genome by capturing variation in the form of a directed acyclic graph, whereas the fourth ( 4) allows cycles, but in highly controlled context. While many assemblies can be represented without global features at all, some are unavoidable. Consider for example the genome of the bacterium Rhodobacter sphaeroides. It has two circular chromosomes, containing three instances of a 5.2 kb perfect repeat R, all in the same orientation. Now supposing that one s assembly input data includes long links that can bridge R, it is in principle possible to complete resolve its genome, yielding two circular chromosomes, which would be represented by two fasta records, one per chromosome. On the other hand, suppose that one does not have such long links. Then the best one can do is to represent the assembly as a graph having four edges which would correspond to four fasta records. 2
3 Note that a perfect assembly would have essentially no local or global structure, and could be represented by one FASTG record per physical chromosome (thus with homologous chromosomes represented separately). Additional structure would only be needed to capture the difference between linear and circular chromosomes. In particular, in such a perfect assembly, repetitive sequences would appear multiple times in the assembly, in accordance with their actual copy number and local sequence variations. For some small genomes, such perfect assemblies could in principle be achieved with existing data, and even for large genomes, it makes sense to strive for assemblies that are as perfect as possible. Fully perfect assemblies of large genomes are unlikely to be produced in the near future, as they imply (a) complete phasing of all allelic variations, which requires a network of paired- end information at many length scales, or sequence from other family members, and (b) methods for disentangling repetitive sequences with complex nested or tandem structures that are inaccessible, even theoretically, with existing data types. Below we first present some intuitive examples, without defining all the terms in advance, and then dive deeper into the formal definition of the terms. We first note the following design principles: In this format, it is a rule that every construct can be converted into FASTA. Thus for example [1:alt A,T,AA] is an example of an ambiguity string representing either A or T or AA, and by definition the first alternative (in this case A) would be used in the canonical conversion to FASTA. The canonical representation of each FASTG construct is given just prior to the construct itself. In the above example [1:alt A,T,AA] would be preceded by A. (See next section for an example.) As a convenience, the length of this canonical sequence is encoded as the first field in the FASTG construct. We adapted these conventions so that FASTG could be converted to its canonical FASTA form using only a few lines of code. This format uses inline representation, in which constructs appear inline as insertions in a FASTA file. In an alternate version of the format, the constructs would appear instead in a coordinate- based markup that accompanies the FASTA as an ancillary file. This markup format is simply the FASTG constructs plus the base offset at which they occur and is described later. We anticipate that other versions of markup for FASTG may be needed but do not attempt to specify them here. The format includes directed graphs in which each edge represents a DNA sequence (as described in 6). A path through the graph is to be interpreted as corresponding to the concatenation of these sequences. We allow for reverse complements, so as to allow for the representation of complex inversion events. Here we have borrowed from the notion of connection, as in the SeQuence Graph (SQG) file format ( The format is hierarchical, thus allowing a distinction between global and local structure, and effectively sequestering graph structure, so as to present assemblies in as linear and FASTA- like a form as possible. At least three levels might be needed. For example, this would be needed to describe an assembly having global structure, and within that, haplotypes separated over a fairly long range, and within each haplotype branch, uncertainty about particular bases. 2. A toy example Consider the following simple representation of an assembly of two chromosomes that includes a gap, a SNP, and a tandem repeat embedded within otherwise known sequence: #FASTG:begin; #FASTG:version=1.0:assembly_name= tiny example ; >chr1:chr1; ACGANNNNN[5:gap:size=(5,4..6)]CAGGC[1:alt:allele C,G]TATACG >chr2; 3
4 ACATACGCATATATATATATATATATAT[20:tandem:size=(10,8..12) AT]TCAGGCA[1:alt A, T,TT]GGAC #FASTG:end; This is a complete FASTG file, with required begin and end lines, however which for brevity, we will drop from subsequent examples. In a FASTG file, all white space (blanks, tabs and newlines) is ignored (except in double- quoted literal strings), so we are free to insert it wherever we like to enhance readability. The example represents an assembly having two scaffolds named chr1 and chr2. Here are some of its features: The expression [5:gap:size=(5,4..6)] represents a gap of between 4 and 6 bases, with default 5. Since the default is 5, the preceding canonical sequence is NNNNN. The expression [1:alt:allele C,G] represents an ambiguity : it is either C or G. The canonical sequence is C, and so the size of the FASTG construct is 1. The allele property specifies that both C and G are present in the sample at the given locus, and in equal proportions. Typically this would be because two homologous chromosomes are represented by a single scaffold in the assembly. Alternatively, a single scaffold could collapse two highly similar regions that are not homologous. The expression [20:tandem:size=(10,8..12) AT] represents a run of between 8 and 12 ATs. Because 10 is listed first, it is regarded as the primary representation and therefore the canonical sequence is 10 ATs. In this case no information is provided as to the relative likelihood of the different run lengths, however, as is always the case for such ambiguous expressions, it is guaranteed that at least one of the runs appears in the sample at the given locus. Note that because of polymerase slippage, ambiguities of this type are very common, and indeed may correspond to (1) the primary error mode in input data, and (2) a common polymorphism in complex genomes. With existing whole- genome shotgun data types it is often impossible to distinguish between these two situations. The expression [1:alt A,T,TT] asserts that at least one of A, T or TT is present in the sample at the given locus. In this situation, the canonical sequence is defined to be A, and so the size of the FASTG construct is given as 1. Importantly, in this format there is always a fully specified conversion of an assembly into pure FASTA. Every expression comes equipped with a mechanism for picking a particular sequence wherever there is an option, which is typically the first in a list. Here is the mapping to FASTA for each of the above four expressions: [5:gap:size=(5,4..6)] NNNNN [1:alt:allele C,G] C [20:tandem:size=(10,8..12) AT] ATATATATATATATATATAT [1:alt A,T,TT] A These canonical sequences are always given just prior to the FASTG constructs. Simply stripping out the constructs gives the corresponding FASTA, with loss of information. In the example below the canonical sequence is in red, whilst the FASTG construct it represents is in bold. (color added for emphasis): >chr1:chr1; ACGANNNNN[5:gap:size=(5,4..6)]CAGGC[1:alt:allele C,G]TATACG >chr2; ACATACGCATATATATATATATATATAT[20:tandem:size=(10,8..12) AT]TCAGGCA[1:alt A, T,TT]GGAC 4
5 Becomes: >chr1:chr1; ACGANNNNNCAGGCTATACG >chr2; ACATACGCATATATATATATATATATATTCAGGCAGGAC There is one more piece of information encoded in the assembly via the FASTG header line. The notation chr1:chr1 declares that chr1 is adjacent to itself. In other words, it is a circular edge. On the other hand, chr2 does not has no adjacencies, indicating that it does not connect to anything else. In short, a graphical view of the assembly is: which exhibits one little bit of global structure, namely the circularity of chr1. 3. Representation of haplotypes Depending on the properties of a genome and data from it, it may be possible to separate haplotypes over a relatively long range. FASTG is set up to support this. For example, an alt expression can be used to describe a haplotype bubble in the case where the individual haplotypes are completely known, e.g. [30:alt CATGGCAACCTAGCA CCTATACGAGCTTAC, CAAGGCAACCAAGCAACCTATACGAGCTTAC] FASTG also allows for cases where the haplotypes themselves have features (such as uncertain bases or gaps), using the digraph expression, e.g. [30:digraph:path=(a),begin=(a,b),end=(a,b) >a; CATGGCAACCTAGCA CCTATACGA G[1:alt G,C] CTTAC >b; CAAGGCAACCAAGCAACCTATACGA G CTTAC ] This represents a very simple graph, having two parallel edges. The notation path=(a) specifies that the canonical path through the graph is given by the first edge. This example also illustrates the nested nature of FASTG, where it is possible to encode a FASTG construct within another FASTG construct. The entry point to the graph is given by the property begin, which contains a list of one or more edges. Similarly the property end gives the exit point of the graph, again as a list of edges. These provide the linking information between the edges in the subgraph and the surrounding sequence. For this simple case where there are only parallel edges, the begin and end properties may be omitted. 5
6 4. A stuffed gap Consider the following assembly, which has been generously padded with white space to enhance readability: >Z; ATAT NNNNNNNNNNNNNNN [13:gap:size=(13,10..35), begin=a, end=g >a:b,c,d; GGG >b:b,c,d; A >c:e,f; CA >d:g; CAC >e:e,f; T >f:g; AA >g; CCC ] GATGAT This assembly contains a single scaffold Z, having one gap. However it is not an ordinary gap: it is a stuffed gap, with a mini- assembly- graph inside the gap, as in the following picture where the edge names are given in red: The assembly claims that the true genome sequence for the gap is represented by some path through the boxed gap graph of length between 10 and 35. This is actually a very common situation. From an assembly, one can often exhibit a graph corresponding to what should be inside a gap, but either due to limitations of the data or other algorithmic reasons, be unable to resolve the exact sequence. Since we don t know the most likely path, the canonical sequence is give as a gap of 13 bases represented as 13 Ns. Such incomplete information can be exceptionally valuable. For example, we have seen cases where a disease- causing frameshift mutation is hiding in a (stuffed) gap. From the gap graph, the existence of the mutation can be demonstrated, but the data lack the power to fully resolve the exact sequence of the gap. Nevertheless, it can be identified and targeted for further experimental study. 6
7 5. Overview of the format Here we provide an informal overview of the format, deferring details to the appendices. The core unit of the format is a FASTA- like construction ( 14), that can exhibit a DNA sequence and its connection to other sequences that comprise a sequence digraph (see next section). For example: >x:y; ACGGACAT Here x represents both a DNA sequence and an edge in a graph. This edge is in turn followed by edge y i.e. there exists an adjacency from edge x to edge y. To facilitate the representation of inversions, the format also allows for adjacencies between forward and reverse complement (RC) of edges see the next section for more details. Next, the format allows for the direct embedding of an acyclic graph inside FASTA (for an example see 8), and for several convenient specializations of this, in particular ambiguities involving a list of alternatives, tandem repeats, and heterozygous inversions, and cases where the orientation of a sequence is unknown, e.g. if it was placed using a genetic map ( 15). Finally the format can represent gaps of prescribed distance ( 15), with error bars, and gaps that are stuffed with an unresolved graph (for an example see 4). 6. Semantics The construction of FASTG depends on a core object that we refer to as a sequence digraph. First we provide a mathematical description, then we explain how it is encoded in FASTG. The main challenge is with the encoding of reverse complements. Mathematical description of sequence digraph. A sequence digraph consists of a collection of DNA sequences, say e1,..., en, also called edges, and a collection of pairs xy, called adjacencies, where x and y are either edges or their reverse complements, denoted by the addition of a prime character. Very explicitly, this is: adjacency interpretation eiej follow ei then follow ej eiej follow ei then follow the reverse complement of ej ei ej follow the reverse complement of ei then follow ej ei ej follow the reverse complement of ei then follow the reverse complement of ej. We note that in a sequence digraph, ej ei is not treated in the same way as eiej, as we will explain shortly. Note also that strictly speaking, a sequence digraph is not a graph at all, as we have not specified a notion of vertex. However in many cases one can without ambiguity define vertices and thereby associate a bona fide digraph, and we do so frequently in this document to illustrate concepts. Now given a sequence digraph having edges e1,..., en, a path is a word in the alphabet {e1,..., en, e1,..., en } such that each pair of successive symbols are adjacencies. To every path we associate a DNA sequence by concatenating the associated edges, or their reverse complements, if so specified. To circular paths we associate circular DNA sequences. FASTG encoding of sequence digraph. The header line for a FASTG record encodes an edge, together with a set of adjacencies, and properties, as follows: >Edge:Neighbors:Properties; 7
8 - Edge is the name given to this edge/sequence. - Neighbors is a list of edges (or their reverse complements, indicated by a ) that follow this edge or the reverse complement of this edge (indicated by a preceding ~). - Properties is a list of optional properties associated with this edge (discussed later in this document). The empty Neighbor list (given by ::) is valid, but is only necessary if the Property field is populated. Some examples: >Edge1; specifies Edge1 but no adjacencies or properties. >Edge1:Edge2,Edge3; Edge1 is followed by Edge2 and Edge3. >Edge1:Edge2,Edge3 ; Edge1 is followed by Edge2 and the RC of Edge3. >Edge1:Edge2,Edge3 :prop1=value1; As above, but with property prop1 defined. >Edge1:Edge2,Edge3,~Edge4:prop1=value1; As above, but the RC of edge1 is followed by Edge4. Properties may be associated with each adjacency using the following notation: >Edge1:Edge2[prop1=value1],Edge3; The adjacency between Edge1 and Edge2 has property prop1 defined. 7. Nested graph features The FASTG format supports global and local graph features. Further nesting of local graph features is allowed only in the case of the [digraph: ] construct see the example in Section 3. Only one additional level is allowed nesting cannot continue indefinitely. Nesting in this fashion is not allowed for any other construct. In summary: digraph may contain alt, tandem, digraph, gap, or stuffed_gap. digraph may be nested only once. alt, tandem, gap, stuffed_gap cannot contain additional nested graph features. 8. Examples where reverse complementation is unavoidable Here we illustrate situations where reverse complement edges (see 6) are needed. Consider first the case of an allelic heterozygous inversion. In fact FASTG provides directly for this, inside a single FASTG record, using the [digraph:bioriented...] notation, and this is the preferred usage because it is better to stay within a single record, which has a single coordinate system. However, to illustrate the digraph notation, we exhibit a direct encoding. Suppose we wish to allow for A{I or I }B, which we might describe informally using the following picture, in which the four red edges represent adjacencies: Then we have three edges A, I and B and four adjacencies AI, AI, IB, I B. In FASTG we would write: >A:I,I ; (Sequence of A) encodes AI and AI 8
9 >I:B,~B; (Sequence of I) >B; (Sequence of B) encodes IB and I B encodes no adjacencies Now consider an alternate and looser semantics for an inversion event, in which we know less. Specifically, we could add the reverse complement of all adjacencies, thus allowing not just for the standard paths AIB and AI B, but also for paths like AIA. To do so we would write: >A:I,I ; (Sequence of A) >I:A,B,~A,~B; (Sequence of I) >B:~I,~I ; (Sequence of B) encodes AI and AI encodes IA, IB, I A and I B encodes B I and B I We give one more example. Consider, as in 1, a genome having two circular chromosomes, and three copies of a long repeat R, one on one chromosome and two on the other. However now suppose that the latter two copies are in inverse orientation. Let s use the following shorthand for the genome: chr1: -- A -- R -- chr2: -- B -- R -- C -- R' -- where R denotes the reverse complement of R. Now if we have a data set for this genome that includes links long enough to bridge R, then there is no problem: the genome may (in principle) be fully resolved as two circles. But suppose we do not have these long links. Then it is impossible to fully resolve the genome structure from the data: there is more than one solution. To encode this in FASTG, all we have to do is read off the adjacencies from chr1 and chr2, allowing for circularity: chr1: AR, RA chr2: BR, RC, CR, R B and then add in the reverse complements of these adjacencies: chr1: R A, A R chr2: R B, C R, RC, B R. This is very much in the spirit of the loose interpretation of a heterozygous inversion described above, and informally illustrated as follows, with red arrows representing the twelve adjacencies: It is easy to encode this assembly in FASTG, as all we have to do is copy the twelve adjacencies into the notation: 9
10 >A:R,~R ; (sequence of A) >B:R,~R; (sequence of B) >C:R,~R ; (sequence of C) >R:A,C,C,~A,~B,~B ; (sequence of R) encodes AR and A R encodes BR and B R encodes CR and C R encodes RA, RC, RC, R A, R B and R B. To illustrate how this works, here are the paths through the graph that reconstruct the true genome: chr1: A R A chr2: B R C R' B There are other paths through the graph that reconstruct fake genomes. The data do not allow us to distinguish between the true genomes and these fakes. 9. More toy examples of assemblies (a) This example exhibits an acyclic graph, embedded in an assembly. >xxx; GTAAAAACTAC ATATATGTTTTT [12:digraph:path=(a,b1,c), start=a, end=c >a:b1,b2; ATATAT >b1:c; G >b2:c; C >c; TTTTT ] ACACACAC This embedded acyclic graph can be visualized as follows: There are two paths through the graph: a b1 c, corresponding to the sequence ATATATGTTTTT, and a b2 c, corresponding to the sequence ATATATCTTTTT. The semantics guarantee that one or more of these sequences is present in the sample (at the given locus). The notation a,b1,c declares a canonical path through the graph, and from this one deduces the canonical FASTA representation for the assembly >xxx; 10
11 GTAAAAACTAC ATATAT G TTTTT ACACACAC (b) Example (a) has the following equivalent and much simpler representation: >xxx; GTAAAAACTAC ATATAT G[1:alt G,C] TTTTT ACACACAC 10. Copy number statistics Here we describe a generate framework for quantifying information about copy number, and provide certain simple abbreviations that encapsulate the most common cases. We expect that in the vast majority of situations, only these common cases would be used. We provide three vehicles for representing copy number information. First, as described earlier, the tandem notation provides for specification of the number of copies of a tandem repeat. Second, we allow for representation of external copy number. This provides information about the relative abundance of the connected components of the top- level graph in a FASTG file. It is encoded via the property cn_external, that may be assigned to any top- level FASTG record (but only consistently: records in the same component may not be assigned different values). The values for cn_external are to be interpreted relative to each other. For example, suppose an assembly consists of mitochondrial and nuclear genomes, present as single FASTG records and of molarity 10:1 in the sample. Then this information might be represented as >mitochondrial_genome::cn_external=10;... >nuclear_genome::cn_external=1;... We leave specification of error bars for cn_external to a future version of FASTG, should they prove needed. Third, suppose we have alternative sequences x, y, for example as in [1:alt A,T]. We may be able to express information about their likelihood of presence in the sample and their relative abundance. This could be encoded in a probability density function (pdf) over the diagonal segment connecting the points (0,1) and (1,0), 11
12 with each point on the segment representing a hypothetical distribution of copy number, summing to one. In this version of FASTG we provide for the specification of the pdf in the case where it is discrete, i.e. concentrated at a finite set of points, and in the continuous case where there is no information, i.e. equal probability of all cases. We leave open the possibility that future versions of FASTG would allow specification of other continuous pdfs, as needed. Such discrete pdfs may be expressed via a property cn_discrete_pdf (where cn is short for copy number), as the following examples illustrate: (a) semantics: we are certain that two alleles are present in equal proportions pdf: concentrated at the single point (0.5,0.5) long form: cn_discrete_pdf = ( ((0.5,0.5),1) ) short form: allele example: [1:alt:allele A,T] (b) semantics: we are certain that only one allele is present, and assign equal probabilities to both pdf: concentrated equally at two points (1,0) and (0,1) long form: cn_discrete_pdf = ( ((1,0),0.5), ((0,1),0.5) ) short form: exclusive example: [1:alt:exclusive A,T] (c) semantics: we are certain that only one allele is present, and assign probability 0.99 to the first pdf: concentrated at the two points (1,0) and (0,1), with 99% of the mass on the first point long form: cn_discrete_pdf = ( ((1,0),0.99), ((0,1),0.01) ) short form: exclusive=(0.99,0.01) example: [1:alt:exclusive=(0.99,0.01) A,T] (d) semantics: we have no information about the relative presence of the two alleles pdf: uniform along the line segment long form: none short form: (default) example: [1:alt A,T] These concepts and notations generalize readily to the case of three or more alternatives. Note that to avoid limitations of floating point arithmetic, we allow for scaling of numbers in the discrete pdf representations, for example 12
13 cn_discrete_pdf = ( ((1,1,1),1) ) corresponding to three alleles present in equal proportions, understanding that this corresponds to the pdf being concentrated at the single point (1/3,1/3,1/3). The same representation of copy number for alternatives is allowed for the digraph notation, provided that represents a collection of parallel edges (like alt does). 11. Assembly assessment (a) Investigators will want to report the properties of assemblies and we note that replacement of the traditional format by a new format will entail changes to that reporting. While falling outside the format per se, this has some bearing on the language used by it. We also note the potential value in introducing a standard for assessment. Since a mechanism for canonical conversion to conventional FASTA is provided, in the short term reporting can continue to use current methods applied to the canonical FASTA, although this does not take full advantage of the new format. (b) Assemblers should be penalized for all ambiguity that is not present in the genome. However, it is clearly better to report an ambiguity than to make an error. It is not clear exactly how ambiguity should be counted. (c) The terms contig and scaffold need to be clarified, so that (for example) one could report the N50 contig or scaffold size. For example, we would argue that stuffed gaps should be between contigs (rather than within them), for the following reasons: Inclusion within contigs could result in huge contig sizes, causing the community to think that assembly is a solved problem. It makes sense to incentivize the development of data and algorithms that will lead to the full resolution of gaps. At the same time, it makes sense to also incentivize the stuffing of gaps, so this should be reflected somehow in assembly assessment statistics. (d) In the case of a controlled experiment, where a reference sequence for the genome is available, we note that the assembly is regarded as correct if it traverses a valid path, i.e. through at least one alternative (in the case of ambiguities), and similarly for the case of stuffed gaps. 12. Miscellaneous issues While users may want to see the evidence for an assembly, typically exhibited by read alignments, we recommend that these data be presented separately from the assembly per se. One reason for this is that the evidence data could easily be 100 times larger than the assembly. However the format might specify the mechanism for presentation of evidence. For example Binary Alignment/Map (BAM) might be used, but there would need to be a naming convention to coordinate between assembly and evidence files. Also the BAM format might need to be adopted to insure that it is robust enough to handle reads that span across branch boundaries. 13. Appendix: character set and names Here we begin the detailed formal description of the assembly notation. First, the alphabet for the format consists of : 13
14 a-za-z0-9_.,[] =>:;-~()# and white space (defined here to be blanks, newlines and tabs), which ignored, except between matched pairs of double quotes (thus allowing for comments and other literal strings). White space may be inserted to enhance readability. Here are a few definitions: name_string := {a-za-z0-9_}+ value_string := {a-za-z0-9_()., }+ subject to: - For any comma in the string, the number of left and right parentheses to the left of it may not be equal. This restriction is needed to allow unambiguous parsing of FASTG. - The double quote character may only be used in a literal value_string, having the form..., where the intervening text is a string of length zero or more from the value_string character set, but without double quotes. Note in particular that blanks are allowed in such a literal string, and treated as part of the value, whereas white space is ignored everywhere else in FASTG. Note also that the quotes themselves are not treated as part of the value, so that x=1 and x= 1 have the same semantics, but that x= Homo sapiens has no quote- free equivalent. property_string := a comma- separated list consisting one or more strings of the form - name=value, where name is a name_string and value is a value_string or - name, where name is a name_string, which is short for the flag name=1. The FASTG format defines a small number of properties. Moreover, we allow the arbitrary introduction of user- defined properties, with the intention that these be gradually and selectively absorbed into the format if they turn out to have general utility. Each FASTG file starts with the following special text, with the appropriate version number substituted in: #FASTG:begin; #FASTG:version=*.*; #FASTG:properties; where properties is a property_string, defining global properties of the FASTG file. It is possible to combine these into a single line: #FASTG:begin:version=*.*:properties; The list of properties can be split over multiple lines, each starting with #FASTG, but individual properties shouldn t be split: #FASTG:properties; 14
15 #FASTG:more_properties; Each FASTG file ends with the following special text: #FASTG:end; On #FASTG lines spaces are ignored (as elsewhere, except in comments and quoted strings, as described below). The semicolon on the #FASTG:end; line is regarded as the terminating character for the FASTG file. The following global properties are defined in this version of FASTG: organism assembly_program_name assembly_program_version assembly_run_date assembly_run_number assembly_name comment All text occurring between a # and a newline is treated as a comment and ignored unless it begins with #FASTG. Every sequence in the format is assigned a name. These are relative, so if name x2 appears within something named x1, then its absolute designation would be x1:x Appendix: number notation nonnegative_integer_string := 0 {1-9}{0-9}* integer_string := 0 {1-9}{0-9}* -{1-9}{0-9}* nonnegative_integer_range_string := n or m..n - m, and n are nonnegative_integer_strings, and m < n integer_range_string := n or m..n - m, and n are integer_strings, and m < n abbreviated_nonnegative_integer_list_string := x1,...,xn - each xi is a nonnegative_integer_range_string. Note that duplicates in the list are to be ignored. Example: 15, is a way of listing the integers from 10 to 20, with 15 first. abbreviated_integer_list_string := x1,...,xn 15
16 each xi is a integer_range_string. Duplicates in the list are to be ignored. positive_float_string := {1-9}{0-9}+.{0-9}+ 0.{0-9}+ {1-9}{0-9}* float_string := 0 or x or -x where x is a positive_float_string. probability_string := a float_string p, with 0 < p Appendix: notation for bases and sequence digraphs base_string := {ACGT}* We deliberately exclude N from the base_string definition, because historically, N has been overloaded : it has been used to define gaps (as part of a run), ambiguous bases (meaning A or C or G or T), and also to mark other types of uncertainty, including possible indel errors in an assembly. We prefer to require that assemblers precisely specify their intended meaning, which should be possible using the [gap...] and [alt...] constructs of this format. Note that empty base_strings are allowed. Note also that IUPAC ambiguous bases and lower- case bases are not allowed. canonical_base_string := {ACGTN}* We allow N in the canonical_base_string, to support the encoding of gaps. These strings can only occur if they precede a bracket expressions (embedded_diagraph_string, ambiguity_string, tandem_string, gap_object_string, and stuffed_gap_object_string), they cannot occur alone. Note that empty canonical_base_strings are allowed. Note also that IUPAC ambiguous bases and lower- case bases are not allowed. fasta_string := >edge:edge 1,,edgen:properties;bases or >edge: edge 1,, edgen;bases or >edge;bases - edge, edgei are name_strings - - bases is a base_string properties is a property_string. Currently the only recognized property in this context is cn_external ( 8). It is conventional (but not required) to have a newline after the semicolon. The semicolons are needed to allow for consistent handling of white space. digraph_string := b1,...,bn - each bi is a fasta_string, or a fasta_string followed by a to indicate the RC. The order of the bi does not matter. 16
17 The semantics are as follows. A path through a digraph of this type hops along a sequence of edges, corresponding to either fasta_strings, referred to here as sequence edges. Normally all edges must be read forward, unless indicated by a following which indicates the RC of an edge. Only paths that follow valid adjacencies between the edge are allowed. 16. Appendix: basic FASTA strings A basic_fasta_string is defined to be either a base_string, or any of the other strings defined in this section. In this section we define the five bracket expressions plus their canonical representation, which all have the form canonical_base_string[size:type <:properties>...] where size is the length in bases of the preceding canonical FASTA sequence, type is one of digraph, alt, tandem or gap. The properties may always include name=name, where name is a name_string, thereby assigning a name to the bracket expression. For purposes of FASTG processing, if a bracket expression is not explicitly assigned a name, it will be assigned a unique name by the processor. embedded_digraph_string := [size:digraph <:properties> g] - size is a nonnegative_integer_string giving the size of preceding canonical FASTA sequence. - properties is a property_string; apart from name and cn_discrete_pdf ( 8), only the following are assigned semantics at present: unoriented and bioriented, and presence of both is not allowed; also we require path=(s1,...,sk) where each si is a name_string, representing edge names in g, defining a canonical path through the graph - two properties, begin and end, are used to indicate the entry and exit edges to and from the graph respectively. They may be omitted only if the graph is simply a set of parallel alternative edges. - g is a super_digraph_string, representing a digraph. Notes: This definition is recursive: we do not define super_digraph_strings until the next section. Cycles are not allowed within an embedded digraph, as it would allow for unbounded cycling. Note that it is acceptable to include cycles within a stuffed gap (defined below), since in that context the gap size defines a limit on the number of times the cycle would be traversed. If the unoriented flag is present, then the construct represents either the given sequence, or its reverse complement, implying that at least one is present at the given locus. Such data arise naturally from genetic mapping, and also from long inverted repeats that are not bridged by read pairs. Note that in such situations it is probably not possible to rule out the possibility that both the sequence and its reverse complement are present (cf. below). If the bioriented flag is present, then both the sequence and its reverse complement are present, as would be the case for an allelic heterozygous inversion. ambiguity_string := [size:alt <:properties> x1,...,xn] 17
18 - - - size is a nonnegative_integer_string giving the size of preceding canonical FASTA sequence. properties is a property_string each xi is either a base_string or else of the form yi:properties where yi is a base_string and properties is a property_string. Semantics: it is guaranteed that at least one of the xi appears at the given locus. Note that it matters which xi is first, because that xi is inserted in the canonical FASTA representation. Otherwise, order doesn t matter. Note that an ambiguity_string is a special case of an embedded_digraph_string. tandem_string := [size:tandem <:properties> bases] - size is a nonnegative_integer_string giving the size of preceding canonical FASTA sequence. - name is a name_string - properties is a property_string, with a single recognized and required property size=(l) where l is an abbreviated_nonnegative_integer_list_string - bases is a nonempty base_string. Note that a tandem_string is a special case of an ambiguity_string. The same comments about order apply. gap_object_string := [size:gap <:properties>] - size is a nonnegative_integer_string giving the size of preceding canonical FASTA sequence. - properties is a property_string, with (apart from name) a single recognized property size=(l) which is required, where l is an abbreviated_nonnegative_integer_list_string. The canonical FASTA representation consists of max(1, l1) Ns. stuffed_gap_object_string := [size:gap <:properties> g] - size is a nonnegative_integer_string giving the size of preceding canonical FASTA sequence. - properties and l are as in a gap_object_string; the properties may include path=(s1,...,sk) where each si is a name_string, representing edge names in g, defining a canonical path through the graph - g is a digraph_string - two properties, begin and end, are used to indicate the entry and exit edges to and from the graph respectively. They may be omitted only if the graph is simply a set of parallel alternative edges. - if path is unspecified, the canonical FASTA representation is as in a gap_object_string. Note that here we allow the digraph to be cyclic, however it is bounded by the constraint defined by l. 18
19 Semantics: it is guaranteed that at least one path from a begin edge to an end edge (and within the given size bounds) is present in the sample at the given locus. 17. Appendix: assemblies super_fasta_string := a fasta_string, except that instead of bases being a base_string, it is a concatenation of one or more basic_fasta_strings. super_digraph_string := a concatenation of zero or more super_fasta_strings By definition, an assembly is a super_digraph_string. 18. Appendix: required software Certain software would need to be supplied with FASTG: (1) Validator. This would determine if a given FASTG file is valid. (2) Flattener. This would take as input a FASTG record (or a snippet from such a record, such as a bracket expression), and produce as output a file providing the canonical FASTA for the FASTG optionally, a file of markups of the FASTA Both of these programs would need to be highly portable, to trivialize installation. For example, both could be self- contained perl scripts. Both programs would be versioned in accordance with the version of FASTG, together with minor versions corresponding to improvements within a given FASTG version. 19. Appendix: FASTG to FASTA conversion The FASTG format has been designed to make the conversion to plain FASTA as simple and easy as possible. Because the canonical FASTA sequence is given just prior to the FASTG construct that describes it in more detail, it is possible to convert a FASTG file simply by stripping out the FASTG constructs. These constructs are given between square brackets []. For example: >MyFirstContig; GTAAAAACTACATATAT G[1:alt G,C] TTTTTACACACAC Becomes >MyFirstContig; GTAAAAACTACATATAT G TTTTTACACACAC Where the spaces have been added for clarity, the canonical sequence is in red and the FASTG construct it is derived from is in bold. 19
20 This has the added advantage that the co- ordinate system between the source FASTG and the derived FASTA are consistent. 20. Appendix: FASTG markup FASTG is its own markup. After FASTG to FASTA conversion, it would be useful to retain the FASTG constructs in a separate file that could be cross referenced with the FASTA. Rather than define yet another markup format, the FASTG constructs are reused, but with the addition of a base offset. For the example from Section 18, the markup would simply be: >MyFirstContig; 17 [1:alt GC] Where the additional base offset is indicated in bold. The combination of this offset and the canonical sequence size (the 2 nd field) completely identifies the location of the FASTG feature in the FASTA file. It is therefore also possible to take a FASTA file, plus the FASTG markup and regenerate the original FASTG file. Since the fasta_string is reproduced exactly in the markup, the global graph structure and properties are also retained. The advantages of this markup system are its simplicity and easy of construction. A very basic tool could take a FASTG file and generate markup and FASTA, and vice versa. However, it is recognize that in some situations this markup format will be insufficient. In particular, the recursive nature of the FASTG format remains locked up in the markup, and has not been unrolled. It is imagined that other markup schemes will be proposed in the future to address this and other shortcomings, in addition to the development of tools to covert FASTG markup to existing annotation schemes. 21. References [1] Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 85 (1988), [2] Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183 (1990), David B. Jaffe, Iain MacCallum, Daniel S. Rokhsar, Michael C. Schatz 20
MORPHEUS. http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix.
MORPHEUS http://biodev.cea.fr/morpheus/ Prediction of Transcription Factors Binding Sites based on Position Weight Matrix. Reference: MORPHEUS, a Webtool for Transcripton Factor Binding Analysis Using
VISUAL GUIDE to. RX Scripting. for Roulette Xtreme - System Designer 2.0
VISUAL GUIDE to RX Scripting for Roulette Xtreme - System Designer 2.0 UX Software - 2009 TABLE OF CONTENTS INTRODUCTION... ii What is this book about?... iii How to use this book... iii Time to start...
Cost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
The Trip Scheduling Problem
The Trip Scheduling Problem Claudia Archetti Department of Quantitative Methods, University of Brescia Contrada Santa Chiara 50, 25122 Brescia, Italy Martin Savelsbergh School of Industrial and Systems
Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test
Math Review for the Quantitative Reasoning Measure of the GRE revised General Test www.ets.org Overview This Math Review will familiarize you with the mathematical skills and concepts that are important
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay
Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay Lecture - 17 Shannon-Fano-Elias Coding and Introduction to Arithmetic Coding
Genetic Algorithms commonly used selection, replacement, and variation operators Fernando Lobo University of Algarve
Genetic Algorithms commonly used selection, replacement, and variation operators Fernando Lobo University of Algarve Outline Selection methods Replacement methods Variation operators Selection Methods
Reading 13 : Finite State Automata and Regular Expressions
CS/Math 24: Introduction to Discrete Mathematics Fall 25 Reading 3 : Finite State Automata and Regular Expressions Instructors: Beck Hasti, Gautam Prakriya In this reading we study a mathematical model
Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson
Mathematics for Computer Science/Software Engineering Notes for the course MSM1F3 Dr. R. A. Wilson October 1996 Chapter 1 Logic Lecture no. 1. We introduce the concept of a proposition, which is a statement
JavaScript: Introduction to Scripting. 2008 Pearson Education, Inc. All rights reserved.
1 6 JavaScript: Introduction to Scripting 2 Comment is free, but facts are sacred. C. P. Scott The creditor hath a better memory than the debtor. James Howell When faced with a decision, I always ask,
2) Write in detail the issues in the design of code generator.
COMPUTER SCIENCE AND ENGINEERING VI SEM CSE Principles of Compiler Design Unit-IV Question and answers UNIT IV CODE GENERATION 9 Issues in the design of code generator The target machine Runtime Storage
Web Development. Owen Sacco. ICS2205/ICS2230 Web Intelligence
Web Development Owen Sacco ICS2205/ICS2230 Web Intelligence Introduction Client-Side scripting involves using programming technologies to build web pages and applications that are run on the client (i.e.
RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
Moving from CS 61A Scheme to CS 61B Java
Moving from CS 61A Scheme to CS 61B Java Introduction Java is an object-oriented language. This document describes some of the differences between object-oriented programming in Scheme (which we hope you
Formal Languages and Automata Theory - Regular Expressions and Finite Automata -
Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March
Modeling Guidelines Manual
Modeling Guidelines Manual [Insert company name here] July 2014 Author: John Doe [email protected] Page 1 of 22 Table of Contents 1. Introduction... 3 2. Business Process Management (BPM)... 4 2.1.
Principles of Dat Da a t Mining Pham Tho Hoan [email protected] [email protected]. n
Principles of Data Mining Pham Tho Hoan [email protected] References [1] David Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT press, 2002 [2] Jiawei Han and Micheline Kamber,
Regular Expressions and Automata using Haskell
Regular Expressions and Automata using Haskell Simon Thompson Computing Laboratory University of Kent at Canterbury January 2000 Contents 1 Introduction 2 2 Regular Expressions 2 3 Matching regular expressions
What are the place values to the left of the decimal point and their associated powers of ten?
The verbal answers to all of the following questions should be memorized before completion of algebra. Answers that are not memorized will hinder your ability to succeed in geometry and algebra. (Everything
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation
WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT?
WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? introduction Many students seem to have trouble with the notion of a mathematical proof. People that come to a course like Math 216, who certainly
Dynamic Programming. Lecture 11. 11.1 Overview. 11.2 Introduction
Lecture 11 Dynamic Programming 11.1 Overview Dynamic Programming is a powerful technique that allows one to solve many different types of problems in time O(n 2 ) or O(n 3 ) for which a naive approach
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
DTD Tutorial. About the tutorial. Tutorial
About the tutorial Tutorial Simply Easy Learning 2 About the tutorial DTD Tutorial XML Document Type Declaration commonly known as DTD is a way to describe precisely the XML language. DTDs check the validity
MiSeq: Imaging and Base Calling
MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please
Bioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 11 Block Cipher Standards (DES) (Refer Slide
Regular Languages and Finite Automata
Regular Languages and Finite Automata 1 Introduction Hing Leung Department of Computer Science New Mexico State University Sep 16, 2010 In 1943, McCulloch and Pitts [4] published a pioneering work on a
Next Generation Sequencing: Technology, Mapping, and Analysis
Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University [email protected] http://tandem.bu.edu/ The Human Genome Project took
Zachary Monaco Georgia College Olympic Coloring: Go For The Gold
Zachary Monaco Georgia College Olympic Coloring: Go For The Gold Coloring the vertices or edges of a graph leads to a variety of interesting applications in graph theory These applications include various
Vilnius University. Faculty of Mathematics and Informatics. Gintautas Bareikis
Vilnius University Faculty of Mathematics and Informatics Gintautas Bareikis CONTENT Chapter 1. SIMPLE AND COMPOUND INTEREST 1.1 Simple interest......................................................................
COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing
COMP 356 Programming Language Structures Notes for Chapter 4 of Concepts of Programming Languages Scanning and Parsing The scanner (or lexical analyzer) of a compiler processes the source program, recognizing
Why? A central concept in Computer Science. Algorithms are ubiquitous.
Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online
Using simulation to calculate the NPV of a project
Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial
Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER
Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER JMP Genomics Step-by-Step Guide to Bi-Parental Linkage Mapping Introduction JMP Genomics offers several tools for the creation of linkage maps
Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur
Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Karagpur Lecture No. #06 Cryptanalysis of Classical Ciphers (Refer
GenBank, Entrez, & FASTA
GenBank, Entrez, & FASTA Nucleotide Sequence Databases First generation GenBank is a representative example started as sort of a museum to preserve knowledge of a sequence from first discovery great repositories,
Streaming Lossless Data Compression Algorithm (SLDC)
Standard ECMA-321 June 2001 Standardizing Information and Communication Systems Streaming Lossless Data Compression Algorithm (SLDC) Phone: +41 22 849.60.00 - Fax: +41 22 849.60.01 - URL: http://www.ecma.ch
Summation Algebra. x i
2 Summation Algebra In the next 3 chapters, we deal with the very basic results in summation algebra, descriptive statistics, and matrix algebra that are prerequisites for the study of SEM theory. You
Levent EREN [email protected] A-306 Office Phone:488-9882 INTRODUCTION TO DIGITAL LOGIC
Levent EREN [email protected] A-306 Office Phone:488-9882 1 Number Systems Representation Positive radix, positional number systems A number with radix r is represented by a string of digits: A n
3. Mathematical Induction
3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)
Notes on Factoring. MA 206 Kurt Bryan
The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor
A Proposal for OpenEXR Color Management
A Proposal for OpenEXR Color Management Florian Kainz, Industrial Light & Magic Revision 5, 08/05/2004 Abstract We propose a practical color management scheme for the OpenEXR image file format as used
You know from calculus that functions play a fundamental role in mathematics.
CHPTER 12 Functions You know from calculus that functions play a fundamental role in mathematics. You likely view a function as a kind of formula that describes a relationship between two (or more) quantities.
Random vs. Structure-Based Testing of Answer-Set Programs: An Experimental Comparison
Random vs. Structure-Based Testing of Answer-Set Programs: An Experimental Comparison Tomi Janhunen 1, Ilkka Niemelä 1, Johannes Oetsch 2, Jörg Pührer 2, and Hans Tompits 2 1 Aalto University, Department
Chapter 7. Functions and onto. 7.1 Functions
Chapter 7 Functions and onto This chapter covers functions, including function composition and what it means for a function to be onto. In the process, we ll see what happens when two dissimilar quantifiers
Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol Part 1: Extension of robots.
Communicating access and usage policies to crawlers using extensions to the Robots Exclusion Protocol Part 1: Extension of robots.txt file format A component of the ACAP Technical Framework Implementation
Graph Theory Problems and Solutions
raph Theory Problems and Solutions Tom Davis [email protected] http://www.geometer.org/mathcircles November, 005 Problems. Prove that the sum of the degrees of the vertices of any finite graph is
6 EXTENDING ALGEBRA. 6.0 Introduction. 6.1 The cubic equation. Objectives
6 EXTENDING ALGEBRA Chapter 6 Extending Algebra Objectives After studying this chapter you should understand techniques whereby equations of cubic degree and higher can be solved; be able to factorise
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s
Pairwise Sequence Alignment
Pairwise Sequence Alignment [email protected] SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What
11 Ideals. 11.1 Revisiting Z
11 Ideals The presentation here is somewhat different than the text. In particular, the sections do not match up. We have seen issues with the failure of unique factorization already, e.g., Z[ 5] = O Q(
Lecture 17 : Equivalence and Order Relations DRAFT
CS/Math 240: Introduction to Discrete Mathematics 3/31/2011 Lecture 17 : Equivalence and Order Relations Instructor: Dieter van Melkebeek Scribe: Dalibor Zelený DRAFT Last lecture we introduced the notion
KITES TECHNOLOGY COURSE MODULE (C, C++, DS)
KITES TECHNOLOGY 360 Degree Solution www.kitestechnology.com/academy.php [email protected] [email protected] Contact: - 8961334776 9433759247 9830639522.NET JAVA WEB DESIGN PHP SQL, PL/SQL
Applies to Version 6 Release 5 X12.6 Application Control Structure
Applies to Version 6 Release 5 X12.6 Application Control Structure ASC X12C/2012-xx Copyright 2012, Data Interchange Standards Association on behalf of ASC X12. Format 2012 Washington Publishing Company.
A simple criterion on degree sequences of graphs
Discrete Applied Mathematics 156 (2008) 3513 3517 Contents lists available at ScienceDirect Discrete Applied Mathematics journal homepage: www.elsevier.com/locate/dam Note A simple criterion on degree
136 CHAPTER 4. INDUCTION, GRAPHS AND TREES
136 TER 4. INDUCTION, GRHS ND TREES 4.3 Graphs In this chapter we introduce a fundamental structural idea of discrete mathematics, that of a graph. Many situations in the applications of discrete mathematics
COLLEGE ALGEBRA. Paul Dawkins
COLLEGE ALGEBRA Paul Dawkins Table of Contents Preface... iii Outline... iv Preliminaries... Introduction... Integer Exponents... Rational Exponents... 9 Real Exponents...5 Radicals...6 Polynomials...5
So let us begin our quest to find the holy grail of real analysis.
1 Section 5.2 The Complete Ordered Field: Purpose of Section We present an axiomatic description of the real numbers as a complete ordered field. The axioms which describe the arithmetic of the real numbers
The Goldberg Rao Algorithm for the Maximum Flow Problem
The Goldberg Rao Algorithm for the Maximum Flow Problem COS 528 class notes October 18, 2006 Scribe: Dávid Papp Main idea: use of the blocking flow paradigm to achieve essentially O(min{m 2/3, n 1/2 }
Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors. Content of the Precalculus Subpackage
Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors Robert J. Lopez Emeritus Professor of Mathematics and Maple Fellow Maplesoft This article provides a systematic exposition
Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:
Chapter 3 Data Storage Objectives After studying this chapter, students should be able to: List five different data types used in a computer. Describe how integers are stored in a computer. Describe how
Retrieving Data Using the SQL SELECT Statement. Copyright 2006, Oracle. All rights reserved.
Retrieving Data Using the SQL SELECT Statement Objectives After completing this lesson, you should be able to do the following: List the capabilities of SQL SELECT statements Execute a basic SELECT statement
. 0 1 10 2 100 11 1000 3 20 1 2 3 4 5 6 7 8 9
Introduction The purpose of this note is to find and study a method for determining and counting all the positive integer divisors of a positive integer Let N be a given positive integer We say d is a
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
OPRE 6201 : 2. Simplex Method
OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2
Automata Theory. Şubat 2006 Tuğrul Yılmaz Ankara Üniversitesi
Automata Theory Automata theory is the study of abstract computing devices. A. M. Turing studied an abstract machine that had all the capabilities of today s computers. Turing s goal was to describe the
When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want
1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very
Graphs without proper subgraphs of minimum degree 3 and short cycles
Graphs without proper subgraphs of minimum degree 3 and short cycles Lothar Narins, Alexey Pokrovskiy, Tibor Szabó Department of Mathematics, Freie Universität, Berlin, Germany. August 22, 2014 Abstract
SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications
Product Bulletin Sequencing Software SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications Comprehensive reference sequence handling Helps interpret the role of each
Exercise 4 Learning Python language fundamentals
Exercise 4 Learning Python language fundamentals Work with numbers Python can be used as a powerful calculator. Practicing math calculations in Python will help you not only perform these tasks, but also
Glossary of Object Oriented Terms
Appendix E Glossary of Object Oriented Terms abstract class: A class primarily intended to define an instance, but can not be instantiated without additional methods. abstract data type: An abstraction
The Fourier Analysis Tool in Microsoft Excel
The Fourier Analysis Tool in Microsoft Excel Douglas A. Kerr Issue March 4, 2009 ABSTRACT AD ITRODUCTIO The spreadsheet application Microsoft Excel includes a tool that will calculate the discrete Fourier
[Refer Slide Time: 05:10]
Principles of Programming Languages Prof: S. Arun Kumar Department of Computer Science and Engineering Indian Institute of Technology Delhi Lecture no 7 Lecture Title: Syntactic Classes Welcome to lecture
PGR Computing Programming Skills
PGR Computing Programming Skills Dr. I. Hawke 2008 1 Introduction The purpose of computing is to do something faster, more efficiently and more reliably than you could as a human do it. One obvious point
The 2010 British Informatics Olympiad
Time allowed: 3 hours The 2010 British Informatics Olympiad Instructions You should write a program for part (a) of each question, and produce written answers to the remaining parts. Programs may be used
APPROACHES TO SOFTWARE TESTING PROGRAM VERIFICATION AND VALIDATION
1 APPROACHES TO SOFTWARE TESTING PROGRAM VERIFICATION AND VALIDATION Validation: Are we building the right product? Does program meet expectations of user? Verification: Are we building the product right?
CONTENTS 1. Peter Kahn. Spring 2007
CONTENTS 1 MATH 304: CONSTRUCTING THE REAL NUMBERS Peter Kahn Spring 2007 Contents 2 The Integers 1 2.1 The basic construction.......................... 1 2.2 Adding integers..............................
A Catalogue of the Steiner Triple Systems of Order 19
A Catalogue of the Steiner Triple Systems of Order 19 Petteri Kaski 1, Patric R. J. Östergård 2, Olli Pottonen 2, and Lasse Kiviluoto 3 1 Helsinki Institute for Information Technology HIIT University of
Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy
Regular Expressions with Nested Levels of Back Referencing Form a Hierarchy Kim S. Larsen Odense University Abstract For many years, regular expressions with back referencing have been used in a variety
Local periods and binary partial words: An algorithm
Local periods and binary partial words: An algorithm F. Blanchet-Sadri and Ajay Chriscoe Department of Mathematical Sciences University of North Carolina P.O. Box 26170 Greensboro, NC 27402 6170, USA E-mail:
Mathematical Induction
Mathematical Induction (Handout March 8, 01) The Principle of Mathematical Induction provides a means to prove infinitely many statements all at once The principle is logical rather than strictly mathematical,
GRAPH THEORY LECTURE 4: TREES
GRAPH THEORY LECTURE 4: TREES Abstract. 3.1 presents some standard characterizations and properties of trees. 3.2 presents several different types of trees. 3.7 develops a counting method based on a bijection
Linear Codes. Chapter 3. 3.1 Basics
Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length
Focusing on results not data comprehensive data analysis for targeted next generation sequencing
Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes
Python Lists and Loops
WEEK THREE Python Lists and Loops You ve made it to Week 3, well done! Most programs need to keep track of a list (or collection) of things (e.g. names) at one time or another, and this week we ll show
DATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
VISUAL ALGEBRA FOR COLLEGE STUDENTS. Laurie J. Burton Western Oregon University
VISUAL ALGEBRA FOR COLLEGE STUDENTS Laurie J. Burton Western Oregon University VISUAL ALGEBRA FOR COLLEGE STUDENTS TABLE OF CONTENTS Welcome and Introduction 1 Chapter 1: INTEGERS AND INTEGER OPERATIONS
Fast Arithmetic Coding (FastAC) Implementations
Fast Arithmetic Coding (FastAC) Implementations Amir Said 1 Introduction This document describes our fast implementations of arithmetic coding, which achieve optimal compression and higher throughput by
ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite
ALGEBRA Pupils should be taught to: Generate and describe sequences As outcomes, Year 7 pupils should, for example: Use, read and write, spelling correctly: sequence, term, nth term, consecutive, rule,
AP Computer Science A - Syllabus Overview of AP Computer Science A Computer Facilities
AP Computer Science A - Syllabus Overview of AP Computer Science A Computer Facilities The classroom is set up like a traditional classroom on the left side of the room. This is where I will conduct my
Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs
CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like
a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.
Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given
Review of Basic Options Concepts and Terminology
Review of Basic Options Concepts and Terminology March 24, 2005 1 Introduction The purchase of an options contract gives the buyer the right to buy call options contract or sell put options contract some
CATIA Drafting TABLE OF CONTENTS
TABLE OF CONTENTS Introduction...1 Drafting...2 Drawing Screen...3 Pull-down Menus...4 File...4 Edit...5 View...6 Insert...7 Tools...8 Drafting Workbench...9 Views and Sheets...9 Dimensions and Annotations...10
(Refer Slide Time: 02:17)
Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #06 IP Subnetting and Addressing (Not audible: (00:46)) Now,
INTERNATIONAL TELECOMMUNICATION UNION
INTERNATIONAL TELECOMMUNICATION UNION ITU-T X.690 TELECOMMUNICATION STANDARDIZATION SECTOR OF ITU (07/2002) SERIES X: DATA NETWORKS AND OPEN SYSTEM COMMUNICATIONS OSI networking and system aspects Abstract
