Insertion and deletion correcting DNA barcodes based on watermarks

Transcription

1 Kracht and Schober BMC Bioinformatics (2015) 16:50 DOI /s METHODOLOGY ARTICLE Open Access Insertion and deetion correcting DNA barcodes based on watermarks David Kracht * and Steffen Schober Abstract Background: Barcode mutipexing is a key strategy for sharing the rising capacity of next-generation sequencing devices: Synthetic DNA tags, caed barcodes, are attached to natura DNA fragments within the ibrary preparation procedure. Different ibraries, can individuay be abeed with barcodes for a joint sequencing procedure. A post-processing step is needed to sort the sequencing data according to their origin, utiizing these DNA abes. The fina separation step is caed demutipexing and is mainy determined by the characteristics of the DNA code words used as abes. Currenty, we are facing two different strategies for barcoding: One is based on the Hamming distance, the other uses the edit metric to measure distances of code words. The theory of channe coding provides we-known code constructions for Hamming metric. They provide a arge number of code words with variabe engths and maxima correction capabiity regarding substitution errors. However, some sequencing patforms are known to have exceptiona high numbers of insertion or deetion errors. Barcodes based on the edit distance can take insertion and deetion errors into account in the decoding process. Unfortunatey, there is no expicit code-construction known that gives optima codes for edit metric. Resuts: In the present work we focus on an entirey different perspective to obtain DNA barcodes. We consider a concatenated code construction, producing so-caed watermark codes, which were first proposed by Davey and Mackay, to communicate via binary channes with synchronization errors. We adapt and extend the concepts of watermark codes to use them for DNA sequencing. Moreover, we provide an exempary set of barcodes that are experimentay compatibe with common next-generation sequencing patforms. Finay, a reaistic simuation scenario is use to evauate the proposed codes to show that the watermark concept is suitabe for DNA sequencing appications. Concusion: Our adaption of watermark codes enabes the construction of barcodes that are capabe of correcting substitutions, insertion and deetion errors. The presented approach has the advantage of not needing any markers or technica sequences to recover the position of the barcode in the sequencing reads, which poses a significant restriction with other approaches. Keywords: Next generation sequencing, Barcodes, Mutipexing, Edit distance, Watermark codes, Insertions, Deetions, Sequence embedding Background Due to the steadiy increasing throughput on patforms for next-generation sequencing and dropping prices of commerciay avaiabe devices, DNA sequencing becomes broady accessibe for researchers. Since the output of bases in each sequencing run has reached giga to tera *Correspondence: [email protected] Institute of Communications Engineering, Um University, Abert-Einstein-Aee 43, Um, Germany orders within the ast few years, strategies for efficienty sharing the sequencing capacity has become of particuar interest. Mutipexed sequencing is a major key technique, that makes sequencing devices accessibe in parae: DNA sampes from different experiments can be pooed into batches and sequenced in parae in a singe sequencing run. Before joining different sampes it is mandatory to uniquey abe the DNA fragments. DNA barcodes, artificiay synthesized sequences of nuceic acids, are used as 2015 Kracht and Schober; icensee BioMed Centra. This is an Open Access artice distributed under the terms of the Creative Commons Attribution License ( which permits unrestricted use, distribution, and reproduction in any medium, provided the origina work is propery credited. The Creative Commons Pubic Domain Dedication waiver ( appies to the data made avaiabe in this artice, uness otherwise stated.

2 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 2 of 14 abes to tag the fragments and to separate the output of the sequencers according to the input sampes. The robustness of mutipexing in genera reies on the properties of the used barcodes and how we they are adapted to the underying sequencing protoco and patform. Essentia experimenta pre-processing steps, which are needed to prepare the DNA materia can cause errors on the target genomic sequence and the barcodes as we. Physica and chemica sequence modifications, e.g. fragmentation, igation, or copy procedures are known sources of such errors. These errors ead to cross-tak during demutipexing, i.e., sequences from different batches can not ceary be distinguished, which is of course highy undesirabe. Different constructions for barcodes have been proposed, for exampe those of Hamady et a. [1] and Bystrykh [2] are based on Hamming codes [3] or the approach of Krishnan et a. [4] based on BCH codes [5], to name just a few. For short engths it is even feasibe to appy brute-force search techniques, e.g. Frank [6] or Mir et a. [7], where some of the resuting codes of the atter approach even reach fundamenta bounds. The constructions mentioned so far are designed to correct substitution errors ony. From a conceptiona point of view a of them try to provide codes that maximize the so-caed Hamming distance between the individua code words. The Hamming distance between a pair of sequences measure the minima number of symbo-wise substitution that are needed to transform them into each other. But, some specific sequencing patforms are known to have exceptiona high numbers of insertion and deetion errors as reported for Roche 454 Pyrosequencing [8], PacBio sequencers [9] or Ion Torrent PGM [10]. See [11-13] for a comparison of these sequencing techniques. Hence, especiay for insertion and deetion prone devices one has to consider barcodes that are capabe to correct indes. Promising attempts to find barcodes that are robust to indes have been considered in [14,15] using the so-caed edit or Levenshtein distance (see [16] for an overview). For cacuating the distance between code words the Levenshtein distance takes insertion and deetion operations into account and is therefore better suited for appications where decoding based on Hamming distance fais. Unfortunatey, there is no code-construction known that directy gives optima codes in edit metric. Some greedy (ater evoutionary) agorithms has been proposed in [17,18] to find sets of barcodes of moderate size with high minima edit distance, additionay fufiing bioogica constraints. However, a practica decoding step for the obtained barcodes has not been addressed in the mentioned papers. This was ater done by [19], where it is stated that maximizing the edit distance for barcodes (within a sequence context) is a sub-optima or even wrong strategy. The context of a code words, which is simpy the sequence that contains the DNA tag pays an important roe. Due to indes the exact boundaries of code words can not be correcty recovered. This eads to additiona errors, if the sequence context was not incuded in the code construction. The DNA context at one end of a code word can be taken into account by using an adapted Sequence-Levenshtein distance, as proposed in [19]. In this manuscript, we provide an entire different perspective to obtain barcodes. We give codes based on concepts introduced by Davey and Mackay [20]. The origina watermark approach is aimed to synchronize and decode a continuous stream of arge binary data-bocks. In the domain of DNA codes we face additiona constraints, for which the origina concept is adapted. We finay give an exempary set of barcodes and provide an in siico appication, which shows that demutipexing based on the watermark concept is appicabe in the fied of next-generation sequencing. Basic concepts of watermark coding has aready been considered for data embedding in DNA [21,22], which is cosey reated to the barcoding approach for DNA sequencing. But, the transmission of bioogicay compatibe sequences through an evoutionary channe (in iving ces) is ony sighty simiar to the approach we consider in the present manuscript. Note, that search approaches ike [16,19] can be used to find better codes in terms of code rate and minima (sequence) edit distance, but we see two striking advantages of the watermark concept for barcoding. First, the watermark concept contains an impicit synchronization technique, that does not need preambes or markers to find boundaries of code words within an unknown sequence. Embedding of barcodes in an unknown context is not generaized in approaches considering barcodes based on (sequence) edit distance. A two-ended embedding of sequences is not refected in this metric. Furthermore, we are abe to give an optima decoding procedure, adapted to a specific error channe. In short, this enabes a maximum degree of freedom for existing as we as future experimenta settings. Decoding speed is the second important aspect of mutipexing approaches, as the number of barcodes and the avaiabe read-ength dramaticay increased within the ast few years. The decoding of barcodes based on watermarks aso provides an efficient method for fast decoding of arge scae mutipexing approaches. Methods The main concepts presented in this section originates from the ideas first given by Davey and MacKay [20]. The fact, that quaternary watermark codes can be appied for DNA sequencing was preiminary outined in [23], but sequence constraint on practica oigonuceotides can not be found in this conference paper. Let us denote some basic facts about barcodes first and postpone specific

3 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 3 of 14 sequence constraints to the section about reasonabe encoder settings. For appying the concepts of watermark codes on DNA barcoding we have to focus on the foowing constraints, with impications for the modifications we consider here. We highight our contribution to the basic concepts with the foowing barcode constraints: A barcode is a quaternary sequence. Therefore we wi propose generaized non-binary modes and concepts here (the origina channe was considers stricty binary). in genera embedded in other sequences. Hence we give an adapted transmission scheme and a nove approach to detect barcode boundaries (in [20] the watermark pattern is repetitive and symbos are decoded as stream). ength-imited. That consequenty restricts the number of code-constructions for which an adaption is practica (ong LDPC codes are use in the origina paper, not pausibe for short barcodes). We wi stress additiona contributions to the work of Davey and MacKay, where needed. Let us start with a mode for the sequencing process and continue with encoding and decoding based on watermarks. Sequencing mode We wi first define a communications theoretic mode to formay describe the barcoding appication. Substitution mode A simpistic communication theoretic mode for the sequencing of barcodes has been proposed in [7]. Namey, afixedbarcodewordb A N,withA = {A, G, C, T} as set of possibe symbos (nuceic acids), is entering a communication channe with output r A N,wherethe channe itsef is described by the conditiona probabiities to receive a string y if x was sent, for x, y A N. For our purposes this mode has to be extended into two directions: First, a barcode word b is assumed to be embedded into a randomy chosen context to obtain the sequence t (detais wi be given beow). Second, the sequence t is sent over a so-caed Sequencing Channe resuting in the received word r. The channe not ony substitutes symbos from t, but is aso abe to deete or insert symbos, hence the ength of r and t may differ. Embedding of barcodes Given a barcode word b of ength N the sequence t is obtained as the concatenation of two random sequences t pre, t post and the sequence b as foows t = t 1 t } {{ } δ+1 t δ+n t } {{ } L A } {{ } L, t pre b=b 1 b 2 b N t post with δ {0, 1,.., L N}. The sequences t pre and t post are assumed to have a random ength and the symbos are chosen uniformy at random, hence the ength L of t is a random variabe. We wi ater choose the engths of the embedding sequences according to a quasi-norma ength-distribution on integers (see Section Simuations). Barcoding and working hypotheses In this paragraph we ike to focus on two different paradigm for sequence embedding in our barcoding approach: First we ike to address the tota embedding of barcodes. We, for most sequencing scenarios, we find the barcodes next to fixed technica sequences, which are more reiabe due to sequence specific (bioogica) reactions, e.g. primer or adapter oigonuceotides. For sequencing experiments, where we can rey on the exact knowedge of the position of a barcode at one end, highy optimized code words has been proposed in [19]. As an extension of our work is possibe to incude t pre or t post as partiay known, nevertheess we want to restrict ourseves to have no prior knowedge about the context. The main gain from this very strict premise is a striking freedom of experimenta setups for which we can appy our concepts. For exampe on patforms with paired end sequencing we are abe to decode barcodes independent of the direction of reads not restricting the reads to start with a barcode symbo. The second aspect is the assumption of inherent barcoded sequences. In this manuscript we focus on the probem of decoding (discrimination of code word sequences), conditioned on the fact that a code word is present in the mutipexed sequences. The probem of detecting a barcode (if it is not guaranteed that every sequence contains one) is a more chaenging probem, which we want to avoid in the present assay. On codes based on sequence edit distance this extended probem is addressed for, e.g. the PacBio SMRT patform in [24]. We conjecture that the detection probem of barcodes based on watermarks can be soved in future investigations. Such investigation might aso ead beyond the barcoding for sequencing appications, which we wi address at the end of this manuscript. Nevertheess, we rey on barcoded sampes for the foowing considerations. Sequencing channe mode We define a very simpified mode for sequence errors and discuss some aspects of oversimpification in the next paragraph. Let us describe the processes invoved in sequencing as a memoryess quaternary channe, i.e. each symbo is handed independenty of others. This channe mode is specified by a set of parameters S, p i and p d that are integrated as foows: A transition matrix S R 4 4 which describes the substitution probabiities, and p i and p d to specify insertion and deetion probabiities.

4 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 4 of 14 The channe is modeed as an infinite state-machine on symbo eve, in which a symbo t i is queued to pass the channe and therefore wi undergo one of the foowing three events: With probabiity p i,thesymbot i remains in the queue and the received stream is proonged with a random inserted symbos where we assume an uniform distribution on A ={A, G, C, T}.Withprobabiityp d,the actua queued symbo is deeted. With probabiity p t = 1 p i p d the symbo t i is passed to a substitution channe which substitutes the symbo t i according to the transition matrix S, with transition probabiities S(r j, t i ) = Pr(r j t i ). In order to downsize the number of parameters in our mode, we wi consider S as symmetric 4-ary substitution matrix ater, i.e., we consider substitutions from a base into another with a singe error parameter (simiar to the mode proposed by Jukes and Cantor [25]). Nevertheess, the mode considered here woud be abe to mimic a refined channe, if exact empirica parameters can be considered. Empirica parameters for a channe mode Empirica data about sequence errors can be seen as the crux of a HMM based approaches in the fied of DNA sequencing, as predictions can ony perform as good as the underying assumptions. But, aside from advertising error rate of the big vendors of sequencing devices, rea estimates are rather rare to find in iterature. Further, there is no rea agreement about a common technique to mine such data in a correct way, e.g, using sequence aignment to predict error count is recenty in critic of over-fitting, due to predefined aignment costs with impact on the estimates. The commutabiity of substitutions, insertion and deetion events (a substitution is equa to a deetion foowed by an insertion) made things even more difficut. Additionay, there are some indications, that the sequencing channe is more compex as the mode we utiize in our approach: Sequencing errors seem to be highy depend on the sequence context, with extended impications on the distribution of symbos in sequenced reads. For Iumina this was shown, e.g. in [26]. We know that the DNA poymerase moecue is prone to bursts of insertions and deetions, if for exampe repetitive symbos (homopoymers) are present in the physica tempate sequence. Note, that the reiabiity of the approach presented here is sensitive to empirica channe parameters to obtain best estimates for demutipexing sampes. A stepwise refinement driven by the feedback of experimenta studies is mandatory to adapt watermark barcodes for the demands of different sequencing patforms. Demutipexing probem During the mutipexing step, different sampes are abeed with different barcodes, for exampe t is abeed with b (b embedded in t )andt contains b (see Figure 1). After Figure 1 Sequence embedding of barcodes. Two exampe sequences t and t abeed with barcodes b and b embedded in an unknown DNA context. passing the discussed sequencing channe the resuting sequences r and r can differ from t and t.asthebarcodes are possiby affected by errors, r and r might be associated to the wrong origin during the demutipexing step, what is caed crosstak. The encoding and decoding based on watermarks is abe to reduce such crosstak. Barcode construction on watermarks For the construction of barcodes we use concatenated encoding simiar to the scheme proposed by Davey and Mackay, which consists of the foowing two bocks (see Figure 2): An outer code C 1 with parameters [F q1, n 1, k 1 ] is a code of ength n 1, dimension k 1 and aphabet size q 1 (Gaois fied F q1 ), that provides a set of q k 1 1 code words. Note, that ong LDPC outer codes are used in [20]. In order to avoid the constructions of short LDPC codes for barcoding, we wi consider different outer codes ater. It is worth to mention, that minima distance and the abiity for soft decoding is the ony demands on outer codes. We consider information words c F k 1 q 1, which are mapped to inner code words d C 1 F n 1 q 1. The code C 1 provides a set of code words with a high minima Hamming distance. The n 1 k 1 redundant symbos are used to arrange outer code words as distant as possibe. Such a code can give a code rate of R 1 = k 1 n 1. The second bock consist of an inner encoder, which works in a compete inverse direction. An inner code C 2 is used to create barcode words that have a ow Hamming distance to a watermark sequence. The simiarity of a barcodes to this watermark pattern is utiized to gain synchronization as expained beow. The inner code adds redundancy to the code words by mapping each outer symbos d j F q1 to a sparse sequence representation s (j) Z n 2 4. The set of sequences s(j) with ow mean Hamming weight (number of non-zero symbos) can be seen as inner code C 2. The cardinaity of this inner code set is q 1 andtherateoftheinnercodecanbestatedas R 2 = og 2 (q 1) n 2 og 2 (4). By joining n 1 oftheseinnercodewords,a

5 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 5 of 14 sparse inner bock s = s (1) s (2) s (n1) of ength N = n 1 n 2 is generated. Abarcodewordb = s w A N is obtained via a symbo-wise adding of an arbitrary watermark sequence w to the sparse inner bock s using a fixed mapping of A = {A, G, C, T} onto Z 4, where addition is defined moduo 4 (expicit mappings in Additiona fie 1 section). The fina set of barcodes is denoted as C = C 2 C 1 throughout this manuscript. The code C gives a code rate R = k 1 og 2 (q 1 ) n 2 og 2 (4). Decoding The decoder consists of two bocks (see Figure 2): An inner decoder D 2, which utiizes a hidden Markov mode (HMM) and channe parameters H to provide symbowise ikeihoods. These are fed into the outer decoder D 1 that performs a maximum ikeihood decoding to obtain an estimate ĉ of the sent code word. We wi first define the HMM and expain how this can be used to find the ikeiest transmitted sequence (optima decoding). Afterwards we give a modified HMM, that enabes to estimate the boundaries of embedded barcodes and end this section with a suboptima symbo-wise decoding approach with owered compexity. HMM for decoding The basic idea of the HMM presented in this paragraph refers to considerations in [20]. To expain the function of the HMM for decoding it is hepfu to ignore the random context and the embedding of code words initiay. Therefore we assume t pre and t post to be absent and the transmitted sequence is exacty one barcode, i.e. t = t 1 t 2 t M=N = b. If no insertions or deetions occur in a channe the received sequence r = r 1 r 2 r L is as ong as the sent sequence t, i.e.l = M, but some symbos r i might differ from t i. Assuming that errors occur independent of the position and equay distributed, we can use fixed substitution probabiities to describe the channe. For channes with insertion and deetion events the symbo-wise fixed association t i r i is usuay ost. For exampe, for a singe insertion event at the i-th symbo, t i wi be associated to r i+1. A singe deetion event before transmitting the i-the symbo, wi shift t i to the symbo r i 1 in the received sequence. Obviousy, such errors accumuate during the transmission. One of the main probems of decoding is to estimate the number of insertions and deetions given a received sequence r. Therefore we define the drift x i at the i- th transmit symbo as (# insertions) - (# deetions) that occurredinthereceivedsequencebeforetakingt i into account. The drifts {x i } can be seen as the hidden states of an HMM. Further, we assume the received sequence r = r 1 r 2 r M = r (1) r (2) r (L) Figure 2 Transmission scheme for barcodes based on watermark codes. Consisting of three bocks: 1) outer coder, encoding index words c as code words of a inear code C 1 [ F q1, n 1, k 1 ] and soft-decoder D 1. 2) inner coder, with inner code C 2 (for sparse sequence representation), decoder D 2 (providing ikeihood vaues) and watermark sequence w (known to encoder and decoder). 3) embedding of barcodes in random context and channe with insertion, deetion and substitution errors.

6 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 6 of 14 to be assembed of sub-sequences r (i),asobservabes, based on the hidden state x i and the transmit symbo t i. Every state transition x i 1 x i causes an emission of a sub-sequences r (i), that is associated to the position i in t (in genera HMMs the emissions are associated to singe states and not to transitions, compare with Figure 3). To characterize the transition probabiities among hidden states and the emission probabiities of observabes in the HMM, we use the foowing set of parameters H : {S, p i, p d, I}. Athough we used an identica notation for parameters as before (infinite state-machine in section Sequencing Channe), the channe mode and the HMM discussed here are not equivaent. The matrix S is parametrizing substitution events, with p i and p d we mode the probabiities of insertion and deetions and I is used to imit the maxima ength of observabes. For computationa reasons we suppose that ony I consecutive (eading) insertions can be present in observabes (the channe mode is capabe to insert an infinite number of symbos). We further assume that the ast symbo of an observabe r (i) is created by either deeting the actua transmit symbo t i or appending t i at the end, substituted according to S. Ift i is deeted, we can either just observe r (i) = ɛ (the empty symbo) or the ast symbo of the observabe wi be a random (inserted) symbo, if eading insertion are present. This imits the set of observabes in our mode to r (i) ɛ, t i, t i, t i, t i, t i,..., } {{ } I, whereweusewidcards to symboize random eading insertions. With t i Z 4 \ t i we denote symbos differing from t i. Now we give the transition and emission probabiities of the stated HMM. Due to our non-binary adaption, we use a sighty different notations than used in the origina approach in [20]. An eementary part for both Figure 3 Iustration of the HMM. Exempary series of transitions of the HMM and observabes r (j), based on state transition x j 1 x j and transmit symbos t j for j {i 1, i, i + 1}. t i probabiities is the ength distribution of observabes, that can be derived from 1 for = 0 α ( (p ) = i ) for 0 < 1 (p < I i ) I 0 ese, the probabiity of observing random insertions, conditioned on an observabe of ength. The probabiity for a sequence r (i) of ength is given by p() = α ()p d + α ( 1)p t, with p t = 1 p i p d. The sequence can be obtained via insertions and deeting t i or via 1 eading insertions and attaching t i (or t i )asastsymbo. The transition probabiity of the HMM can easiy be stated as Pr(x i x i 1 ) = p() by choosing = x i x i 1 + 1, i.e. two consecutive drift states determine the ength of observabes and vice versa. The joined probabiity Pr(r (i), x i x i 1 ) of observing r (i) with a state transition x i 1 x i is Q S (, r (i), t i ) = α () p 4 d + α ( 1) p 4 1 t S(r (i), t i ), with 0 I + 1. Expicity, r (i) can consists of random insertions and is not associated with t i (as t i is deeted with probabiity p d ). It is aso possibe to have 1 eading insertions and t i is substituted according to the substitution matrix S. The probabiity Q S can easiy be used to cacuate ikeihoods Pr(r H, t) as shown in the next section. Note, for the origina binary perspective in [20] no matrix representation of the substitution probabiity is needed. Here we generaize the basic concepts. Optima decoding Given a received sequence r = r 1 r 2 r L,thesetof possibe barcodes b = b 1 b 2 b N C and mode parameters H we can describe the decoding as the foowing maximization approach ˆb = arg max{pr(r H, b)}. b C The sequence r is most ikey to be captured, if ˆb is assumed to be the transmitted sequence and the channe exacty acts as described with the mode parameters H. LetusshortyexpainhowPr(r H, b) can be cacuated for the given HMM. The foowing methods can be found in common text books or, e.g. in [27], but we reca the cacuations as we focus on a very specia kind of HMM here. Therefore we have to associate the observabes of the HMM to received symbos, which is achieved with the drift states. If the barcode symbo b i isassumedtobethe i-th transmit symbo t i,thenb i can be associated to the i- th observabe r (i) by the HMM. The drift state x i can be used to associate the ast symbo r (i) of an observabe to

7 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 7 of 14 the symbo r i+xi in the received sequence r. Tocacuate ikeihood vaues we consider the so-caed forward metric F i (y) = Pr ( r 1 r i 1+y, x i 1 = y H, b ) as probabiity of having reached the drift state y and received symbos r 1 r i 1+y before considering the i-th transmitted symbo. In a transmission setting where t = b we can not have any non-zero drift state before considering the first symbo b 1.ThuswedefineF 1 (y) = Pr(x 0 = y) as F 1 (y = 0) = 1andF 1 (y = 0) = 0. The forward metric can easiy be cacuated as I+1 F i+1 (y) = F i (y δ )Q S (, r i+y, b i ), =0 with δ = 1, for 1 i N and 1 i+y M.Theikeihood vaues Pr(r H, b) can be obtained as F N+1 (L M). Using this optima decoding approach there is no need for outer decoding and demutipexing ends up with an estimated barcode as output of the inner decoder. For arge sets of code words this maximum ikeihood approach becomes computation expensive, as for an outer code of dimension k 1 and symbo aphabet of size q 1 there are q k 1 1 cacuations of the forward metric eft for decoding. Before we give a suboptima decoder with reduced compexity, we have to consider how sequence boundaries can be obtained for barcodes embedded in random context. HMM to estimate barcode boundaries Up to this paragraph we have not discussed the roe of the watermark in the transmission. We have just iustrated an HMM abe to give the ikeiest received sequences r conditioned to a hidden transmit sequence t and parameters H. Wewinowtakeacoserookatthewatermark and how the inner encoding can be understood from a communication theoretic point of view (see Figure 2). Reca that a barcode is constructed via addition of two quaternary sequences s, w Z N 4 as b = s w. Due to the symmetry of the addition, there are two ways to perceive a data transmission: Apparenty there is the transmission of s w b with w causing some distortion. A further vaid perspective of the transmission is w s b, which takes w as source with s causing substitution errors. Here we stick to the ast concept treating the symbos s i as independent and identicay (iid) distributed errors on w (which is used to find the boundaries of barcodes). Givenaniidassumptionforsymboss i and the inner code C 2 we can cacuate an extended substitution matrix S,withprobabiitiesS (b i, w i ) for pacing a symbo b i in the barcode, when a w i ispresentatpositioni.weusethe matrix S to define an upstream meta-channe that causes additiona substitutions to those generated by the rea sequencing channe. In anaogy to the previous paragraph, we can state emission probabiities ( ) ( ) Q E w i, r (i) = α () p 4 d + α ( 1) p 4 1 t E r (i), w i, for observing a certain r (i) if the symbo w i is present at position i.with E ( r (i), w i ) = b ( S r (i) ), b S (b, w i ) we denote the effective probabiity of having substitutions duetothesequencess and the substitution matrix S.The task of estimating barcode boundaries can be reduced to the estimation of the ikeiest sequence of drift states Pr(x 1 x 2 x N+1 H, r, w) in the HMM using Q E w i as we show in the next section. Finding barcode boundaries For embedded code words we can understand the symbo b i as shifted transmit symbo t δ+i and thus b i has to be inked to the observabe r (δ+i) in the HMM (see section Embedding of barcodes). But we can easiy integrate the sequence offset δ as initia drift x 0.Forthesymbosb i we therefore redefine the drift states {x i } for the HMMs according to embedded symbos. To cacuate ikeihood vaues for received sequences, we now consider the forward metric F i (y) = Pr(r 1 r i 1+y, x i 1 = y H, w) as probabiity of having reached the drift state y and received symbos r 1 r i 1+y before considering the i-th watermark symbo. We furthermore have to determine an initia distribution for the quantity F 1 (y) to be abe to cacuated the forward metric I+1 F i+1 (y) = F i (y δ )Q E ( ) w i, ri+y, =0 with δ = 1. We can furthermore cacuate backward quantities B i (y) = Pr(r i+1+y x i = y, H, w) as probabiity of receiving a certain tai of symbos starting with r i+1+y given a state y associated with the i-th watermark symbo, which can be cacuated as I+1 B i 1 (y) = B i (y + δ )Q E ( ) w i, ri+y+δ. =0 If we have a good guess for the distribution F 1 (y 1 ) and B N (y N ), i.e. an a priori distribution of having barcode boundaries at position y 1 respective N + y N,wemakeuse of it. To enabe an aignment of the embedded barcodes, we have to introduce nove prior distributions, sighty different to those proposed by Davey and MacKay. Anyway, a conservative approach is setting non-zero probabiities for

8 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 8 of 14 F 1 (y) = M 1,where0 y M and B N(y) = 1ondriftpositions N y M N. Herewehavetodifferfromthe origina concept [20], which is needed to detect a singe non-repetitive watermark within an unknown context. The ikeiest drift associated with the i-th embedded symbo is finay inferred as ˆx i = arg max y { Pr(y H, r, w) Fi+1 (y)b i (y) }. The estimates ˆx 0 and ˆx N are used to refer to the barcode boundaries in the received sequence. Symbo-wise ikeihoods We can finay perform a symbo-wise decoding as foows: The forward and backward metric does not ony provide estimates for the start and the end position of an entire barcode word, but aso enabes to cacuate conditiona ikeihoods ( P r H, w, s (j)) = F u (y)π u (y, z, s (j)) B v (z) y,z basedoninnercodewordss (j) C 2. With the indexes u = (j 1)n 2 + 1andv = jn 2 we denote deimiters of s (j) (compare Figure 2). The drifts y and z define potentia boundaries u + y and v + z of an emitted sub-sequence of r that is assumed to depend on s (j).withπ u (y, z, s (j) ) we symboize a truncated version of the forward metric, starting at states y andendingatstatesz. Fortheevauation ( of π we further consider emission probabiities Q S, r (i), w i s i ). As the inner code words are determined by outer code symbos, i.e. C 2 : d j s (j),we can easiy derive symbo-wise margina a posteriori probabiities P(d j H, r) from the conditiona ikeihoods. The symbo-wise marginas are finay utiized in the outer coder (see Figure 2). Using this suboptima inner decoding approach, we are abe to decrease the computationa costs to q 1 n 1 evauations of the forward metric π (compare section Optima decoding). As we need an additiona soft decoding for the outer code, there are further operations needed: For a maximum ikeihood approach we have to consider q1 k code words and search for the ikeiest soution. Simuations In order to perform an in siico appication of barcodes based on the watermark concept, we first have to define some reasonabe setting for encoding, which is aready quite chaenging. Reasonabe encoder settings As stated before (see section Barcode construction on watermark), there are different parameters, which infuences the concepts and for which we have to find a reasonabe setup to run simuations. First there is an outer codes, which shoud in combination with the inner coding ead to short barcode words, because we do not want to produce exceptiona overhead with mutipexing (tagging) target fragments. There is the minimum distance of outer code words and the sparsity of the inner encoder, which can independenty be characterized. And finay we have the watermark sequence, which can randomy be chosen. We utiize the degree of freedom with the watermark to incorporate with additiona sequence constraint, that barcodes shoud fufi to be experimentay vaid. Therefore we run a greedy search for the watermark pattern that maximizes the number of barcodes that meet a sequence constraint. But et us consider a particuar setting one by one in the foowing paragraphs. Suitabe outer codes For the construction of barcodes we focus on a targetength of N = n 1 n 2 [12,.., 25] symbos with n 1 {3, 4, 5, 6} and [ n 2 {4, 5, 6, ] 7, 8}. Further we imit the outer code C 1 Fq1, n 1, k 1,d H to the best known inear codes isted in [28] for severa cardinaities of Gaois fieds F q1, for which we considered q 1 {2, 3, 4, 5, 7, 8, 9}. Long LDPC codes has been used in the origina approach of Davey and MacKay (see [20]), but as the construction of short LDPC codes woud be somehow confusing for readers invoved in channe coding, we decided to use the best known inear codes. But we might note, that the hamming distance and the abiity for soft decoding is the ony demand on outer codes here and LDPC codes are ikey to perform equivaent. The minima Hamming distance d H of the best known inear codes we used is either maxima or the highest known regarding given parameters. Additionay we bound the dimension k 1 to achieve code set sizes 48 < C 1 =q k 1 1 < This guarantees a certain minima code rate on one side and imit the computationa effort of the outer decoding the other side. We end up at 263 possibe code configurations, but most of the resuting codes perform disastrous with the watermark concept, because the watermark is heaviy corrupted by inner code words. Sparsity of the inner code We consider the density (mean Hamming weight) as w H (C 2 ) = 1 w H (s) n(c 2 ) C 2 s C 2 for the inner code, with w H ( ) as Hamming weight, n( ) as ength and as cardinaity of the code. The inner code C 2 needs to exhibit a ow density, to keep the watermark shining through the barcodes, when inner code words are added. For arge densities there is no abiity eft to detect the barcode boundaries and consequenty decoding wi fai competey. Inspired by the approach in [29] we avoid

9 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 9 of 14 thea-zeroscodewordinc 2,butfurtherboundthedensity to w H (C 2 ) < 0.3, to keep the watermark present. Finay, we end up in a set of 73 parameter configurations for q 1, k 1, n 1 and n 2. For each of the 73 different parameter sets we run a brute-force search (10 7 trias), where we iterativey seected one inner code and watermark randomy to produce barcode sets. From the evauated settings we kept the one, which met the foowing sequence constraints. Sequence constraints We fitered for code words with unbaanced counts of symbos, to respect imitations on the GC content of barcodes. Such fitering can be seen as a de facto standard in the construction of barcodes (see for exampe [1,7,19]) and is reated to technica constraints due to the preparation and the sequencing of genomic materia. The reative frequency of a subset of two symbos shoud not be beow 40% and above 60% in each barcode, otherwise we excuded the barcode. We furthermore excude barcodes with prefect sef-compementation and more than two sequentia repetitions of the same base (homopoymer ength), simiar to the restrictions stated in [24]. We consider this settings as sufficient and strict enough to avoid experimenta probems during the preparation in rea sequencing tasks. Discarding such inappropriate code words means an additiona oss in code rate. Increasing the mean edit distance For decoding based on the HMM approach, edit distance impicity matters, thus we try to increase the mean edit distance of code word with a simpe strategy. For two sets of barcodes with identica counts of remaining barcodes (after fitering) we keep the one maximizing the mean edit distance d E (C) = 1 n(c)(n(c) 1) b i,b j C:i =j d E (b 1, b 2 ), where d E (b 1, b 2 ) denotes the edit distance [30] of barcodes b 1 and b 2, which can be understood as the number of singe-symbo sequence operations to transform b 1 into b 2 or vice versa. But, as the edit distance is bounded by the hamming distance, we do not gain a ot with this fina heuristic refinement step. Estimating the decoding error To evauate the refined set of 73 codes, we give the foowing demutipexing scenario. For each code we consider an error curve according the estimates of the decoding error probabiity on different channe settings. The estimation of a singe point in the error curve is based on a set of 200,000 barcodes, which we refer to as batch. Each batch is processed with a certain symbo mutation probabiity p mut [.01,.16]. This vaue determines a symbo-wise probabiity for an edit operation that can be caused by our channe mode. Simiar approach has been considered in [19], but we ike to be a bit more precise with the description of the modifications of the channe mode. Buschmann et a. [19] considered a minima mutation probabiity of 10 1 for their evauation, but others caim, that error rates are severa orders ower [11]. We wi take such indications into account. Each barcode from a batch is embedded in a random context of variabe ength (compare section Embedding of barcodes). We use norma distributed random variabes (with mean μ = 50 and variance σ = 5) to determine the ength of t pre as we as t post.wesimpytakethenearest non-negative integer to define the ength of the random context, which is uniformy distributed on {A, C, G, T}. We further use a state machine to produce erroneous received sequences based on the foowing four events: correct transmission C, substitution S, insertion I and deetion D. In sighty different notation to the former mode (see. Sequencing Channe) we assign probabiities to the events C and S and do not use conditiona probabiity ike a substitution matrix S. The probabiity for correct transmission C or substitution S is equa to p t in the former representation. The probabiity for C equas 1 p mut and the probabiity for any of the error events (S,I or D)isp mut. Every error event S, I or D is equa ikey. To obtain an equa distribution among the mentioned events, it is easier to use the present notation, but equivaent behavior can be approached with both versions of the channe state-machines. To save decoding time we iterativey pass each transmit sequence through the state machine, unti we have at east one error event within the barcode region. The probabiity of having a defective code word of ength N is Pr(def) = 1 (1 p mut ) N. We further considered an error-free barcode to be decoded perfecty, i.e. Pr(err def) = 0, with err denoting the event of decoding error. Pease note that this oversimpifies the fase positive rate introduced by sporadic simiarities of the context with dedicated barcode words. The rate is supposed to grow inear with the context ength and the size of barcodes set. Nevertheess, the probabiity for fase positive events exponentiay tend to zeros with the ength N of the code words. For barcodes of ength 12 embedded in 100 random symbos we consider this error as margina offset for the estimated decoding errors. Our evauation finay end up in estimating the conditiona error Pr(err def), which gives an estimate for the unconditioned error probabiity as P e = Pr(err def)pr(def). Resuts and discussion First we give a rough overview on meaningfu properties of watermark-based barcodes with the considered 73 parameterizations. Furthermore, we provide a refined

10 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 10 of 14 anaysis and evauation for certain codes in the aready defined decoding scenario. To minimize the textua eements in the figures, we use the foowing notation: q 1 k 1 n 1 n 2 indicates [ the ] concatenated coding using an outer code C 1 Fq1, n 1, k 1 and an inner code C2 : F q1 Z n 2 4. The particuar codes can be found in the Additiona fie 1 section of this paper. Properties of barcodes based on watermarks In Figure 4 we ink the principa characteristics as mean edit distance d E,meandensityw H and the cardinaity C 2 C 1 of barcodes in a comprehensive iustration. We use star-symbos to indicated the mean density (severa eves) and two dimensiona coordinates to ink mean edit distance and the size of code sets. There are distinct bocks, where the infuence of the inner code can be separatey examined. With fixing the outer code, e.g. q n 2 or q n 2 for q 1 {7, 8, 9} andincreasingtheengthn 2 of inner code words, one can deduce how code rate is exchanged for owered mean density and increased mean distance. However, we see configurations, for exampe n 2 for n 2 {4, 5, 6}, where we have not been abe to increase the mean distance by proonging the inner code. We have observed severa custers, where simiar effects can be found. For the codes q 1 k 1 4 n 2 or q 1 k 1 5 n 2 and q 1 k 1 6 n 2 the effect of the outer code can be separated. We find custers of star symbos at different mean distances (see darkened areas in Figure 4). These eves can be expained through the different minimum Hamming distance of the outer codes. We have Hamming distances 2,3 and 4 for the present outer codes at n 1 equas to 4,5 and 6. For concatenated coding it is known that the minimum distances of inner and outer codes are mutipicative [31]. As the edit metric is upper bounded by the Hamming distance, we anticipate the described eves for edit distances. The mentioned eveing can consistenty be found for a outer codes. Despitewehavemaximizedtheeditdistanceofbarcodes on average, it is aso interesting to focus on the pairwise distance of barcodes. To examine the distance in detai we utiize the so-caed distance distribution. In [32] the average number of code words at a certain distance to a fixed code word is considered as an usefu distance measure for non-inear codes based on Hamming metric. The edit distance distribution of a codes C consists of the numbers D e = 1 { (i, j) :d E (b i, b j ) = e, b i, b j C) }, M Figure 4 Properties of the examined barcodes based on watermarks. Star-symbos indicate the mean density and two dimensiona coordinates [ ink the ] mean edit distance d E and the size C 2 C 1 of codes. Parameters of the barcodes are abeed as q 1 k 1 n 1 n 2, denoting an outer code C 1 Fq1, n 1, k 1 and inner code C2 : F q1 Z n2 4. Codes evauated in refined anaysis/simuations (see. Figure 5 and 6) are coored.

11 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 11 of 14 where M = C ( C 1) and d E denotes the edit distance.infigure5weiustratethedistancedistribution of barcodes with parameters 8 3 n 1 n 2 (Figure 4, in bue). Apparenty, there are particuar pairs of code words with very ow edit distances, but as we reca the code construction based on inner code words with very ow Hamming weight, this fact is not too surprising. Nevertheess, some onger codes show a negigibe amount of such code words with sma edit distances. For instance, in the code ess than 1% of a possibe pairs of barcodes show an edit distance d E < 6. According to the very strict fitering (see Sequence constraints), we had to prune out the sets of q k 1 1 outer code words, what additiona owers the code rate. A detaied description about the excuded barcodes can be found in the Additiona fie 1. Evauation of decoding In Figure 6 we iustrate the estimates P e for the decoding error probabiity of different codes settings. We ran simuations for a 73 barcode set and ranked the sets according the decoding behavior. To give a rough outine for the performance of our approach we show the barcodes, which performed best (Figure 4 and 6 in green) and worst (Figure 4 and 6 in red). A barcode ength of 12 (codes q 1 k 1 3 4) seems to be insufficient to provide a good synchronization based on watermarks and thus the majority of decoding errors were found to be caused by synchronization issues (resuts not shown). The best performing sets of barcodes surprisingy have not occupied the maxima possibe ength, but 24 symbos. As there are ony 49 sequences avaiabe, this set of code words provides a poor code rate of og 2 (72 ) og 2 (4 24 ) = A reasonabe trade-off between error-correcting capabiity and cardinaity is provided for exampe by codes with parameters 8 3 n 1 n 2 (Figures 4, 5 and 6, in bue). Athough we are facing reative ow code rates (compared to approaches ike [19]) ranging from to 0.281, more than 400 sequences avaiabe with quite surprising decoding capabiity. Decoding compexity According to [20] we utiized a fast decoding approach with reduced compexity for our simuations. We bounded the maxima number of possibe drift states {x i } in a HMMs to {5,.., 20} according to the suggestion given by Davey and Mackay. The compexity for decoding a singe embedded barcode is O(MLI + q 1 N I + q k 1 1 ),with N denoting the ength of the barcode, that is embedded in a received sequence of ength M. Decoding is based on the assumption that the channe can introduce maxima I inserted symbos and we consider the maxima drift between received and transmitted sequence to be imited by. An order of MLI cacuation are needed to estimate barcode boundaries, n 2 I operations are needed to provide soft-vaues for each of the q 1 n 1 possibe outer code words (resut in q 1 N I) andq k 1 1 fina comparisons has to be spent for soft decoding the outer code in the most expensive case (maximum ikeihood). The prototype decoder with which we ran our simuationsisimpementedinmatlab. We further paraeized the decoding procedure and created jobs of 10 6 received sequences, that were processed by singe cores (Opteron, 2.6 GHz). The average ength of received sequences was in a range of 112 and 150 symbos, resuting in an average processing time of 6 hours for the tasks with owest cacuation costs (code parameters ). The ongesttime we needed to compete demutipexing of 10 6 sequences (code parameters ) was stricty beow 24 hours (on a singe core). Future directions Apart from the theoretica considerations we have given in this manuscript, there are ot of future direction starting from this initia point. Some of them are mandatory to enabe an appication in rea bioogica experiments, Figure 5 Exempary distance distribution of a sets of barcode based on watermarks. D e denotes the reative number of code-pairs with an edit distance equa to e. Each barcode set consists of maxima 512 code words, according to the examined outer codes C 1 [F 8, n 1,3] (see aso Figure 4, in bue). The distance distribution is normaized regarding the cardinaity of code words after fitering.

12 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 12 of 14 Figure 6 Simuation resuts for a reaistic decoding [ scenario. ] Estimated decoding error probabiity P e for barcode sets with parameters q 1 k 1 n 1 n 2, consisting of an outer codes C 1 Fq1, n 1, k 1 and inner code C2 : F q1 Z n2 4. A evauated codes are highighted in Figure 4 with identica coors. Each barcode set is tested for different symbo mutation probabiities p mut [.01,.16]. On average, each randomy drawn barcode is embedded in 100 random symbos before decoding. others are modifications of the concepts for extended appications. Let us first address the essentia steps needed to estabish an HMM based decoding in rea experiments: The HMM, as core of the decoding system, is the most sensibe part of the concept. It is mandatory to run experiments to gather reiabe data about a channes, the concepts shoud be used for. From our point of view there is a ack of reiabe data about insertions, deetions and substitution errors for possibe channe modes. For the sequencing appication we assume that different patforms show a variety of sequencing channes, additionay affected by experimenta parameters, e.g. the extend of PCR pre-processing of sampes. To obtain an optima suited decoder, the HMM shoud be adapted to the considered channes. As most of the channes show a correation of errors, more compex HMMs shoud be considered, refecting a channe mode with memory. Finay, it might be possibe to estabish a sef-adaptive agorithm to parametrize the HMMs without any prior knowedge about the ratio of errors in the channe. A suitabe statistic and refined caibration steps shoud be invented. Another important point for estimating the error characteristics is the construction of watermark codes. Exact empirica parameters of the channe coud be incorporated in the design of watermark codes to improve decoding steps, suited for specia channes. Further aspects that coud be considered with the given concepts are the foowing: Aside from the synchronization aspect in this manuscript, it seems very promising to use the maximum ikeihood decoding method for other sequences than watermark codes. Conditioned on good empirica parameters for an underying HMM one coud consider a reiabe detection of barcodes based on the Sequence-Levenshtein distance in a probabiistic way rather than based on sequence aignment. In the presented approach we focused on the discrimination of code words, assuming codes are aways present in inspected sequences. The detection of code words within DNA context is another big issue that shoud be soved for future investigations using an HMM based decoding. Resent research shows that even for sequencing approaches the detection of barcodes is quite chaenging. In [24] they focus on a specific probem with certain setups on the PacBio SMRT patform. Caused by technica reasons, sporadicay barcodes are not present in the sequence data. Another interesting fied of appication of an HMM based sequence detection coud be cona studies, where the sequenced genome coud or coud not contain a predefined sequence, which was introduced in ancestor organisms. Concusion We proposed an adaption of the watermark concept of [20] for DNA barcoding. A generaized channe mode for sequencing and suitabe modifications of the decoder were defined. Moreover, we investigated in a strategy to choose watermark sequences and inner codes in a reasonabe way to enabe barcoding in ine with common experimenta requirements. We provide a code construction, considering the best known inear codes as outer codes and bioogica sequence constraints to fiter for suitabe code words, resuting in an exempary set of 73 different code sets ranging from 12 to 24 nuceotides. The codes are iustrated in a comprehensive scheme, highighting watermark specific parameters as we as the mean edit distance, to give an impression how watermark based barcodes coud be characterized. For a reduced set of codes

13 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 13 of 14 we finay evauated the demutipexing of sequences in a reaistic simuation scenario. Within this in siico evauation we coud show that barcodes based on watermarks can theoreticay be used for mutipexing. It is remarkabe, that even with very short watermark patterns we are abe to reiaby find the barcodes boundaries in order to discriminate different code words with an HMM based decoder. The probabiity of decoding errors, which finay eads to the undesired cross-tak phenomenon was found to be very ow. Other approaches that investigate barcodes with arge (sequence) edit distance [16,19] show significant higher code rates for shorter barcodes, but we have given an entirey different concept that aows for arge scae mutipexing approaches, aso abe to hande insertion and deetion errors. Moreover, we can provide the marker-ess synchronization based on watermarks, to recover the barcode boundaries. This synchronization concept provides an utimate degree of freedom for experimenta sequencing setups as we as for future appications, aso apart from the sequencing context. Additiona fie Additiona fie 1: We provide 73 fies containing a the examined barcodes. Further informations are given, ike: The watermark in decima and nuceic acid notation, as we as the mapping of the inner encoder and the used mapping from integers to nuceic acid Z 4 {A, G, C, T} are given in a preambe. We furthermore give the distance distribution of barcode words. Finay, a barcodes are isted (pus expicit construction via code concatenation). Dropped barcodes, which do not meet the sequence constraints are expicity indicated. Competing interests The authors decare that they have no competing interests. Authors contributions ST initiated to work on a generaized channe mode for the concepts in [20] and inspired an adaption for DNA barcoding. DK accompished the main deveopments on genera methods, the decoder and the simuation framework. DK was aso assisted by the former master student Mahmoud Amarashi, who submitted a remarkabe thesis. DK and ST wrote the manuscript. Both authors read and approved the fina manuscript. Acknowedgements David Kracht is funded by the Deutsche Forschungsgemeinschaft (DFG) under the grant BO 867/30-1 in the priority program SPP Steffen Schober is supported by the DFG grant SCHO 1576/1. The authors ike to thank Mahmoud Amarashi for fruitfu discussions and assistance in preiminary considerations and managing preparative simuations. Received: 7 May 2014 Accepted: 29 January 2015 References 1. Hamady M, Waker JJ, Harris JK, God NJ, Knight R. Error-correcting barcoded primers for pyrosequencing hundreds of sampes in mutipex. Nat Methods. 2008;5(3): Bystrykh LV. Generaized dna barcode design based on hamming codes. PLoS One. 2012;7(5): Hamming RW. Error detecting and error correcting codes. Be Syst Tech J. 1950;29: Krishnan AR, Sweeney M, Vasic J, Gabraith DW, Vasic B. Barcodes for dna sequencing with guaranteed error correction capabiity. Eectron Lett. 2011;47(4): Lin S, Costeo DJ. Error contro coding, vo Engewood Ciffs, New Jersey: Prentice-ha; Frank DN. Barcraw and bartab: software toos for the design and impementation of barcoded primers for highy mutipexed dna sequencing. BMC Bioinformatics. 2009;10(1): Mir K, Neuhaus K, Bossert M, Schober S. Short barcodes for next generation sequencing. PLoS One. 2013;8(12): Gies A, Megécz E, Pech N, Ferreira S, Maausa T, Martin JF. Accuracy and quaity assessment of 454 gs-fx titanium pyrosequencing. Bmc Genomics. 2011;12(1): Carneiro MO, Russ C, Ross MG, Gabrie SB, Nusbaum C, DePristo MA. Pacific biosciences sequencing technoogy for genotyping and variation discovery in human data. BMC Genomics. 2012;13(1): Bragg LM, Stone G, Buter MK, Hugenhotz P, Tyson GW. Shining a ight on dark sequencing: characterising errors in ion torrent pgm data. PLoS Comput Bio. 2013;9(4): Loman NJ, Misra RV, Daman TJ, Constantinidou C, Gharbia SE, Wain J, et a. Performance comparison of benchtop high-throughput sequencing patforms. Nat Biotechno. 2012;30(5): Shendure J, Ji H. Next-generation dna sequencing. Nat Biotechno. 2008;26(10): Yang X, Chockaingam SP, Auru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform. 2013;14(1): Adey A, Morrison HG, Xun X, Kitzman JO, Turner EH, Stackhouse B, et a. Rapid, ow-input, ow-bias construction of shotgun fragment ibraries by high-density in vitro transposition. Genome Bio. 2010;11(12): Qiu F, Guo L, Wen TJ, Liu F, Ashock DA, Schnabe PS. Dna sequence-based bar codes for tracking the origins of expressed sequence tags from a maize cdna ibrary constructed using mutipe mrna sources. Pant Physio. 2003;133(2): Faircoth BC, Genn TC. Not a sequence tags are created equa: designing and vaidating sequence identification tags robust to indes. PLoS One. 2012;7(8): Ashock D, Guo L, Qiu F. Greedy cosure evoutionary agorithms. In: Computationa inteigence, proceedings of the word on congress on, vo. 2. Piscataway: IEEE; p Ashock D, Houghten SK. A nove variation operator for more rapid evoution of dna error correcting codes. In: Computationa inteigence in Bioinformatics and computationa bioogy, CIBCB 05. Proceedings of the 2005 IEEE symposium on. Piscataway: IEEE; p Buschmann T, Bystrykh LV. Levenshtein error-correcting barcodes for mutipexed dna sequencing. BMC Bioinformatics. 2013;14(1): Davey MC, Mackay DJC. Reiabe communication over channes with insertions, deetions, and substitutions. Inf Theory IEEE Trans. 2001;47(2): Haughton D, Baado F. Biocode: Two bioogicay compatibe agorithms for embedding data in non-coding and coding regions of dna. BMC Bioinformatics. 2013;14(1): Haughton D, Baado F. A modified watermark synchronisation code for robust embedding of data in dna. In: Acoustics, speech and signa processing (ICASSP), 2013 IEEE internationa conference on. Piscataway: IEEE; p Kracht D, Schober S. Using the davey-mackay code construction for barcodes in dna sequencing. In: Turbo codes and iterative information processing (ISTC), th internationa symposium on. Piscataway: IEEE; p Buschmann T, Zhang R, Brash DE, Bystrykh LV. Enhancing the detection of barcoded reads in high throughput dna sequencing data by controing the fase discovery rate. BMC Bioinformatics. 2014;15(1): Jukes TH, Cantor CR. Evoution of protein moecuese. Mamm Protein Metab. 1969;3: Minoche AE, Dohm JC, Himmebauer H. Evauation of genomic high-throughput sequencing data generated on iumina hiseq and genome anayzer systems. Genome Bio. 2011;12(11): Rabiner L, Juang BH. An introduction to hidden markov modes. ASSP Mag IEEE. 1986;3(1):4 16.

14 Kracht and Schober BMC Bioinformatics (2015) 16:50 Page 14 of Grass M. Searching for inear codes with arge minimum distance In: Bosma W, Cannon J, editors. Discovering mathematics with magma reducing the abstract to the concrete. Agorithms and computation in mathematics, vo. 19. Heideberg: Springer; p Briffa JA, Schaathun HG. Improvement of the davey-mackay construction. In: Information theory and its appications, ISITA internationa symposium on. Piscataway: IEEE; p Levenshtein VI. Binary codes capabe of correcting deetions, insertions and reversas. Soviet Phys Dokady. 1966;10(8): Forney GD. Concatenated codes, vo. 11. Cambridge: MIT Press; MacWiiams FJ, Soane NJA. The theory of error-correcting codes, vo. 16. Amsterdam, Netherands: Esevier; Submit your next manuscript to BioMed Centra and take fu advantage of: Convenient onine submission Thorough peer review No space constraints or coor figure charges Immediate pubication on acceptance Incusion in PubMed, CAS, Scopus and Googe Schoar Research which is freey avaiabe for redistribution Submit your manuscript at