Object-oriented Bayesian networks for complex forensic DNA profiling problems


A. P. Dawid, University College London; J. Mortera and P. Vicard, Università Roma Tre

September 6, 2005

Abstract

We describe a flexible computational toolkit, based on object-oriented Bayesian networks (OOBNs), that can be used to model and solve a wide variety of complex problems of relationship testing using DNA profiles. In particular this can account for such complicating features as missing individuals, mutation, and null alleles. We show by example how to build a high-level representation of a disputed pedigree problem, and how to incorporate lower-level network models of the relevant complications. We illustrate the use of this toolkit with several examples, including disputed paternity with missing or additional measurements, and criminal identification. Using this technology, we investigate the effects on likelihood ratios of introducing mutation and/or null alleles, and show that these effects can be very substantial even when the underlying perturbations are very small.

Some key words and phrases: Bayesian network, DNA profile, missed allele, mutation, null allele, object-oriented, paternity testing, silent allele.

1 Introduction

DNA parentage testing and forensic identification are currently conducted using DNA profiles, composed of several highly polymorphic short tandem repeat (STR) genetic markers, each having a repertory of alleles ("repeat numbers") that can typically be represented as small integers. The European standard AMPFlSTR® SGM Plus™ system uses ten such STR loci, plus amelogenin. All are on different chromosomes and so segregate independently. Polymerase chain reaction amplification now allows a profile to be obtained from very small amounts of DNA, even a single cell. For an account of the relevant biotechnology see e.g. Buckleton et al. (2004).
The forensic impact of such DNA evidence is most appropriately captured by calculating the corresponding likelihood ratio for comparing a pair of competing hypotheses (Evett and Weir 1998; Morling et al. 2002). However, this can become extremely challenging, both logically and computationally, in the presence of additional complicating features such as missing data on some individuals, mixed trace evidence, mutation, null alleles, etc. For example, in a paternity case the true father may appear to be excluded, when in fact a mutation has taken place, or an allele has not been recorded. We have previously shown (Dawid et al. 2002; Mortera 2003; Mortera et al. 2003; Dawid 2003) how such complex problems can be addressed by structuring and analysing them with the aid of the computational technology of Bayesian networks (BN), also called Probabilistic Expert Systems (PES) (Cowell et al. 1999). These have been implemented in general-purpose software such as Hugin.

Research report No. 256, Department of Statistical Science, University College London.

A recent extension of this BN technology is the object-oriented Bayesian network (OOBN). This allows hierarchical definition and construction of a BN, utilising simple modular building blocks. Additional complexity can easily be introduced by adding new modules or refining existing ones. Object-oriented Bayesian network architectures have been described by Laskey and Mahoney (1997); Koller and Pfeffer (1997); Bangsø and Wuillemin (2000). In this paper we describe a construction set of basic OOBN modules for DNA identification, and show how these can be flexibly combined to handle a wide variety of complex problems. Our networks have been built using Hugin version 6.4.

One specific complicating feature that we address is mutation, which can lead to a child having an allele that appears to have no source in either parent. Another is the possibility that observation of an individual's genotype can be incomplete on account of a null allele, i.e. one that is not detected by the measuring apparatus. We further distinguish between the cases where this property is non-inherited (when we term the allele "missed") or inherited (which we term a "silent" allele). An allele can be missed simply on account of sporadic equipment failure. A silent allele, on the other hand, might be the result of a mutation in the primer binding region, causing DNA amplification failure (Clayton et al. 2004). In this case only one allele is amplified and read, and the individual appears, wrongly, to be homozygous. This feature will be passed, by Mendelian inheritance, to a child, which, consequently, may again wrongly appear homozygous. We can thus easily have false evidence of exclusion, leading us to conclude, wrongly, that the alleged father is not the true father.

We apply our networks to analyse a number of specific forensic cases. We find that properly accounting for a small probability of a silent allele can have a dramatic effect.
In particular, in paternity testing where we can also observe the putative father's brother, this additional information can substantially change the probability of paternity in the presence of silent alleles.

The paper is organized as follows. In §2 we describe a variety of problems of civil and criminal forensic identification, and represent them as high-level disputed pedigree networks. Section 3 shows how DNA identification in such problems can be implemented by treating these as object-oriented Bayesian networks, having further internal structure that can be expressed by means of lower-level networks as described in §4. Modifications of the lower-level networks to incorporate various complicating features, viz. mutation, silent alleles, missed alleles and combinations of these features, are described in §5, §6, §7 and §8, respectively. In §9 we examine some numerical examples to illustrate the effects of taking proper account of the various complications considered. Section 10 presents further examples, showing the sometimes dramatic effect on the paternity ratio of accounting for silent alleles etc. when measurements can be obtained from relatives; while §11 presents a case of criminal identification. Closing remarks are given in §12. Appendix A develops some algebraic formulae for the paternity ratio, allowing for silent alleles, in a simple paternity problem when we can also observe the genotype of the putative father's brother.

2 Pedigrees

We give particular attention to problems of testing paternity, or other family relationships, using DNA profile data. We always start by constructing a single pedigree to represent the relationships, whether known, assumed, or uncertain, between relevant individuals.

2.1 Nuclear family

Figure 1 is a simple pedigree representation for a nuclear family consisting of father f, mother m, and one child c (colour-coded blue for male, pink for female).
Both f and m are instances of type founder, having no parents represented in the pedigree, whereas c is an instance of type child, having both parents represented. Cases where, say, only the individual's father is known or observed can be handled by adding the unknown mother as an additional founder.

Figure 1: Pedigree for nuclear family

2.2 Simple disputed paternity

In the simplest case of disputed paternity, we have an alleged family triplet formed by a disputed child c, its undisputed mother m, and the putative father pf. The hypothesis of interest, $H_0$, is that the putative father is the true father tf of the child; the alternative hypothesis $H_1$ is that the true father is some unobserved alternative father, af, treated as drawn at random from the population. A pictorial representation of this disputed pedigree is shown in Figure 2 (unobserved individuals being shown in a lighter shade). Each of m, pf and af is a founder, while c is a child. To represent the disputed identity of the true father tf we describe him as a query individual, and include an explicit hypothesis node tf=pf? to indicate that we have a choice between pf and af.

Figure 2: Pedigree for simple disputed paternity

We may have DNA profiles from m, c, and pf, constituting evidence E. The impact of this evidence is carried by the likelihood ratio in favour of paternity:

$$\mathrm{LR} = \frac{\Pr(E \mid H_0)}{\Pr(E \mid H_1)}. \tag{1}$$

If we make some standard assumptions (Mendelian segregation, independent markers, known population allele frequencies), this can be calculated by a simple and well-known algebraic formula (Essen-Möller 1938).

2.3 Missing individuals

In more complex cases, DNA profiles may be missing for one or more members of the basic family triplet, but further information may be available in terms of profiles from known relatives. Forensic geneticists have not generally been able to handle such incomplete paternity data rigorously because of the more complex logical and computational analysis required. Figure 3 and Figure 4 relate to the two incomplete paternity cases described and analysed by Dawid et al. (2002). They are variations on Figures 3 and 5 of that paper, extended to incorporate explicitly all relevant individuals, whether observed or unobserved.
In Case 1, as displayed in Figure 3, we have DNA from a disputed child c1, but not from its mother m1 nor from the putative father pf. We do however have DNA from c2, an undisputed

child of pf by a different, unobserved, mother m2, as well as from an undisputed full brother b of pf. The sibling relationship is made explicit by the incorporation of the (unobserved) grandfather gf and grandmother gm, parents of both pf and b. Nodes gf, gm, m1, m2 and af are all instances of founder; pf, b, c1 and c2 are instances of child; and tf is an instance of query. Case 2, displayed in Figure 4, is very similar, except that we now have DNA from both m1 and m2, and from two full brothers, b1 and b2, of pf.

Figure 3: Pedigree for incomplete paternity case 1

Figure 4: Pedigree for incomplete paternity case 2

2.4 Criminal identification

Such genetic networks can also be used in certain criminal cases, as well as for identification of victims of disasters. The problem represented by Figure 5 is based on a real case. A body has been found, burnt beyond recognition, but there is reason to believe it might be that of a missing criminal cr. DNA is available from body, from the wife of cr, and from two children, c1 and c2, of cr and wife. The hypothesis node now indicates that cr might be identical to body; otherwise he is treated as an unobserved man, cr (unobs).

Figure 6 describes a British cause célèbre, the case of James Hanratty (H), who was found guilty of murder and rape and hanged. In 1998 it was decided to apply modern DNA profiling technology to certain items of evidence from the original trial, which had been retained by the police, and a profile, taken to be from the culprit c (either H, or some other person o), was found. In an attempt to prove Hanratty's innocence, his mother m and full brother b offered themselves for DNA profiling. In principle this might have excluded Hanratty, but in fact it did not do so: the associated likelihood ratio in favour of his having left the crime trace was about 440.

Later, Hanratty's body was exhumed, and it was found that his DNA did indeed provide a full match to the crime profile, yielding an updated likelihood ratio of about 2.5 million.

Figure 5: Pedigree for criminal identification case

Figure 6: The case of James Hanratty

3 Object-oriented networks for DNA identification

So far we have merely described the type of problem we wish to address. In order to assess the impact of the evidence in any but the simplest of such problems we shall generally have to make use of sophisticated computational tools. Our approach is based on building Bayesian networks to represent the assumed structure. These then allow insertion of the evidence and propagation of its effect throughout the network. In particular, we can find its impact on the comparison of competing hypotheses, e.g. as to paternity.

3.1 Object-oriented Bayesian networks

Dawid et al. (2002) showed how Bayesian networks can be built to represent problems such as described above, allowing one to obtain the correct likelihood ratio for the hypotheses based on all the available evidence. Here we describe a new, object-oriented, construction for such networks, which greatly simplifies and clarifies the specification process.

Version 6 of the Bayesian network (BN) software system Hugin supports hierarchical definition of a BN, whereby any network can itself contain repeated instances of some other generic (class) network or networks. We use bold face to indicate a network class, and teletype face to indicate an instance or regular node. A class network is like a regular network, except that it can have interface input and output nodes as well as internal nodes. Interface nodes are indicated by a grey outer ring, an input

node having a dotted outline, and an output node a solid outline. Any network can have nodes that are themselves instances of other networks, in addition to regular nodes. Each instance of a class network within another network is displayed as a rounded rectangle, which can be expanded if desired to display its interface nodes; internal nodes remain hidden from view (although they can be accessed in run mode for entering findings or extracting updated probabilities). Arrows between nodes within the same network, or from output nodes to regular nodes in the containing network, represent, in the standard way, the probabilistic or functional dependence of that child node on its parents (Cowell et al. 1999). An input node can have at most one incoming arrow from a node in the containing network (which could itself be an output node of some other subnetwork): this is a binding link, indicating that these two nodes are to be identified. All instances of a class have identical probabilistic structure, save that the table for an input node is a default, being overwritten in any instance where that node is bound to a node of the containing network. Only output nodes can be parents of external nodes (either regular nodes of the containing network, or input nodes of other subnetworks).

This architecture enables a convenient modular approach to problem specification. It is particularly natural and useful for genetic networks, where there is repetition, across different individuals, of such basic structures as Mendelian inheritance or mutation processes. Here we describe a set of simple class networks that can be pieced together as required, much like a child's construction set, to represent a wide variety of problems. A specific application of this modular construction process to a complex problem involving mutation has previously been described by Dawid (2003).

Note that the object-oriented structure is used purely for problem specification and network construction.
Within the software the network is expanded internally into a regular Bayes net (which can be output if desired). Once an object-oriented network has been constructed, it can be used for individual case analysis in essentially the same way as a regular network: see Dawid et al. (2002) for illustrations. After entering evidence, computation and analysis are effected by standard propagation algorithms (Cowell et al. 1999), initiated by means of simple mouse clicks.

3.2 Bayesian networks for DNA identification

The pedigrees displayed in §2 above were constructed in Hugin 6.4. Over and above expressing family relationships, this allows us to describe the operation of genetic inheritance in detail. We do this in the context of forensic DNA profiles, each consisting of measurements on a collection of STR genetic markers (each of which we shall usually simply call a "gene").

An individual's DNA profile consists of measurements on a number of DNA markers. For each such marker we observe a genotype, comprising the unordered pair of values (alleles) for its constituent genes, one maternally and one paternally inherited, although this distinction cannot usually be observed. When these alleles are the same the individual is called homozygous at that marker, else heterozygous. Current technology utilises STR markers, which have a repertory of 8–20 alleles that can commonly be described by a small integer. For present purposes these can be regarded as measured without error, except for the specific possibility of silent or missed alleles, as treated in §6 ff. below.

Each of our networks describes the inheritance of a single marker: distinct markers require distinct networks, but these will differ only in the details of the repertory of alleles, and their population frequencies. On entering the available DNA profile data for a marker we can use the system to calculate likelihood ratios for comparing hypotheses of interest.
Throughout this paper we assume that the networks for different markers are entirely independent (given any of the hypotheses entertained), and calculate an overall likelihood ratio by simply multiplying the values obtained from each component marker network. Note that colouring of nodes is purely for presentational purposes and has no effect on the analysis.
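The multiplication of marker-specific likelihood ratios can be sketched as follows. This is only an illustrative sketch: the function names and the numerical values are hypothetical and are not taken from the paper or from Hugin. Working in log space is a design choice made here to avoid overflow or underflow when many markers are combined.

```python
import math


def combined_likelihood_ratio(marker_lrs):
    """Multiply per-marker likelihood ratios, assuming independent markers.

    The product is accumulated as a sum of logarithms; a zero at any
    marker (an exclusion at that marker) gives an overall ratio of zero.
    """
    if any(lr < 0 for lr in marker_lrs):
        raise ValueError("likelihood ratios must be non-negative")
    if any(lr == 0 for lr in marker_lrs):
        return 0.0
    return math.exp(sum(math.log(lr) for lr in marker_lrs))


def posterior_probability(lr, prior=0.5):
    """Convert a likelihood ratio into a posterior probability of H0.

    With prior = 0.5 the posterior odds equal the likelihood ratio.
    """
    posterior_odds = lr * prior / (1.0 - prior)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical per-marker values, for illustration only.
example_lrs = [2.0, 5.0, 0.5]
overall = combined_likelihood_ratio(example_lrs)
```

With equal prior probabilities for the two hypotheses, `posterior_probability(overall)` gives the posterior probability corresponding to the combined ratio.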

3.3 Nuclear family

In Figure 1, each of the three nodes was defined as an instance of another, generic, class network, having hidden internal structure. Both f and m are instances of a class founder, while c is an instance of a class child. In Figure 7, which is an expanded version of this network, we see that founder contains two output nodes: pg, representing the founder's paternally inherited gene, and mg, representing the maternally inherited gene. As for child, in addition to output nodes pg and mg as for founder it has input nodes fpg, fmg, mpg, mmg, representing respectively the child's father's paternal and maternal genes, and his/her mother's paternal and maternal genes. The arrows into these represent binding links, specifying that these are identical copies of the associated gene nodes in the two parental networks.

Figure 7: Expanded pedigree for nuclear family

The above class networks contain still further hidden structure, defining the nature of the inheritance process and of the observable quantities (genotypes). This will be described in §4 below.

3.4 Simple disputed paternity

In Figure 2, m, pf and af are again instances of class founder, and c an instance of class child, exactly as described above. To model tf we need to construct a new network class query. Some details of this are shown in the partially expanded version of Figure 8. Internally, the output node tfpg is copied from either f1pg or f2pg, according as the Boolean variable tf=f1? is true or false; and similarly for tfmg. Input nodes f1pg and f1mg are bound to output nodes pg and mg of pf, while f2pg and f2mg are bound to output nodes pg and mg of af. Other connexions between the nodes in Figure 2 are made exactly as described in §3.3 above. We also include the explicit hypothesis node tf=pf?, bound to tf=f1?, in the top-level network: this node embodies $H_0$ or $H_1$ according as its value is true or false.
We initially set these as equally likely, so that after propagation of evidence the ratio of their posterior probabilities can be interpreted as a likelihood ratio.

3.5 Further networks

We now have all the ingredients to represent more complex problems, such as described in §2.3 and §2.4. All one has to do is to insert and connect together, in obvious ways determined by the basic pedigree, instances of the already constructed networks founder, child and query, as well as a hypothesis node. Armed with this construction set we can represent and so solve a very wide variety of problems involving DNA profiles and disputed identity.

4 Detailed structure

We now give further details of the structure of the networks constructed above.

Figure 8: Partially expanded pedigree for simple disputed paternity

4.1 Network founder

The internal structure of the network class founder is shown in Figure 9. The internal nodes pgin and mgin represent the random paternally and maternally inherited genes of the founder, and are themselves specified as instances of a class gene (not shown here), which consists of a single output node, also called gene. Associated with gene in this simple network is the appropriate repertory of allele values and their population frequencies. For our illustrations in this paper we use forensic marker VWA, having alleles ranging from 12 to 22 and probability table as given in Table 1. These are Austrian-German population allele frequencies (we are grateful to B. Brinkmann for supplying the data for Table 1).

Figure 9: Network founder

The output nodes pg and mg of founder are specified as identical copies of the internal gene node of pgin and mgin, respectively. Such duplication is necessary only because of limitations of Hugin, which currently does not allow a node to be both an input and an output node, nor an arrow to cross more than one level of the hierarchy.

Figure 10: Network genotype

Finally the internal node gt of founder is an instance of the class genotype, as displayed in Figure 10. Here gtmin and gtmax are defined (by means of Hugin expressions) as the minimum and maximum of the two input gene nodes pg and mg, and represent the observable genotype of an individual, being used for entering such genotype evidence when available; we colour such

an observation node in green. The input nodes pg and mg of genotype are bound to nodes pg and mg of founder.

4.2 Network child

The internal structure of network class child is displayed in Figure 11.

Figure 11: Network child

On the paternal (left-hand) side, the input nodes fpg and fmg of child are bound to the input nodes pg and mg of an instance fmeiosis of a network class mendel. This in turn has an output node cg, which is then copied identically to the output node pg of child (again, such duplication would ideally be avoided but at present cannot be). An identical structure holds for the maternal (right-hand) side of child. Finally pg and mg are fed into an instance gt of genotype, exactly as in founder, again allowing input of observed genotype data.

Figure 12 shows the internal structure of mendel. Its internal Boolean node cg=pg? is modelled as having a 50% chance of being true, in which case output node cg is identical with input node pg; else, when cg=pg? is false, cg is identical with input node mg. The effect is thus to transmit, at random, just one of the two parental genes, in accord with Mendelian segregation.

Figure 12: Network mendel

4.3 Network query

Figure 13: Network query

The internal structure of network query is shown in Figure 13. This contains only the input and

output nodes as described in §3.4 above. When tf=f1? is true, tfpg copies f1pg and tfmg copies f1mg; when false, tfpg copies f2pg and tfmg copies f2mg.

4.4 Analysis

For case analysis the pedigree network describing a problem is used essentially as described in Section 2.2 of Dawid et al. (2002): each observed genotype is entered (as gtmin and gtmax) inside the instance gt of genotype within the relevant instance of founder or child. Then probability propagation is performed by the software, following which we calculate, as the ratio of the updated probabilities at node tf=pf?, the contribution to the likelihood ratio in favour of paternity based on these observations at this marker. The global likelihood ratio is obtained by multiplication of these contributions across all the markers measured.

4.5 Super-networks

We can even treat a top-level network, such as triplet, as a class, and create one instance of it for each marker. Since Hugin does not currently allow modification of the states of a node when reusing a network, we must first set up a single repertory of coded states in gene, and specify appropriate correspondences with the actual alleles of the marker under consideration; the allele frequencies are likewise edited appropriately for each marker. The resulting marker networks can then be analysed separately, and their several likelihood ratios multiplied together.

Alternatively all the single-marker networks can be explicitly combined as instances within a single super-network, with the node tf=pf? (now made into an input node) in each instance bound to a new top-level hypothesis node tf=pf?. Then after entering the evidence on all individuals at all markers, and propagating, we can obtain directly the global likelihood ratio from that hypothesis node.
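The single-marker analysis of §4.4 can be mimicked, for the simple disputed paternity pedigree, by brute-force enumeration in place of the propagation algorithm used by the software. The sketch below is an illustration only: the allele frequencies in `FREQ` are hypothetical (they are not the VWA frequencies of Table 1), and the function names are chosen here to echo the network classes founder, mendel, genotype and query.

```python
from itertools import product

# Hypothetical allele frequencies for a three-allele marker (illustration only).
FREQ = {14: 0.2, 15: 0.3, 16: 0.5}


def founder_genes():
    """List ((pg, mg), probability) for a founder: two independent draws."""
    return [((pg, mg), FREQ[pg] * FREQ[mg]) for pg, mg in product(FREQ, repeat=2)]


def mendel(pg, mg):
    """Transmit one of the two parental genes, each with probability 1/2."""
    return [(pg, 0.5), (mg, 0.5)]


def genotype(pg, mg):
    """Observable genotype: the unordered pair, stored as (gtmin, gtmax)."""
    return (min(pg, mg), max(pg, mg))


def evidence_probability(tf_is_pf, obs_m, obs_pf, obs_c):
    """Pr(E | hypothesis), summing over all unobserved gene configurations."""
    total = 0.0
    founders = founder_genes()
    for (m, p_m), (pf, p_pf), (af, p_af) in product(founders, repeat=3):
        if genotype(*m) != obs_m or genotype(*pf) != obs_pf:
            continue
        tf = pf if tf_is_pf else af  # the role played by the query network
        for (c_pg, p1), (c_mg, p2) in product(mendel(*tf), mendel(*m)):
            if genotype(c_pg, c_mg) == obs_c:
                total += p_m * p_pf * p_af * p1 * p2
    return total


def likelihood_ratio(obs_m, obs_pf, obs_c):
    """Likelihood ratio in favour of paternity at a single marker."""
    numerator = evidence_probability(True, obs_m, obs_pf, obs_c)
    denominator = evidence_probability(False, obs_m, obs_pf, obs_c)
    return numerator / denominator


# Mother (14, 15), child (15, 16), putative father homozygous (16, 16).
lr = likelihood_ratio((14, 15), (16, 16), (15, 16))
```

In this example the child's paternal allele must be 16, which the putative father transmits with certainty and a random man transmits with probability 0.5, so the ratio is 2. Enumeration of this kind grows exponentially with the size of the pedigree, which is why propagation algorithms are used in practice.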
Such super-networks are not ideally suited to the propagation algorithm used by Hugin, since the links to the top-level hypothesis node can create very large cliques, and thus severe computational inefficiencies. External combination of marker-specific calculations is preferable whenever (as in the cases considered here) this is possible. However in some more complex problems, e.g. those involving quantitative analysis of mixed samples (Cowell et al. 2004), there are additional quantities common to all markers, and then such a super-network may be the only way to proceed.

5 Mutation

It is easy to modify networks such as the above to account for possible mutation of genes in transmission from parent to child. We distinguish between a child's original gene cog, identical with one of the parent's own genes, and the actual gene cag available to the child, which may differ from cog because of mutation.

Mutation network mut. We must first construct a new class network mut to model the relevant mutation process. This network should have og as an input node, and ag as an output node.

Revised network mendel. We also modify the class mendel of Figure 12 as shown in Figure 14, renaming cg to cog (now made into an internal node) and binding this to input node og of an instance cag of mutation network mut. The output node ag of cag is then duplicated to supply the output node cg of mendel. The overall effect is that the output of mendel now represents the result of mutation acting on top of Mendelian segregation.

As a very simple example, the network mut shown in Figure 15 implements the proportional mutation model: the actual gene ag is either identical to the original gene og, or else replaces that

by a new gene sampled randomly from the population distribution, obtained from the output of an instance otherg of gene. The choice between these is made according to the outcome of a biased coin toss bcoin.

Figure 14: Revised network mendel, incorporating mutation

Figure 15: Network mut for proportional mutation model

For some mutation models we might wish to allow the mutation process to vary, according as it affects the paternal or the maternal line; in this case we need to incorporate a further Boolean input node p or m? in mut to specify the parental line. We then duplicate this in mendel, and bind these nodes together, as shown in Figure 16; and further modify child as in Figure 17, assigning probabilities 1 and 0 appropriately at nodes pline and mline (each bound to input node p or m? in the relevant instance fmeiosis or mmeiosis of mendel) to specify the relevant parental line.

Figure 16: Revised network mendel, incorporating mutation varying with parental line

For more complicated mutation models there may be further internal structure, and/or adjustable parameters, in mut. As an example, Figure 18 represents a mixed mutation model (Dawid et al. 2001; Vicard and Dawid 2004). This chooses, as ag, either the original gene og, or a mutated gene, represented by an instance mutg of the class mutg of Figure 19. The choice is controlled by a coin toss bcoin, with bias determined by parameters xi, related to the overall mutation rate, and rho, which can be set to allow for differential mutation rates in the male and female lines. The mutated gene mutg is itself obtained by selecting between the outputs of the

proportional mutation model propmutg, an instance of gene, and that of the single-step mutation model onemutg, an instance of onestep (not shown here). A parameter h determines the selection probability. For further details of this model see Dawid (2003). (Our network mut corresponds to the network ag of Dawid (2003), while our parameter xi is twice the parameter lambda used there.)

Figure 17: Revised network child, incorporating mutation varying with parental line

Figure 18: Network mut for mixed mutation model

Figure 19: Network mutg for mixed mutation model

If we were only concerned with fixed values of the parameters, we could omit the parameter nodes and simply insert appropriate values into the conditional probability tables of the coin toss or other nodes that they affect. In that case we could proceed exactly as described above for the proportional mutation model. However, exploration of sensitivity to varying parameter values would then require direct editing of these conditional probability tables. To avoid this we have inserted explicit parameter nodes h, xi and rho, each having a discrete collection of numerical values we wish to experiment with, and specify the coin-toss probabilities etc. as algebraic expressions in these parameters. Since typically several instances of a network class containing such a parameter node will occur in the overall network, we need to ensure that any value set for the parameter is transferred to all those instances. The traverse instance feature of Hugin 6.4 enables this to be done easily.

Once an appropriate network mut has been built, and mendel (and possibly also child) modified as described above, pedigree networks constructed as in §2 will now automatically incorporate the additional possibility of mutation. No other changes are required.
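The proportional mutation model described above can be written out as a small transition table. The sketch below is illustrative only: the allele frequencies in `FREQ` and the coin-toss probability `mu` are hypothetical values, not parameters from the paper. It also checks numerically that this model leaves the allele frequency distribution unchanged after one round of mutation.

```python
# Hypothetical allele frequencies (illustration only, not Table 1).
FREQ = {14: 0.2, 15: 0.3, 16: 0.5}


def proportional_mutation(og, mu):
    """Return {ag: Pr(ag | og)} for the proportional mutation model.

    With probability 1 - mu the actual gene ag copies the original gene og;
    with probability mu it is replaced by a fresh draw from the population
    distribution (which may happen to coincide with og).
    """
    return {
        ag: (1.0 - mu) * (1.0 if ag == og else 0.0) + mu * FREQ[ag]
        for ag in FREQ
    }


def distribution_after_mutation(mu):
    """Marginal distribution of ag when og is drawn from FREQ."""
    after = dict.fromkeys(FREQ, 0.0)
    for og, p_og in FREQ.items():
        for ag, p_ag in proportional_mutation(og, mu).items():
            after[ag] += p_og * p_ag
    return after


# Hypothetical mutation probability, chosen only for illustration.
after = distribution_after_mutation(mu=0.01)
```

Here `after` agrees with `FREQ` up to rounding, whatever value of `mu` is used, because the replacement gene is drawn from the same population distribution as the original.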

5.1 Non-stationarity

A stationary mutation model is one for which the allele frequency distribution of a gene after mutation is identical with its distribution before mutation. The proportional mutation model described above is stationary, but in general the mixed mutation model is not. With non-stationary mutation, allele frequencies will change slightly from one generation to the next, and the very concept of a population allele frequency distribution dissolves into meaninglessness. A consequence of this is that we will get slightly different answers according as, say, our pedigree network does or does not include parents for node pf. For example, if we were to use the pedigree of Figure 3 to analyse the simple paternity problem of Figure 2, by inserting findings at m, pf and c, we would get a slightly different answer simply in view of the fact that a (now unobserved) brother is represented in the network. Various workarounds could be used to avoid this, but we have not felt it worthwhile following this route, on the grounds that there is no logically compelling reason to prefer raw over once-mutated, twice-mutated, ..., frequencies, and the numerical differences will in any case be small (vanishing completely for a stationary mutation process).

6 Silent alleles

6.1 Background and assumptions

A null or drop-out allele is one that is not recorded by the equipment used. When this can happen, what appears to be a homozygous genotype at some marker may not be so: an alternative explanation is that we are seeing just one band of a heterozygous genotype, the other band being null. This phenomenon will clearly affect the evidential interpretation of certain patterns of DNA profiles. Several papers in the literature have dealt with genetic aspects of drop-out and how to allow for it in the analysis: Gill et al. (2000) develop formulae for the likelihood ratio, while DNA·VIEW, a programme developed by C. Brenner, contains modules to perform the calculations.
This phenomenon can occur for a number of reasons. One possibility is run-off, where the measuring apparatus used is simply unable to record certain allele values. Another is a mutation in the primer binding site, near to the target marker, leading to failure of the amplification process. In either of these cases a null allele will be inherited exactly like any other allele, distinct markers still being unlinked. We term such an inherited null allele silent. We construct networks to model and analyse this situation in §6.2 below.

Clayton et al. (2004) found that some of the apparent mutations detected in paternity triplets were due to primer binding site mutations. They also suggest that such a mutation is likely to be preferentially associated with some specific allele or alleles of the target marker. For simplicity and demonstration purposes we have not taken account of this association, supposing instead that every allele has the same probability of becoming silent. Thus the models developed and the numerical values assumed here should be considered as purely illustrative: they are not recommendations for use in forensic laboratory casework.

Another possible explanation for a null allele is sporadic failure of the apparatus to record the correct allele value. In this case the property is not inherited; we refer to such a null allele as missed. We describe how to handle this situation in §7.

6.2 Networks for inherited silent alleles

We can construct Hugin networks to handle problems with inherited silent alleles by making minor modifications to the basic building blocks: specifically, to gene and genotype. We now make explicit use of the dummy value 99 to represent silence. Wherever any node in any network represents a gene, its state-space must be augmented with this value (in fact, to avoid further editing we already included this in our previous networks, giving it probability 0 in network gene).

Revised network gene

The simple one-node network gene is now renamed gene0, and an instance gene0 of it is included in the new gene network shown in Figure 20. This has output node gene, equal to the output of gene0 unless the binary node silent takes the value 1, in which case gene is set to the silent value 99. The silence indicator silent is generated from Binomial(1, pr(silent)), depending on parameter node pr(silent): we have made this a discrete numerical node, so that we can vary its value (we consider values , , , , 0.001, and 0.01). The overall effect is that, with probability pr(silent), any original allele value is transformed into a silent allele. The probability of a silent allele is thus pr(silent), while initial real allele frequencies are multiplied by 1 − pr(silent). A silent allele is inherited just like any other allele.

Figure 20: Network gene for founder gene, incorporating silent allele

Revised network genotype

The network of Figure 10 for class genotype also needs to be modified, as shown in Figure 21, to account for the fact that silent alleles cannot be seen in observed genotypes. Nodes pg, mg and gtmin are defined as before. Previous node gtmax is renamed gtmax0, while new output node gtmax is equal to gtmax0 unless this has value 99, in which case it is set equal to gtmin, so mimicking a homozygous genotype. If both alleles are silent so will be both gtmin and gtmax, and nothing will be seen: an event which, though rare, has been known to occur (Clayton et al. 2004, Figure 1).

Figure 21: Network genotype, incorporating silent allele

Again, once we have made the above replacements of lower level networks, we can simply reuse top-level pedigree networks such as those in §2, now automatically incorporating the possibility of silent alleles into these problems.

7 Missed alleles

Modelling of sporadically missing alleles is just as straightforward. These only affect the way in which a genotype is observed.
We now use 99 to represent an unobserved missed value.

Observed allele network geneobs

This new network, displayed in Figure 22, is very similar to that for gene in Figure 20. Node pr(missed) is a discrete numerical parameter node allowing us to set various values for the probability that an allele is missed (supposed independent of its

value). The binary missingness indicator missed has a Binomial(1, pr(missed)) distribution. Input node gene0 represents an actual allele value, while output node gene, the possibly missed gene, replaces this by 99 if missed takes value 1.

Figure 22: Network geneobs for observed gene, incorporating missed allele

Revised network genotype

We also revise the network genotype of Figure 10, as in Figure 23. New nodes pgobs and mgobs are instances of geneobs, thus transforming pg and mg according to the missingness process. Nodes gtmin, gtmax0 and gtmax are obtained from the resulting, possibly missing, alleles exactly as described in §6.2.

Figure 23: Network genotype for observed genotype, incorporating missed allele

Yet again, existing pedigree networks can be reused, so as now to allow for missing alleles.

8 Combination

We can readily combine any or all of the complicating features so far introduced, thus allowing for the possible simultaneous existence of inherited silent alleles, sporadic missed alleles, and mutation; all within a wide variety of top-level pedigree networks incorporating further complications such as missing individuals. We simply include all the appropriate new and revised networks needed for the various extensions (when combining both silence and missingness, treated as operating independently, we use the network genotype constructed for missingness). Further modifications can generally be introduced quite easily: for example, when combining mutation and silence we have chosen to modify mendel, adding an extra arrow from cog to cg, to ensure that mutation out of or into a silent allele is not allowed. In all circumstances the identical pedigree networks can be used.

We have created a number of directories containing the appropriate lower-level networks for each combination of the above features.
A pedigree network describing a new problem can be constructed in any one of these directories, using instances of founder, child and query, and then simply dropped into any other, for immediate incorporation of the relevant additional features.
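The behaviour of the lower-level networks described in §6.2 and §7 can be illustrated outside Hugin. The following is a minimal Python sketch, not the Hugin networks themselves: the function names simply mirror the network names gene, geneobs and genotype, the helper founder_genotype_dist and the two-allele frequency table are hypothetical, and real allele values are assumed to be smaller than the dummy value 99 so that min and max behave like gtmin and gtmax0.

```python
from itertools import product

NULL = 99  # dummy state used in the text for a silent or missed allele


def gene(freqs, pr_silent):
    """Founder gene: real frequencies scaled by 1 - pr(silent), plus state 99."""
    dist = {a: f * (1.0 - pr_silent) for a, f in freqs.items()}
    dist[NULL] = pr_silent
    return dist


def geneobs(allele, pr_missed):
    """Observed gene: the actual allele is replaced by 99 with prob pr(missed)."""
    if allele == NULL:  # a silent allele stays unseen whether or not it is missed
        return {NULL: 1.0}
    return {allele: 1.0 - pr_missed, NULL: pr_missed}


def genotype(pg, mg):
    """Observed genotype (gtmin, gtmax) from paternal and maternal genes."""
    gtmin, gtmax0 = min(pg, mg), max(pg, mg)
    gtmax = gtmin if gtmax0 == NULL else gtmax0  # one visible band mimics a homozygote
    return gtmin, gtmax


def founder_genotype_dist(freqs, pr_silent, pr_missed):
    """Distribution of a founder's observed genotype, by full enumeration."""
    g = gene(freqs, pr_silent)
    out = {}
    for (pg, p1), (mg, p2) in product(g.items(), repeat=2):
        for (po, q1), (mo, q2) in product(geneobs(pg, pr_missed).items(),
                                          geneobs(mg, pr_missed).items()):
            gt = genotype(po, mo)
            out[gt] = out.get(gt, 0.0) + p1 * p2 * q1 * q2
    return out


# Hypothetical two-allele marker, for illustration only.
dist = founder_genotype_dist({14: 0.3, 16: 0.7}, pr_silent=0.001, pr_missed=0.01)
```

Here dist[(14, 14)] collects true homozygotes together with heterozygotes whose other allele is silent or missed, and dist[(99, 99)] is the probability that nothing is seen at all. Silence is applied once, at the founder gene, and would be inherited; missingness is applied afresh each time a genotype is observed.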

9 Examples

We now illustrate the separate and combined effects of allowing for silent alleles, missed alleles, and mutation. All examples refer to marker VWA, with population gene frequencies as given in Table 1. We use the simple paternity pedigree network of Figure 2, extended, as described in §8, to allow for all the additional complications simultaneously. A mixed mutation model is assumed, with parameter values set to h = 0.9, rho = 0.5 and xi = (corresponding to a combined mutation rate of τ = ). When no mutation is allowed we set xi = 0.

After propagating the evidence, node tf=pf? contains the posterior probabilities of paternity and non-paternity. We set the prior probability of paternity to 0.5, so that we can interpret the ratio of the resulting (purely nominal) posterior probabilities as the likelihood ratio in favour of paternity, which we henceforth term the paternity ratio.

In our examples both the child's and the putative father's genotypes are apparently homozygous. It is easy to see that (in the absence of mutation) if either the child or the putative father were heterozygous it would make no difference to introduce the possibility of a silent or a missed allele. Since a silent allele is inherited while a missed allele only affects the recorded genotype, allowing for silence will typically have a much greater effect than allowing for missingness.

Example 9.1 The data are:

m : {12, 20}    pf : {18, 18}    c : {12, 12}.

Note that the child's observed allele 12 is extremely rare, having frequency $p_{12} = 0.03\%$; the mother's other allele 20 is somewhat less rare, with $p_{20} = 1.4\%$; while the putative father's observed allele 18 is common, with $p_{18} = 22\%$.

Table 2 shows the combined effects of silence and missingness with no mutation. Comparing the column pr(missed) = 0 with the row pr(silent) = 0, we see that the effect of silence alone is roughly 5 times that of missingness alone.
On passing from pr(silent) = 0 to pr(silent) = , the value estimated by the American Association of Blood Banks, the paternity ratio goes from 0 to 3.53: instead of the evidence ruling the putative father out, when we introduce a small possibility of silence it actually favours paternity. Indeed, whenever pr(silent) all entries in the table give a paternity ratio greater than 1, favouring paternity (the effect of incorporating missingness in addition to silence being to reduce the paternity ratio slightly). Intuitively this is because, as soon as the probability of silence is comparable with that of allele 12, the child's apparently homozygous genotype is well explained as being truly heterozygous {12, silent}. This in turn is readily explained under paternity if the putative father also has a silent allele. A similar explanation based on a (non-inherited) missed allele is however much less convincing.

Table 3 shows the combined effect of silence, missingness and mutation. In the absence of silence or missingness, a 6-step mutation would be required to explain the data under paternity, and this is highly improbable under our mixed mutation model. Comparing Table 3 with Table 2 one in fact observes a negligible additional effect of allowing for mutation.

Example 9.2 Now consider data:

m : {12, 20}    pf : {13, 13}    c : {12, 12}.

The mother's and child's genotypes are the same as in Example 9.1, while the putative father's observed allele is now the relatively rare allele 13, with $p_{13} = 0.2\%$.

The combined effects of silence and missingness are displayed in Table 4. The impact of introducing the possibility of silence is overwhelming: for example, when pr(silent) = 0.01% the paternity ratio is 125. Compared with Example 9.1, the greater rarity of the putative father's observed allele now makes the presence of a silent allele still more plausible. However the sheer magnitude of this effect is perhaps unexpected.
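The silence-only effect in Examples 9.1 and 9.2 can be cross-checked without a Bayesian network by summing over all unobserved genes. The sketch below is a brute-force enumeration in Python, not the Hugin propagation used for the tables. It assumes no mutation, no missed alleles, independent founder genes, and an unrelated random man as the alternative father. The frequencies are the rounded values quoted in the text rather than the exact entries of Table 1, so the output need not match the tabulated paternity ratios exactly. All function names are hypothetical.

```python
from itertools import product

SILENT = 99
OTHER = 0  # lumps together every allele not listed explicitly


def founder_dist(freqs, pr_silent):
    """Founder gene distribution with a silent state; unlisted alleles are lumped."""
    dist = {a: f * (1.0 - pr_silent) for a, f in freqs.items()}
    dist[OTHER] = (1.0 - sum(freqs.values())) * (1.0 - pr_silent)
    dist[SILENT] = pr_silent
    return dist


def observe(g1, g2):
    """Observed genotype: a silent allele is hidden, so {a, silent} shows as {a, a}."""
    seen = sorted(a for a in (g1, g2) if a != SILENT)
    return (seen[0], seen[-1]) if seen else None


def likelihood(dist, m_obs, pf_obs, c_obs, paternity):
    """Pr(mgt, pfgt, cgt | hypothesis), summing over all unobserved genes."""
    total = 0.0
    for m1, m2, f1, f2 in product(dist, repeat=4):
        if observe(m1, m2) != m_obs or observe(f1, f2) != pf_obs:
            continue
        weight = dist[m1] * dist[m2] * dist[f1] * dist[f2]
        # Child's paternal gene: from pf under paternity, otherwise a random gene.
        paternal = [(f1, 0.5), (f2, 0.5)] if paternity else list(dist.items())
        for cm in (m1, m2):  # each maternal gene is transmitted with probability 1/2
            for cp, pr_cp in paternal:
                if observe(cp, cm) == c_obs:
                    total += weight * 0.5 * pr_cp
    return total


def paternity_ratio(freqs, pr_silent, m_obs, pf_obs, c_obs):
    dist = founder_dist(freqs, pr_silent)
    return (likelihood(dist, m_obs, pf_obs, c_obs, True)
            / likelihood(dist, m_obs, pf_obs, c_obs, False))


# Rounded VWA frequencies quoted in the text (assumption: Table 1 is more precise).
FREQS = {12: 0.0003, 13: 0.002, 18: 0.22, 20: 0.014}


def example_9_1(pr_silent):
    return paternity_ratio(FREQS, pr_silent, (12, 20), (18, 18), (12, 12))


def example_9_2(pr_silent):
    return paternity_ratio(FREQS, pr_silent, (12, 20), (13, 13), (12, 12))
```

Under these assumptions the enumeration for both data sets reduces to $p_s / \{(q_f + 2p_s)(q_{12} + p_s)\}$, where $p_s$ = pr(silent), $q_a = p_a(1 - p_s)$ and $f$ is the putative father's observed allele. This makes the intuition above explicit: the ratio is 0 at $p_s = 0$ and grows quickly once $p_s$ is comparable with $p_{12}$, the more so when $q_f$ is small.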

The effect of missingness alone is, however, similar to that in Example 9.1. The additional effect of allowing for missingness over that of silence is to decrease the paternity ratio, markedly so for pr(missed) .

The effect of further incorporating mutation can be seen in Table 5. Mutation by itself (pr(silent) = pr(missed) = 0) has quite an impact, giving a paternity ratio of 3.79; intuitively this is because paternity can now be well explained by a 1-step mutation, and this is quite probable under the mixed model. This effect of mutation can still be seen when missingness is introduced, but essentially disappears as soon as silence is allowed.

Example 9.3 The data are:

m : {16, 16}    pf : {18, 18}    c : {18, 18}.

The undisputed mother is apparently incompatible with the child: she must therefore have a missed allele, or have transmitted a silent or mutated allele to her child. Given that $p_{18} = 21\%$ is much larger than any value considered for pr(silent) or pr(missed), we can be pretty sure, first that both pfgt and cgt are truly homozygous, and then that the child inherited allele 18 from its father. This has probability close to 1 under paternity, and to $p_{18} =$ under non-paternity. Correspondingly the paternity ratio is close to $1/p_{18}$ for any combination of the above explanations. This can be confirmed by calculations (not shown), using our networks.

10 Additional individuals

Suppose that, in a simple disputed paternity case, the genotype bgt of the putative father's full brother b has been observed, in addition to those of the basic triplet m, pf and c. The relevant pedigree is as shown in Figure 24.

Figure 24: Pedigree for paternity testing with additional individual

Under simple Mendelian segregation this additional observation is independent of paternity status given the triplet evidence, and so makes no difference to the impact of that evidence.
However, once we allow for a silent or missed allele the paternity ratio can be affected by knowledge of the brother's genotype, because it can help to distinguish whether the putative father is a true homozygote, or is truly heterozygous but with a silent or missed allele.

The likelihood ratio in favour of paternity $P$ based on just the triplet data $D := (\text{mgt}, \text{pfgt}, \text{cgt})$ is

$$L_D := \frac{\Pr(D \mid P)}{\Pr(D \mid \overline{P})}, \tag{2}$$

where $\overline{P}$ denotes non-paternity. The impact of the additional information carried by the brother's data $B := (\text{bgt})$ is measured by

$$L_B := \frac{\Pr(B \mid D, P)}{\Pr(B \mid D, \overline{P})}, \tag{3}$$

and the overall paternity ratio, taking account of both $D$ and $B$, is

$$LR := L_D \times L_B. \tag{4}$$

We can calculate $L_B$ directly by algebraic methods: this is developed in Appendix A. Alternatively we can compute $L_D$ and $LR$ by numerical propagation, and thus derive $L_B$ from (4). Our computations were made using the pedigree network of Figure 24, together with appropriate lower-level networks to incorporate the effects of silence or missingness (we do not consider mutation here).

Example 10.1 To illustrate the possible effect of the additional measurement $B$ on the paternity ratio, we consider an example where the triplet evidence $D$ is as follows:

m : {12, 15}    pf : {14, 14}    c : {12, 12}.

The putative father and child are both apparently homozygous, in a way that would be inconsistent with paternity under Mendelian segregation. However pf could still be the true father if he had a silent allele that he might have passed to the child, or if one of his alleles was missed. Observation of his brother's genotype can help to shed light on these possibilities.

Silent alleles. Table 6 displays the paternity ratio, allowing for silent alleles. The second column gives the paternity ratio $L_D$ based on the triplet data only. The later columns show the additional factor $L_B$ for various possible observations on the brother's genotype bgt. The behaviour of this term is determined by its relationship to the putative father's observed genotype pfgt.

In columns 3 and 4 we consider bgt = {16, 20} and bgt = {12, 17}: b is heterozygous, and does not share any allele (and in particular, not a silent allele) with pf. As is verified in Case 1 (a) in Appendix A, the additional observation $B$ makes no difference whatsoever in this case: $L_B = 1$ for all values of pr(silent).

However, when b is heterozygous but shares an allele with pf, the paternity ratio is reduced by this additional knowledge. Intuitively this is because it becomes more likely that pf is a true homozygote, and hence excluded from paternity.
This effect is seen in columns 5 and 6 of Table 6 for the cases bgt = {12, 14} and bgt = {14, 17}, so that b and pf share allele 14. The fact that the additional paternity ratio factor is close to 0.5 is explained by the analysis of Case 1 (b) in Appendix A, since in our example we have $q_{14} \approx p_{14} =$ , considerably larger than the various values considered for pr(silent). That analysis also explains why the results are the same in both these columns.

Column 7 refers to the case bgt = pfgt (= {14, 14}). Since b could now have a silent allele, the additional data do little to distinguish whether or not pf is a true homozygote. Indeed we see that the extra factor $L_B$ is very close to 1, and so essentially uninformative. This is explained in Case 2 (a) in Appendix A.

Finally we consider the case that b is apparently homozygous, but with bgt different from pfgt. With such a configuration pf and b might still share a silent allele, and the additional observation $B$ therefore renders it more probable that pf is a false homozygote, who could have passed a silent allele down to the child. As a consequence the paternity ratio is increased. In column 8 the brother exhibits a relatively common allele, bgt = {16, 16}, where $p_{16} \approx 20\%$. Even though this renders him likely to be a true homozygote, the effect on the paternity ratio of the uncertainty introduced by this extra information is to introduce a factor of around 6 for small $p_s$, reducing somewhat as $p_s$ increases. In column 9 we take a very rare allele, bgt = {12, 12}, where $p_{12} = 0.03\%$. The increase in the paternity ratio is now dramatic. The values here reflect the analysis of Case 2 (b) in Appendix A, where it is shown that the additional effect is particularly strong when the allele of the brother is rare, but the silent allele is rarer still. The limiting value of $L_B$ as $p_s \to 0$ here is , though to come close to this value $p_s$ needs to be less than . The overall paternity ratio $LR = L_D \times L_B$ achieves a maximum value of at $p_s =$ .
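The qualitative behaviour of $L_B$ for silent alleles can be explored by brute force, as an alternative to the algebra of Appendix A or to network propagation. The following Python sketch enumerates the genes of the putative father's parents and derives $L_B$ as $LR / L_D$, as in (4). It assumes no mutation and no missed alleles, and it uses the fact that, with m : {12, 15} and c : {12, 12}, the child's paternal gene must be 12 or silent (the mother's contribution is common to both hypotheses and cancels). The frequencies for alleles 12, 16 and 20 are the rounded values quoted in the text; those for alleles 14 and 17 are hypothetical placeholders, and all function names are illustrative.

```python
from itertools import product

SILENT = 99
OTHER = 0  # lumps together every allele not listed explicitly


def founder_dist(freqs, pr_silent):
    """Founder gene distribution with a silent state; unlisted alleles are lumped."""
    dist = {a: f * (1.0 - pr_silent) for a, f in freqs.items()}
    dist[OTHER] = (1.0 - sum(freqs.values())) * (1.0 - pr_silent)
    dist[SILENT] = pr_silent
    return dist


def observe(g1, g2):
    """Observed genotype: a silent allele is hidden, so {a, silent} shows as {a, a}."""
    seen = sorted(a for a in (g1, g2) if a != SILENT)
    return (seen[0], seen[-1]) if seen else None


def joint(dist, pf_obs, paternal_ok, b_obs, paternity):
    """Pr(pfgt, cgt[, bgt] | hypothesis), up to a factor common to both hypotheses."""
    pr_random = sum(dist[a] for a in paternal_ok)  # random man's transmitted gene
    total = 0.0
    for gf1, gf2, gm1, gm2 in product(dist, repeat=4):  # genes of pf's parents
        weight = dist[gf1] * dist[gf2] * dist[gm1] * dist[gm2] / 16.0
        for pf in product((gf1, gf2), (gm1, gm2)):
            if observe(*pf) != pf_obs:
                continue
            if paternity:
                pr_child = sum(0.5 for a in pf if a in paternal_ok)
            else:
                pr_child = pr_random
            for b in product((gf1, gf2), (gm1, gm2)):
                if b_obs is None or observe(*b) == b_obs:
                    total += weight * pr_child
    return total


def brother_effect(freqs, pr_silent, pf_obs, paternal_ok, b_obs):
    """Return (L_D, L_B), with L_B obtained as LR / L_D."""
    dist = founder_dist(freqs, pr_silent)
    l_d = (joint(dist, pf_obs, paternal_ok, None, True)
           / joint(dist, pf_obs, paternal_ok, None, False))
    lr = (joint(dist, pf_obs, paternal_ok, b_obs, True)
          / joint(dist, pf_obs, paternal_ok, b_obs, False))
    return l_d, lr / l_d


# p12, p16, p20 as quoted (rounded) in the text; p14 and p17 are hypothetical.
FREQS = {12: 0.0003, 14: 0.10, 16: 0.20, 17: 0.25, 20: 0.014}


def l_b(b_obs, pr_silent=1e-4):
    return brother_effect(FREQS, pr_silent, (14, 14), {12, SILENT}, b_obs)[1]
```

With these inputs, l_b((16, 20)) equals 1 up to rounding error, l_b((12, 14)) and l_b((14, 17)) coincide at a value a little above 0.5, and l_b((12, 12)) is far greater than 1, in line with the pattern described for Table 6. The numerical values depend on the placeholder frequencies and should not be read as reproducing the table.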

Missing alleles. Table 7 illustrates the effect of observing the brother when allowing for missing alleles. Now the principal determinant of the additional effect of observing b is whether or not he shares an allele with c. Columns 3 (bgt = {16, 20}), 6 (bgt = {14, 17}), 7 (bgt = {14, 14}) and 8 (bgt = {16, 16}) involve cases where bgt and cgt have no common alleles. Since missing alleles occur independently in different individuals, observation of the brother carries very little additional information on paternity. In columns 4 (bgt = {12, 17}), 5 (bgt = {12, 14}) and 9 (bgt = {12, 12}) the brother and the child share allele 12. In this case, knowing that allele 12 is likely to be present in the paternal line, because it has been observed in the putative father's brother, makes it more probable that pfgt, observed as {14, 14}, was in fact {12, 14}, but with allele 12 missed. This argument is strengthened further when bgt = {12, 12}: whether this is a true homozygote or involves a silent allele, it provides evidence for pfgt truly being {12, s}. The strength of the effect is related to the rarity of allele 12. It decreases slowly as $p_s$ increases.

Example 10.1 shows that when the possibility of silent or missed alleles is taken into account in a paternity testing problem where the putative father appears incompatible with the child, additional information on relatives of the putative father can have a dramatic effect on the paternity ratio. An effect can also be seen in compatible cases.

Example 10.2 The triplet evidence $D$ is now:

m : {12, 15}    pf : {13, 13}    c : {12, 13}.

Paternity ratios allowing for silent alleles are shown in Table 8. The values of $L_D$ in column 2 are much greater than 1 because the triplet is compatible, but they decrease as pr(silent) increases since it is then more likely that pf carries a silent allele. When bgt is also observed, its additional effect depends on its type.
From column 6 of Table 8 we see that there is no effect whatsoever when the brother is heterozygous with no allele in common with the child (bgt = {21, 22}); otherwise there is some effect, which is most apparent in column 5, where bgt is apparently homozygous but different from pfgt: it then becomes more plausible that pf is in fact heterozygous with one silent allele.

The effect of allowing for missed alleles is shown in Table 9. In this case the most interesting configurations are those where b shares at least one allele with pf. In particular, column 4 shows that when the brother is heterozygous (bgt = {13, 16}), for larger values of pr(missed) the paternity ratio decreases, since it is then more likely that pf is truly heterozygous but with a missed allele. On the other hand if bgt = pfgt (= {13, 13}), the paternity ratio is increased by the additional information.

11 Criminal Case

Here we analyse the criminal case represented by Figure 5. The identity of an unrecognisable body is unknown, and it is questioned whether it might be that of a criminal cr whose family had reported his disappearance. The DNA profiles of the criminal's family members, his wife wife and their two children c1 and c2, were typed, and a DNA profile was also extracted from the bodily remains. Two different hypothetical cases are analysed below, to investigate the possible effects of allowing for silent and/or missed alleles (we do not illustrate the additional effects of mutation, which were small). We again use marker VWA with allele frequencies as in Table 1.

Example 11.1 The observed genotypes are:

body : {16, 16}    wife : {13, 14}    c1 : {13, 13}    c2 : {14, 14}.

Both c1 and c2 are apparently incompatible with being the children of body. Table 10 shows the likelihood ratio in favour of identity, body = cr, obtained by propagating the evidence in the network of Figure 5, incorporating lower level networks for silent and missed alleles. The likelihood ratio exceeds 1 for pr(silent) . The effect of missingness alone is slight; when included in addition to silence it slightly reduces the likelihood ratio.

Example 11.2 Here the DNA evidence is:

body : {16, 16}    wife : {13, 14}    c1 : {13, 13}    c2 : {14, 16}.

The difference from Example 11.1 is that c2 is now compatible with being the child of body. Table 11 shows the results of propagating this evidence. When taking the possibility of silent alleles into account the general effect is, as might have been expected, to increase the likelihood ratio; however this is not so for small values of pr(silent) and pr(missed). The likelihood ratio again exceeds 1 when pr(silent) . Additional allowance for missingness increases the likelihood ratio when pr(silent) , while for pr(silent) it slightly reduces the likelihood ratio.

In both the above cases, an apparent exclusion can turn into strong positive evidence for identity as soon as we allow only a small probability of a silent allele. Allowing a small probability of a missed allele yields much weaker evidence in itself, but even here the overall effect of all the evidence could be strongly in favour of identity when there is no exclusion on any other marker.

12 Conclusions

This paper has illustrated how object-oriented Bayesian networks can be fruitfully applied to solving complex problems of forensic DNA identification and paternity testing. The modularity and flexibility of the approach allows ready application to numerous different cases and complicating features. A significant application is to accommodate potential allelic drop-out.
When a silent or missing allele is suspected, the ambiguity in the genotype can sometimes be resolved by retesting. In cases where this is impossible or proves ineffective, it has been common simply to discard the data (Leopoldino and Pena 2002), but it is better to perform an appropriate analysis that properly allows for the ambiguity. We have shown how this can be done using the computational methodology of OOBNs, and have used this to illustrate the sometimes striking impact of even very low levels of drop-out. In particular, as shown in §10, in the presence of silent alleles information on additional relatives can be very powerful in helping to resolve the ambiguity and assess the strength of the evidence.

In this work we have used a very simple model in which the probability of allelic drop-out is independent of the actual allele value. In fact small alleles may be less affected by degradation and so less likely to drop out. Also, as suggested by Clayton et al. (2004), silence due to primer binding site mutation is likely to be associated with the allele repeat number. It should be relatively straightforward to incorporate such more realistic dependencies into our OOBNs. There are numerous further artifacts, such as stutter, drop-in etc., that can occur in DNA profiling and that we have not considered here. Again, most of these can be modelled by modifications to our basic modular structures, along the lines already described. We hope to address some of these issues in future work.

Another important area where this approach could be applied is in the analysis of low copy number (LCN) DNA, which is particularly sensitive both to drop-out and to possible contamination. Whitaker et al. (2001) found that under low copy number conditions approximately 10% per locus of all heterozygotes exhibit allelic drop-out.

Object-oriented Bayesian networks will also be useful for analysing other problems of interest in forensic DNA identification.
For example, Bayesian networks have been applied to the analysis of mixed DNA traces, where several individuals may have contributed to the DNA trace (Mortera et al. 2003; Cowell et al. 2004). In such cases allelic drop-out and other artifacts are known to occur quite often. Incorporation of these additional complicating features in modular object-oriented networks should be reasonably straightforward.