Probabilities and Probabilistic Models


Probabilistic models

A model is a system that simulates an object under consideration. A probabilistic model is a model that produces different outcomes with different probabilities; it can simulate a whole class of objects, assigning each an associated probability. In bioinformatics, the objects are usually DNA or protein sequences, and a model might describe a family of related sequences.

Examples

1. The roll of a six-sided die. The model has parameters $p_1, p_2, \ldots, p_6$, where $p_i$ is the probability of rolling the number $i$. For these to be probabilities, $p_i \ge 0$ and $\sum_{i=1}^{6} p_i = 1$.
2. Three rolls of a die: the model might be that the rolls are independent, so that the probability of a sequence such as [2, 4, 6] would be $p_2 \, p_4 \, p_6$.
3. An extremely simple model of any DNA or protein sequence is a string over a 4-letter (nucleotide) or 20-letter (amino acid) alphabet. Let $q_a$ denote the probability that residue $a$ occurs at a given position, at random, independent of all other residues in the sequence. Then, for a given length $n$, the probability of the sequence $x_1, x_2, \ldots, x_n$ is

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} q_{x_i}.$$
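The independent-residue model of example 3 can be sketched in a few lines of Python. The uniform distribution `q` and the function names are assumptions made for illustration only; real values of $q_a$ would be estimated from data. The log-space variant is included because products of many small probabilities underflow for long sequences.

```python
import math

# Hypothetical background distribution: every nucleotide equally likely.
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def sequence_probability(seq, q):
    """P(x_1, ..., x_n) = product over i of q[x_i] (independent residues)."""
    return math.prod(q[x] for x in seq)

def sequence_log_probability(seq, q):
    """Same quantity in log space: sum over i of log q[x_i]."""
    return sum(math.log(q[x]) for x in seq)

p = sequence_probability("GATTACA", q)
log_p = sequence_log_probability("GATTACA", q)
```

Under the uniform assumption, every sequence of length 7 gets the same probability, $0.25^7$; a non-uniform `q` would favour sequences rich in the more frequent residues.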

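Conditional, joint, and marginal probabilities

Consider two dice $D_1$ and $D_2$. For $j = 1, 2$, assume that the probability of using die $D_j$ is $P(D_j)$, and for $i = 1, 2, \ldots, 6$, the probability of rolling an $i$ with die $D_j$ is $P_{D_j}(i)$.

In this simple two-dice model, the conditional probability of rolling an $i$ with die $D_j$ is

$$P(i \mid D_j) = P_{D_j}(i).$$

The joint probability of picking die $D_j$ and rolling an $i$ is

$$P(i, D_j) = P(D_j) \, P(i \mid D_j).$$

The probability of rolling an $i$, regardless of which die was used, is the marginal probability

$$P(i) = \sum_{j=1}^{2} P(i, D_j) = \sum_{j=1}^{2} P(D_j) \, P(i \mid D_j).$$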
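As a minimal sketch of the two-dice model, the following Python snippet computes joint and marginal probabilities with exact fractions. All numbers are assumptions for illustration: each die is picked with probability 1/2, `D1` is fair, and `D2` shows a 6 half of the time with the remaining mass spread evenly over the other faces.

```python
from fractions import Fraction

# Hypothetical P(D_j): each die is picked with probability 1/2.
p_die = {"D1": Fraction(1, 2), "D2": Fraction(1, 2)}

# Hypothetical P(i | D_j): D1 is fair, D2 favours a 6.
p_roll = {
    "D1": {i: Fraction(1, 6) for i in range(1, 7)},
    "D2": {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
           4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)},
}

def joint(i, d):
    """P(i, D_j) = P(D_j) * P(i | D_j)."""
    return p_die[d] * p_roll[d][i]

def marginal(i):
    """P(i) = sum over j of P(i, D_j)."""
    return sum(joint(i, d) for d in p_die)

p_six = marginal(6)  # 1/2 * 1/6 + 1/2 * 1/2 = 1/3
```

Summing `marginal(i)` over all six faces gives 1, which is a quick check that the joint distribution is properly normalised.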

Maximum likelihood estimation

Probabilistic models have parameters that are usually estimated from large sets of trusted examples, called a training set. For example, the probability $q_a$ of seeing amino acid $a$ in a protein sequence can be estimated as the observed frequency $f_a$ of $a$ in a database of known protein sequences, such as SWISS-PROT. This way of estimating models is called maximum likelihood estimation, because it can be shown that using the observed frequencies maximizes the total probability of the training set, given the model. In general, given a model with parameters $\theta$ and a set of data $D$, the maximum likelihood estimate (MLE) for $\theta$ is the value that maximizes $P(D \mid \theta)$.
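Estimating $q_a$ by observed frequencies amounts to counting. The sketch below assumes a tiny made-up training set of nucleotide strings in place of a real database; the function name `mle_frequencies` is likewise only illustrative.

```python
from collections import Counter

def mle_frequencies(sequences):
    """Estimate q_a as the observed frequency f_a of each residue a."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)          # count every residue in the sequence
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

# Hypothetical toy training set (12 residues in total).
training = ["ACGT", "AACC", "AAGT"]
q_hat = mle_frequencies(training)   # q_hat["A"] is 5/12
```

Note that a residue never seen in the training set gets no entry (an estimate of zero), which is one reason small training sets make plain frequency estimates unreliable.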

Model comparison problem

An occasionally dishonest casino uses two kinds of dice, of which 99% are fair, but 1% are loaded, so that a 6 appears 50% of the time. We pick up a die and roll [6, 6, 6]. This looks like a loaded die, but is it? This is an example of a model comparison problem: our hypothesis $D_{\text{loaded}}$ is that the die is loaded, and the alternative is $D_{\text{fair}}$. Which model fits the observed data better? We want to calculate

$$P(D_{\text{loaded}} \mid [6, 6, 6]).$$

Prior and posterior probability

$P(D_{\text{loaded}} \mid [6, 6, 6])$ is the posterior probability that the die is loaded, given the observed data. Note that the prior probability of this hypothesis is 1/100. It is called prior because it is our best guess about the die before having seen any information about it. The likelihood of the hypothesis $D_{\text{loaded}}$ is

$$P([6, 6, 6] \mid D_{\text{loaded}}) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}.$$

The posterior probability is obtained using Bayes' theorem:

$$P(X \mid Y) = \frac{P(Y \mid X) \, P(X)}{P(Y)},$$

where $P(X \mid Y)$ is the posterior probability and $P(X)$ is the prior probability.

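Comparing models using Bayes' theorem

We set $X = D_{\text{loaded}}$ and $Y = [6, 6, 6]$, thus obtaining

$$P(D_{\text{loaded}} \mid [6, 6, 6]) = \frac{P([6, 6, 6] \mid D_{\text{loaded}}) \, P(D_{\text{loaded}})}{P([6, 6, 6])}.$$

The probability $P(D_{\text{loaded}})$ of picking a loaded die is 0.01. The probability $P([6, 6, 6] \mid D_{\text{loaded}})$ of rolling three sixes using a loaded die is $0.5^3 = 0.125$. The total probability $P([6, 6, 6])$ of three sixes is

$$P([6, 6, 6]) = P([6, 6, 6] \mid D_{\text{loaded}}) \, P(D_{\text{loaded}}) + P([6, 6, 6] \mid D_{\text{fair}}) \, P(D_{\text{fair}}).$$

Now,

$$P(D_{\text{loaded}} \mid [6, 6, 6]) = \frac{0.5^3 \cdot 0.01}{0.5^3 \cdot 0.01 + \left(\frac{1}{6}\right)^3 \cdot 0.99} \approx 0.214.$$

Thus, the die is probably fair.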
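The casino calculation can be reproduced with a short Python sketch. The text only fixes the loaded die's probability of a 6 (0.5); spreading the remaining 0.5 evenly over faces 1 to 5 (0.1 each) is an assumption made here so that the function also handles rolls other than 6. It does not affect the result for [6, 6, 6].

```python
def posterior_loaded(rolls, prior_loaded=0.01):
    """P(D_loaded | rolls) by Bayes' theorem, assuming independent rolls."""
    like_loaded = 1.0
    like_fair = 1.0
    for r in rolls:
        # Assumption: non-six faces of the loaded die have probability 0.1 each.
        like_loaded *= 0.5 if r == 6 else 0.1
        like_fair *= 1.0 / 6.0
    numerator = like_loaded * prior_loaded
    evidence = numerator + like_fair * (1.0 - prior_loaded)  # P(rolls)
    return numerator / evidence

post = posterior_loaded([6, 6, 6])  # approximately 0.214
```

Calling the function with longer runs of sixes shows how the evidence eventually outweighs the small prior: under these assumptions, five sixes in a row already push the posterior above 0.5.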

Biological example

Let's assume that extracellular (ext) proteins have a slightly different composition than intracellular (int) ones. We want to use this to judge whether a new protein sequence $x_1, \ldots, x_n$ is ext or int. To obtain training data, classify all proteins in SWISS-PROT into ext, int and unclassifiable ones. Determine the frequencies $f_a^{\text{ext}}$ and $f_a^{\text{int}}$ of each amino acid $a$ in ext and int proteins, respectively. To be able to apply Bayes' theorem, we need to determine the priors $p^{\text{ext}}$ and $p^{\text{int}}$, i.e. the probabilities that a new (unexamined) sequence is extracellular or intracellular, respectively.

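Biological example (cont.)

We have

$$P(x \mid \text{ext}) = \prod_{i=1}^{n} q_{x_i}^{\text{ext}} \quad \text{and} \quad P(x \mid \text{int}) = \prod_{i=1}^{n} q_{x_i}^{\text{int}},$$

where the residue probabilities $q_a^{\text{ext}}$ and $q_a^{\text{int}}$ are estimated by the observed frequencies $f_a^{\text{ext}}$ and $f_a^{\text{int}}$. If we assume that any sequence is either extracellular or intracellular, then we have

$$P(x) = p^{\text{ext}} \, P(x \mid \text{ext}) + p^{\text{int}} \, P(x \mid \text{int}).$$

By Bayes' theorem, we obtain

$$P(\text{ext} \mid x) = \frac{p^{\text{ext}} \, P(x \mid \text{ext})}{P(x)} = \frac{p^{\text{ext}} \prod_{i} q_{x_i}^{\text{ext}}}{p^{\text{ext}} \prod_{i} q_{x_i}^{\text{ext}} + p^{\text{int}} \prod_{i} q_{x_i}^{\text{int}}},$$

the posterior probability that a sequence is extracellular. (In reality, many transmembrane proteins have both intra- and extracellular components, and more complex models such as HMMs are appropriate.)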
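A minimal sketch of this classifier in Python follows. The two-letter alphabet, the frequency tables and the prior are invented purely for illustration; real tables would cover all 20 amino acids and come from the classified training proteins. The computation is done in log space, an implementation choice not stated in the text, because the products become vanishingly small for realistic sequence lengths. The sketch assumes $p^{\text{int}} = 1 - p^{\text{ext}}$ and that every residue has a nonzero frequency in both tables.

```python
import math

def posterior_ext(seq, q_ext, q_int, p_ext):
    """P(ext | x) via Bayes' theorem, assuming p_int = 1 - p_ext."""
    # log of p_ext * prod q_ext[x_i] and of p_int * prod q_int[x_i]
    log_ext = math.log(p_ext) + sum(math.log(q_ext[a]) for a in seq)
    log_int = math.log(1.0 - p_ext) + sum(math.log(q_int[a]) for a in seq)
    # Subtract the larger term before exponentiating to avoid underflow.
    m = max(log_ext, log_int)
    w_ext = math.exp(log_ext - m)
    w_int = math.exp(log_int - m)
    return w_ext / (w_ext + w_int)

# Hypothetical two-residue frequency tables and a hypothetical prior.
q_ext = {"A": 0.6, "L": 0.4}
q_int = {"A": 0.4, "L": 0.6}
post_ext = posterior_ext("AAL", q_ext, q_int, p_ext=0.5)  # 0.144 / 0.240 = 0.6
```

With equal priors the posterior depends only on the ratio of the two likelihoods; a prior far from 0.5 shifts the decision accordingly.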

Probability vs. likelihood

If we consider $P(X \mid Y)$ as a function of $X$, then this is called a probability. If we consider $P(X \mid Y)$ as a function of $Y$, then this is called a likelihood.
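One practical consequence of this distinction is that a probability sums to 1 over its argument, while a likelihood in general does not. The sketch below illustrates this with the fair and loaded dice; as before, the 0.1 probability for the non-six faces of the loaded die is an assumption.

```python
from fractions import Fraction

# P(roll | die); the non-six probabilities of the loaded die are assumed.
p_roll = {
    "fair": {i: Fraction(1, 6) for i in range(1, 7)},
    "loaded": {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
               4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)},
}

# Fixed die, varying roll: a probability distribution, sums to 1.
prob_sum = sum(p_roll["loaded"][i] for i in range(1, 7))

# Fixed roll (a six), varying die: a likelihood, need not sum to 1.
lik_sum = sum(p_roll[d][6] for d in p_roll)  # 1/6 + 1/2 = 2/3
```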