Comp Fundamentals of Artificial Intelligence Lecture notes, Speech recognition


Tim Morris
School of Computer Science, University of Manchester

1 Introduction to speech recognition

1.1 The problem

Speech recognition is a challenging problem in Artificial Intelligence. It is a solved problem in some restricted settings, for example when spoken words are limited to a small vocabulary. In such cases speech recognition systems are already in common use, e.g. for command and name recognition in mobile phones, or for automated telephony systems. However, error-free recognition of unrestricted continuous speech remains a difficult and unsolved problem. Current technologies achieve reasonable word accuracy and have been improving steadily over the last decade, but they are still some way from matching the performance of human listeners. There is a great deal to be gained by improving these systems. For example, human transcription of speech, as required in certain areas of medicine and law, is expensive and tedious. Automated systems already provide a substantial cost reduction in this area. Nevertheless, further improvements are required and speech recognition remains an active area of industrial and academic research. The most successful current approaches to speech recognition are based on the same probabilistic principles that you saw in the earlier lectures. One reason for this is that a probabilistic approach can be very effective when dealing with highly variable sensory inputs. This was true in the robot localisation problem, for example, where the robot only receives imperfect and corrupted information about its location; it is also true for sound signals received by a microphone or by our ears. However, in the case of speech this variability does not stem only from noise such as irrelevant sounds or sensor errors, although these sources of variability can severely affect speech recognition systems and should be taken into consideration.
A much more significant source of variation in speech signals is due to natural variation in the human voice. A person can say exactly the same sentence in many different ways, with large differences in timing and modulation. Two different people are even more variable in the way that they may say the same word or sentence. A successful speech recognition system has to be able to account for these sources of variation and recognise that the same sentence is being spoken. This requires that the underlying model of speech is sufficiently flexible. In this course we will investigate how one can approach the speech recognition problem by constructing a generative probabilistic model of speech. By constructing a statistical model of how signals derived from speech are generated, it becomes natural to use probability calculus to update our belief about what has been said. The type of probabilistic model that we will construct is called a Hidden Markov Model (HMM). HMMs are currently one of the most popular approaches to speech recognition. Before introducing the HMM, we first have to explain how speech can be represented by features that are useful for modelling. After introducing our choice of features, we will show how a simple classifier can be constructed to differentiate the words yes and no. We will approach this problem by a number of different methods, which will highlight some techniques of central importance in many Artificial Intelligence applications. We will start with a very simple feature-based classifier. Then we will consider a number of HMM approaches to the problem, introducing methods for classification and decoding in HMMs. You will construct a yes/no classifier in the associated lab sessions, where you will see some real speech data first-hand. Finally, we will consider the problem of speech recognition in more complex settings.
1.2 Extracting useful features from sound waves

Sound is carried in the air by pressure waves. In Figure 1 we show how a sound wave is typically represented, as changes in air pressure at a fixed location.

Figure 1: A sound wave represents changes in air pressure over time at a fixed location. The vertical axis represents air pressure, increasing upwards. The horizontal axis represents time, increasing to the right.

This representation of a sound wave is quite nice to look at, but it is not very useful for building a speech recognition system. In order to extract useful features from this signal we therefore have to use some techniques from a subject called signal processing. We will only give a rough sketch of these methods, since the details of signal processing algorithms are quite technical; if you have any friends doing an Electrical Engineering degree then you can quiz them about the details. Hopefully you'll get the general idea. Our main interest in this course is what to do after the signal processing. Figure 2 shows how we begin to extract more useful features from a sound wave. We first break up the signal into a sequence of shorter overlapping segments. It is assumed that the sound wave is sufficiently uniform within each segment that its main properties can be represented by a single vector of features. For sound waves corresponding to speech this assumption typically holds quite well. In the figure, F_t refers to the vector of features at time t.

Figure 2: Extracting features from a sound wave.

After segmenting the signal, we need to find a good set of features to describe the signal within each segment. A popular choice is to use something called a Fourier transform to extract the dominant frequencies of the signal within each segment. The details of the Fourier transform are quite mathematical and are beyond the scope of this course (although you are welcome to investigate them
in your own time if you are interested). However, a useful analogy is to think of a musical chord, which is a combination of notes at specific frequencies. The Fourier transform allows you to extract the individual notes from the chord and determine exactly how loud the contribution from each note is. The same principle can be applied to an arbitrary sound wave, except that in this case there may be contributions from a continuous range of frequencies. After some further processing, we obtain a set of Mel-frequency cepstral coefficients (MFCCs) for each segment of the signal. In the labs you will be working with data representing 13 MFCCs. These are 13 numbers representing the contribution from different frequency bands obtained by a Fourier transform. There is a vector of 13 MFCC features for each segmented part of the speech signal. The mel-frequency bands are spaced so that each band spans a similar perceived frequency range for the human ear.

1.3 Representing words

After the signal processing stage, the sound is represented as a sequence of 13-dimensional vectors, each representing a segment of the original sound wave. The speech recognition task is to label these in some way, so that we can identify the meaning of the spoken word or sentence. One approach would be to directly label the segments as representing a particular word; that is the approach we will follow in the lab and it works reasonably well for short words like yes and no. However, a better approach for longer words and sentences is to break them up into smaller and more elementary utterances called phonemes. Figure 3 shows the sequence of phonemes corresponding to the start of a sentence beginning The Washington (spoken with an East Coast US accent). A quick web search will locate a number of different phonetic alphabets in common use. Sometimes it is useful to combine phonemes, e.g. a sequence of three phonemes is known as a triphone.
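The segmentation step of the front end described in section 1.2 can be sketched in a few lines of Python. This is only a rough stand-in for the real MFCC pipeline: the frame length, hop size and the plain FFT magnitude spectrum below are illustrative choices, and a real front end would add a window function, a mel-spaced filterbank, a log, and a discrete cosine transform before producing the 13 MFCCs.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames of length frame_len,
    advancing by hop samples each time (frames overlap when hop < frame_len)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def spectral_features(signal, frame_len=400, hop=160):
    """Very rough stand-in for the MFCC front end: the per-frame magnitude
    spectrum from an FFT. A real pipeline would add a window function,
    a mel filterbank, a log, and a discrete cosine transform."""
    frames = frame_signal(signal, frame_len, hop)
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: one second of a synthetic 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
features = spectral_features(np.sin(2 * np.pi * 440 * t))
print(features.shape)  # (98, 201): one spectrum per overlapping frame
```

With a 16 kHz sampling rate these defaults correspond to 25 ms frames with a 10 ms hop, which is a common choice for speech, though the exact values here are assumptions rather than the lab's settings.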
Figure 3: Representing a sentence as a sequence of phonemes. The rows show the sound signal, the extracted features, the corresponding phoneme labels, the path through a phoneme HMM, and the path through a triphone HMM. The bottom rows correspond to phoneme and triphone HMMs which we will discuss later.

1.4 Language models

So far we have only considered the low-level processing of a sound wave in order to extract features that can be labelled as words or phonemes. Humans are also able to use their knowledge of language in order to improve their recognition of spoken words, e.g. if we miss a word, or don't hear it very clearly, then we may be able to guess what it was most likely to be. We have a sophisticated understanding of language and no artificial intelligence is yet able to understand natural language as well as we can. The problem of understanding natural languages is very challenging indeed, much harder than the problem of recognising individual spoken words. However, we can use simple statistical language models in order to help disambiguate cases that our speech recognition system finds difficult. The models can identify simple patterns in language, e.g. which words are more likely to follow a particular word in a given language. You may have a simple language model in your mobile
phone's predictive text mode. Probabilistic methods provide an excellent approach for combining evidence from language models with evidence from speech signals.

1.5 Training: learning by example

The models that we will use have to be trained on real examples of speech. Training is the process whereby the parameters of a model are adapted to a particular problem domain, i.e. we fit the parameters to some exemplar data known as training data. Training is a fundamental part of many Artificial Intelligence applications and is the main focus of Machine Learning, a field of research that lies on the interface between Artificial Intelligence and Statistics. Many of the most successful real-world applications of Artificial Intelligence involve some kind of training. In the current problem, the speech models that we develop will be parameterised based on real examples of speech.

1.6 Back to the problem

Now we have the building blocks that a speech recognition system works with. In the preprocessing stage we use signal processing to convert a sound wave into a sequence of MFCC feature vectors. Our job is then to label each of these feature vectors with the word or phoneme being spoken at that time. In the next few lectures we will investigate a number of different approaches to this task.

2 Building a feature-based classifier

Before constructing a HMM for the speech recognition problem, we will first investigate a simpler approach based on analysis of a small set of features describing each recorded word. We will show how such a set of features can be used to construct a probabilistic classifier. A classifier is an agent which takes some input data, in our case features extracted from a speech signal, and predicts the class of the data. In our case the two possible classes are C1 = yes and C2 = no.
Ideally, if the speech signal comes from a person saying yes, then the classifier should predict C1, while if the speech signal comes from a person saying no then the classifier should predict C2. In practice, it is unusual to obtain perfect performance and most methods will make mistakes and misclassify some examples. Our goal when building a classifier is to make as few mistakes as possible. As well as limiting the number of mistakes, it will also be useful for our classifier to represent its belief about the class of each example. Some examples are going to be easy and the agent will be quite certain about the classification. Some examples will be much harder to classify and the agent will be less certain. In an application it would be very useful for the classifier to be able to indicate that it is uncertain, since then the speaker could be prompted to repeat a word. As in the robot localisation example, we can use probabilities to represent the agent's belief in the classification.

2.1 The data and features

In the lab you will be given 165 examples of recorded speech from different people saying yes and no. We will use this data to train our classifier. The raw sound signals are in WAVE files (with a .wav extension) and the corresponding sequences of MFCCs are in files with a .dat extension. The speech signals have been cropped in order to exclude the silence before and after each word is spoken. You will see later how this cropping can be done using an HMM. In order to extract a simple set of features for our classifier we will simply average each MFCC across all time segments. On the left of figure 4 we show the average of the 1st MFCC for the 165 examples given in the lab exercise. It can be seen that there is quite a large difference between the typical results for the different words and for different speakers. This feature looks like it may be useful for constructing a classifier.
We see that all examples whose 1st MFCC is below −15 correspond to a yes, while all examples whose 1st MFCC is above −5 correspond to a no. When the 1st MFCC lies between these values then the example may be either a yes or a no.
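The time-averaging step can be sketched directly: each example is a matrix with one row per time segment and 13 MFCC columns, and we collapse it to a single 13-dimensional feature vector. The toy array below stands in for the contents of one of the .dat files.

```python
import numpy as np

def average_mfcc(mfcc_sequence):
    """Collapse a (time_segments x 13) MFCC sequence into a single
    13-dimensional feature vector by averaging over time."""
    return np.asarray(mfcc_sequence).mean(axis=0)

# A toy example with 4 time segments and 13 MFCCs each.
sequence = np.arange(4 * 13).reshape(4, 13)
features = average_mfcc(sequence)
print(features.shape)  # (13,): one averaged value per MFCC
print(features[0])     # mean of column 0: (0 + 13 + 26 + 39) / 4 = 19.5
```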
Figure 4: On the left is a histogram showing the time-averaged 1st MFCC from 165 examples of different speakers saying yes and no. On the right we show the signal duration for the same data (in seconds).

On the right of figure 4 we show the duration of the signals for the same words. This feature looks much less useful for constructing a classifier.

2.2 Building a classifier from one feature

Our general approach to building a classifier will be based on Bayes' theorem. We will take C1 to mean the proposition "the class is C1" and we will take x to mean "the value of the feature is observed to be x". Then we can use Bayes' theorem to work out the probability that the class is C1 given that we have observed the feature to be x,

p(C1 | x) = p(x | C1) p(C1) / [ p(x | C1) p(C1) + p(x | C2) p(C2) ]   (1)

Here, p(C1) is our prior belief that the word is in the class C1. We have p(C2) = 1 − p(C1), since we are assuming that the word must be in one class or the other. The prior will depend on the application. For example, if we were using the yes/no classifier for a cinema telephone booking system, then it could be reasonable to assume that the caller is not interested in most films and therefore set p(C1) to be quite small. In other applications the agent's prior belief will be equal for both classes and p(C1) = p(C2) = 0.5. In order to calculate p(C1 | x) using the above formula, we have to know what p(x | C1) and p(x | C2) are. In the robot localisation example these were constructed by modelling the error made by the robot sensors as a normal distribution. In the present example we will use the same approach, but now we will take more care explaining exactly how to use data to estimate the parameters of the normal distribution. This is what we mean by training the model. The features we will consider are continuous numbers. That means that p(x | C1) and p(x | C2) actually represent probability density functions rather than probability mass functions.
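A minimal sketch of this classifier in Python, assuming normal class-conditional densities whose parameters are fitted by the empirical mean and variance of the training data (the fitting step is described in the next subsections). The training numbers here are invented for illustration; they are not the lab data.

```python
import math

def fit_normal(data):
    """Empirical mean and variance of a list of numbers."""
    n = len(data)
    m = sum(data) / n
    s2 = sum((x - m) ** 2 for x in data) / n
    return m, s2

def normal_pdf(x, mean, var):
    """Normal probability density function."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_class1(x, params1, params2, prior1=0.5):
    """Posterior p(C1 | x) from Bayes' theorem, equation (1)."""
    a = normal_pdf(x, *params1) * prior1
    b = normal_pdf(x, *params2) * (1 - prior1)
    return a / (a + b)

# Invented training data standing in for the time-averaged 1st MFCC.
yes_data = [-20.0, -18.0, -16.0, -14.0]
no_data = [-8.0, -6.0, -4.0, -2.0]
yes_params = fit_normal(yes_data)
no_params = fit_normal(no_data)

print(p_class1(-19.0, yes_params, no_params))  # close to one: clearly a yes
print(p_class1(-3.0, yes_params, no_params))   # close to zero: clearly a no
```

Values of x between the two clusters give posteriors nearer 0.5, which is exactly the classifier's way of signalling that it is uncertain.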
The normal distribution is a very important and useful example of a probability density function and we describe it in detail now.

Probability densities and the normal distribution

A probability density function p(x) represents the probability p(x)dx that the variable x lies in the interval [x, x + dx] in the limit dx → 0. It is a non-negative function and the area under the function is one, i.e.

∫ p(x) dx = 1
The mean µ and variance σ² are defined to be,

µ = ∫ x p(x) dx,   σ² = ∫ (x − µ)² p(x) dx   (2)

where σ is known as the standard deviation and characterises the width, or spread, of the distribution. The normal distribution, also known as the Gaussian distribution, has the probability density function

p(x) = (1 / √(2πσ²)) exp( −(x − µ)² / (2σ²) )   (3)

and the mean and variance are the two parameters of the distribution. Figure 5 shows two examples of normal density functions, with a different mean and variance in each case.

Figure 5: On the left we show the density function p(x_1) for a normally distributed variable x_1 with mean 2 and variance 1. On the right we show the density function p(x_2) for a normally distributed variable x_2 with mean 4 and variance 0.5. The arrows indicate a width of two standard deviations (2σ).

Training: fitting the normal distributions to data

We will approximate p(x | C1) and p(x | C2) by normal distributions fitted to some training data. It is possible to fit a normal distribution to data by setting its mean and variance equal to the empirical mean m and variance s² of the data. Given a set of N data points {x^(1), x^(2), ..., x^(N)}, then m and s² are defined as

m = (1/N) Σ_{n=1}^{N} x^(n)   and   s² = (1/N) Σ_{n=1}^{N} (x^(n) − m)²   (4)

Note that the label n is a superscript in these expressions, not a power. On the left of figure 6 we show two normal distributions fitted to the yes and no data histograms by setting µ = m and σ² = s² for each of the two classes of data. The histograms have now been normalised so that they represent frequencies in each class, and therefore have area one. This provides a simple and useful way for us to estimate p(x | C1) and p(x | C2). This fit is not perfect, because it is unlikely that the data is exactly normally distributed and because we have quite a small
data set. However, the fit doesn't look too bad, and it provides us with a simple method to build a classifier.

Figure 6: On the left are the normal density functions fitted to the yes and no data. The mean and variance of the normal densities were set equal to the empirical mean and variance of MFCC 1 in the training data. On the right is the resulting classification rule obtained by assuming equal prior probability for each class.

Building the classifier

Now that we have expressions for p(x | C1) and p(x | C2), learned from our training data, we can simply plug them into equation (1) and work out the required classification probability for any new example. The result is shown on the right of figure 6, where we have assumed equal prior probability for each class, i.e. p(C1) = p(C2) = 0.5. The classification rule is intuitively reasonable. For small values of MFCC 1 the classifier is confident that the word is yes and the probability is therefore close to one in this regime. For relatively large values of MFCC 1 the classifier is confident that the word is no, so the probability of yes is close to zero. In between these regimes the classifier is less confident about the correct classification.

2.3 Building a classifier from multiple features

We can usually build a better classifier by considering more than one feature. Let bold x now represent a vector of features x = [x_1, x_2, ..., x_d]. This vector can be represented as a point in a d-dimensional space; we therefore often refer to the examples in the data set as data points. As an example, figure 7 shows the same data as previously, but now plotting two MFCCs for each data point. We can generalise to higher dimensions by considering further MFCCs. In this case it is harder to visualise the data, but exactly the same methods that we develop below can be applied. The probability of the feature vector x is interpreted to mean p(x) = p(x_1 Λ x_2 Λ ... Λ x_d), where x_i is taken to mean "the value of the ith feature is x_i".
We often write p(x_1 Λ x_2 Λ ... Λ x_d) as p(x_1, x_2, ..., x_d), so that the comma is interpreted as an "and". This is natural when dealing with vectors, since the probability can be thought of as a function with d arguments.
Figure 7: If two features are used to build the classifier then the data can be represented as points in two dimensions. Each point can be considered a vector x = [x_1, x_2], where x_1 is the distance from the origin along the horizontal axis and x_2 is the distance from the origin along the vertical axis.

We can easily generalise equation (1). The only difference is that the scalar quantity x is now replaced by the vector quantity x. By Bayes' theorem again,

p(C1 | x) = p(x | C1) p(C1) / [ p(x | C1) p(C1) + p(x | C2) p(C2) ]   (5)

In order to use this classification rule, we have to come up with expressions for p(x | C1) and p(x | C2). We will again approximate these by using normal distributions fitted to some training data. We will further assume that the features are conditionally independent given knowledge of the class. This means that we make the simplifying assumption that the probability density for the vector x can be written as a product of the densities for each feature, i.e.

p(x | Ck) = p(x_1 | Ck) p(x_2 | Ck) ... p(x_d | Ck)   for each class Ck.

In figure 8 we show an example of a normal density in two dimensions obtained by making this assumption. Each feature x_1 and x_2 has a normal density as given in the previous example (recall figure 5). In the bottom row we show the corresponding density function for the two-dimensional vector x = [x_1, x_2]. Note that the volume under the density surface for a 2D vector is one, which is analogous to the condition that the area under the curve is one for the density of a scalar quantity. As before, we can estimate the mean and variance of p(x_i | C1) and p(x_i | C2) as the empirical mean and variance of each feature using the training data from each class. On the left of figure 9 we show the
contours of equal probability for normal densities fitted to the data from each class. Once we have approximated p(x | C1) and p(x | C2) in this way, we can then compute p(C1 | x) using equation (5) just as we did for the single-feature example. We will again assume equal prior probabilities p(C1) = p(C2) = 0.5. The only difference from the single-feature case is that now p(C1 | x) depends on a vector, which in our two-dimensional example has two components x = [x_1, x_2]. In this case one can show the resulting classification probability as a surface, as shown on the right of figure 9.

Figure 8: Examples of normal density functions in one and two dimensions. Top left: Density function p(x_1) for a normally distributed variable x_1 with mean 2 and variance 1. Top right: Density function p(x_2) for a normally distributed variable x_2 with mean 4 and variance 0.5. Bottom left: Density function p(x_1, x_2) obtained by making the simplifying assumption that p(x_1 Λ x_2) = p(x_1)p(x_2). Bottom right: Contours of equal probability for the same density function.

The type of classifier developed here is an example of a naïve Bayes classifier. It is called naive because of the assumption that features are conditionally independent given knowledge of the class. This greatly simplifies the classification problem, but it is quite a crude approximation for many problems. Nevertheless, it remains a popular type of probabilistic classifier and works well in many applications. More advanced classifiers have been developed, and classification is quite a mature and highly developed research area in Machine Learning.
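The naive Bayes rule can be sketched directly: the class-conditional density is a product of one-dimensional normal densities, one per feature, and equation (5) combines the two classes. The per-feature means and variances below are invented for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """One-dimensional normal probability density function."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def class_density(x, means, variances):
    """Naive Bayes assumption: p(x | C) is a product of one-dimensional
    normal densities, one factor per feature."""
    p = 1.0
    for xi, m, v in zip(x, means, variances):
        p *= normal_pdf(xi, m, v)
    return p

def posterior(x, params1, params2, prior1=0.5):
    """p(C1 | x) via equation (5), using the factorised likelihoods."""
    a = class_density(x, *params1) * prior1
    b = class_density(x, *params2) * (1 - prior1)
    return a / (a + b)

# Hypothetical (means, variances) for two features of each class.
yes_params = ([-17.0, 0.5], [5.0, 0.02])
no_params = ([-5.0, 0.3], [5.0, 0.02])

print(posterior([-16.0, 0.45], yes_params, no_params))  # close to one: yes
print(posterior([-5.0, 0.3], yes_params, no_params))    # close to zero: no
```

The same function handles any number of features; only the lengths of the mean and variance lists change, which is exactly why the naive assumption makes higher-dimensional classifiers so easy to build.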
Figure 9: On the left we show the result of fitting two normal densities to the data points from each class. On the right is the resulting classification probability, assuming equal prior probability for each class.

3 Markov chains

A Markov chain, also known as a 1st order Markov process, is a probabilistic model of sequences. It can be thought of as a generative model, i.e. a model of how sequences of objects can be produced in a random or probabilistic fashion.

3.1 Definitions

An nth order Markov process specifies a probability distribution over sequences. Let s_t be the object at time t in a temporal sequence of objects. The objects in the sequence belong to a finite set of states, s_t ∈ S, known as the state space. In our case these states will be either words, phonemes or triphones, depending on the type of model we are discussing. In the case of words, the state space would be the vocabulary. In the case of phonemes, the state space would be the phonetic alphabet being used, or some subset of it. Let s denote the whole sequence from s_1 to s_T, where T is the length of the sequence,

s = s_1 s_2 s_3 ... s_{t-1} s_t ... s_T.

For an nth order Markov process, the probability of s_t being generated at time t is determined only by the n values before it in the sequence. In other words, the probability of the current element is conditionally independent of all previous elements given knowledge of the last n,

p(s_t | s_{t-1}, s_{t-2}, ..., s_1) = p(s_t | s_{t-1}, s_{t-2}, ..., s_{t-n}).

A zeroth order Markov process is an uncorrelated random sequence of objects. Throws of a die or coin tosses are examples of zeroth order Markov processes, since the previous event has no influence on the current event. This is not a good model for many real-world sequential processes, such as sequences of words or phonemes, where the context of an object has a large influence on its probability of appearing in a sequence.
We are therefore typically more interested in the case where the process order n > 0. In this course we will mainly be dealing with 1st order Markov processes, commonly referred to as Markov chains, in which case
p(s_t | s_{t-1}, s_{t-2}, ..., s_1) = p(s_t | s_{t-1})

and the probability of the current element in the sequence is determined only by the previous one. Later we will see that higher order Markov processes can always be mapped onto a 1st order Markov process with an expanded state space; it is therefore sufficient to consider only the 1st order case, and all the results carry over to the higher order case in a straightforward way. We will only consider homogeneous Markov processes, defined as those for which p(s_t = i | s_{t-1} = j) = p(s_t' = i | s_{t'-1} = j) for all times t and t'.

3.2 Graphical representation

A Markov chain can be represented by a directed graph, which is a graph where the edges between pairs of nodes have a direction. An example is shown in Figure 10. Here, there are two states a and b. The numbers associated with the edges are the conditional probabilities determining what the next state in the sequence will be, or whether the sequence will terminate. These probabilities are called transition probabilities. They remain fixed over time for a homogeneous Markov process. One can imagine generating sequences of a and b from this graph by starting at the START state and then following the arrows according to their associated transition probabilities until we reach the STOP state. Hence, we see that this is a generative model of sequences. The START and STOP states are special because they do not generate a symbol, and therefore we do not include them in the state space S.

Figure 10: Graphical representation of a Markov chain model of sequences with state space S = {a, b}.

The information represented by Figure 10 is,

p(s_1 = a) = 0.5, p(s_1 = b) = 0.5,
p(s_t = a | s_{t-1} = a) = 0.4, p(s_t = a | s_{t-1} = b) = 0.2,
p(s_t = b | s_{t-1} = b) = 0.7, p(s_t = b | s_{t-1} = a) = 0.5,
p(STOP | s_t = a) = 0.1, p(STOP | s_t = b) = 0.1.

3.3 Calculating the probability of a sequence

We can use the transition probabilities to work out the probability of the model producing a particular sequence s,
p(s) = p(s_1) ∏_{t=2}^{T+1} p(s_t | s_{t-1})   (6)

where we have defined s_{T+1} = STOP. For example,

p(aab) = p(s_1 = a) p(s_2 = a | s_1 = a) p(s_3 = b | s_2 = a) p(STOP | s_3 = b),

which is 0.5 times 0.4 times 0.5 times 0.1, or 0.01.

3.4 Normalisation condition

Notice that the probabilities on the arrows leaving each state sum to one. This is the normalisation condition for a Markov chain model and can be written,

∑_j p(s_t = j | s_{t-1} = i) + p(STOP | s_{t-1} = i) = 1   for every state i   (7)

This condition ensures that we make a transition at time t with probability one.

3.5 Transition matrix

An alternative way to describe a Markov chain model is to write the transition probabilities in a table. Table 1 shows the transition probabilities associated with the model in Figure 10. The matrix of numbers in this table (or sometimes its transpose) is referred to as the transition matrix of the model. Notice that the numbers in each row sum to one. This is the same as the normalisation condition, i.e. that the transitions from each state sum to one, equivalent to equation (7) above. A matrix with this property is called a stochastic matrix.

p(s_t | s_{t-1})      s_t = a    s_t = b    s_t = STOP
s_{t-1} = a           0.4        0.5        0.1
s_{t-1} = b           0.2        0.7        0.1
s_{t-1} = START       0.5        0.5        0

Table 1: This table shows the transition probabilities p(s_t | s_{t-1}) between states for the Markov chain model given in Figure 10. Transitions are from the states on the left to the states along the top.

3.6 Unfolding in time

We can display all of the possible sequences of a fixed length T and their associated probabilities by unfolding a Markov chain model in time. In Figure 11 we show the model from Figure 10 unfolded for three time steps. This explicitly shows all of the possible sequences of length three that could be produced, and their probabilities can be obtained by simply multiplying the numbers as one passes along the edges from the START state to the STOP state. Notice that the transition probabilities leaving each state don't sum to one for this unfolded model, because it does not represent all possible sequences. Paths leading to sequences that are shorter or longer than three are not included in this figure.
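The model of Figure 10 and Table 1 can be written down directly as a dictionary of transition probabilities, with equation (6) becoming a product along the path (the START and STOP transitions included). This is a sketch; the value p(STOP | b) = 0.1 follows from the normalisation condition.

```python
# Transition probabilities from Figure 10 / Table 1.
P = {
    ("START", "a"): 0.5, ("START", "b"): 0.5,
    ("a", "a"): 0.4, ("a", "b"): 0.5, ("a", "STOP"): 0.1,
    ("b", "a"): 0.2, ("b", "b"): 0.7, ("b", "STOP"): 0.1,
}

def sequence_probability(seq):
    """Multiply the transition probabilities along the path, including the
    START and STOP transitions, as in equation (6). Missing entries are
    treated as zero-probability transitions."""
    path = ["START"] + list(seq) + ["STOP"]
    p = 1.0
    for prev, nxt in zip(path, path[1:]):
        p *= P.get((prev, nxt), 0.0)
    return p

print(sequence_probability("aab"))  # 0.5 * 0.4 * 0.5 * 0.1 = 0.01, up to floating point
```

Representing the chain as a dictionary keyed by (previous state, next state) pairs also makes the sparsity of larger models, such as the phoneme model discussed later, essentially free: absent entries simply mean forbidden transitions.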
Figure 11: Unfolding the model shown in Figure 10 to show all possible sequences of length three.

This unfolded model shows that there is an exponential increase in the number of possible sequences as the length of the sequence increases. In this example there are 2³ = 8 possible sequences of length 3. One can see this by observing that there are two choices at each of t = 1, 2 and 3. In general there are 2^L possible distinct sequences of length L. This number soon becomes huge as L increases. We therefore have to be very careful to design efficient algorithms that don't explicitly work with this number of sequences. Luckily such algorithms exist, which is one of the reasons that Markov chain models and hidden Markov models are so useful.

3.7 Phoneme state models

In Figure 12 we show a model where the states are the phonemes that combine to make the words yes and no. Each state has a self-transition, which allows the phoneme to be repeated a number of times so that, for example, yes could be yehs, yehehs or yyehss etc. The reason for these self-transitions is that they can be used to model the duration of each phoneme in speech.

Figure 12: Markov chain model with a phoneme state space S = {y, eh, s, n, ow}. The model generates phoneme sequences corresponding to the words yes and no. Each phoneme can be repeated a number of times to model its duration within a spoken word.

[Table 2 layout: rows s_{t-1} ∈ {y, eh, s, n, ow, START}, columns s_t ∈ {y, eh, s, n, ow, STOP}.]

Table 2: This table shows the transition probabilities between states for the Markov chain model given in Figure 12. Most transition probabilities are zero, indicating that these states are not connected. A matrix like this with many zero entries is called a sparse matrix.

The transition probabilities for this model are given in Table 2. As before, the normalisation condition ensures that the rows sum to one. Notice that most of the entries in the table are zero, indicating that most of the transitions between states are not allowed.
This sparse structure is typical of Markov
chain models for phonemes corresponding to a limited number of specific words, since the number of possible paths is then highly restricted. In Figure 13 the same model has been unfolded in time to show all possible sequences of exactly three phonemes. The probability of each sequence being produced by the model can easily be worked out by multiplying the transition probabilities along each path, e.g.

p(yehs) = p(s_1 = y) p(eh | y) p(s | eh) p(STOP | s),
p(nnow) = p(s_1 = n) p(n | n) p(ow | n) p(STOP | ow) = 0.09,
p(nowow) = p(s_1 = n) p(ow | n) p(ow | ow) p(STOP | ow).

Figure 13: The phoneme model from Figure 12 unfolded to show all possible sequences of length three.

Since there are two ways to produce no, we can add them to obtain the probability that a sequence has length T = 3 and is no,

p("no" Λ T = 3) = p(nnow) + p(nowow).

We can also work out the probability that a sequence of length T = 3 is a no by using the rules of probability that you should know,

p("no" | T = 3) = p("no" Λ T = 3) / p(T = 3) = p("no" Λ T = 3) / [ p("no" Λ T = 3) + p("yes" Λ T = 3) ].

This calculation shows that a sequence of length 3 is more likely to be a no than a yes in this model. At first this may seem to contradict the model, since Figure 12 shows that sequences representing the words yes and no are generated with equal probability. This can be seen by observing that the initial transitions to y and n are equally likely, so sequences starting with y and n are equally likely. However, if we only consider short sequences then the model assigns greater probability to the word no. This word has fewer states than yes, and the self-transitions from these states have lower probability, resulting in shorter words on average. Therefore, once we condition on the length of the sequence, the two words are no longer equally likely.

3.8 Efficient computation and recursion relations

The above calculation is simple for short sequences. However, the number of possible paths through the model grows exponentially with the sequence length, and explicit computation becomes infeasible.
Luckily, the rules of probability provide us with an efficient means to deal with this in Markov chain models. Consider the problem of determining the probability that a sequence is of length T and a "no", as we did above for the special case T = 3. We note that p("no" ∧ T) is equivalent to p(s_1 = n, s_{T+1} = STOP) for the model in Figure 12 (we will use a comma in place of ∧ to make the notation below more compact). We can write the sum over all possible sequences between s_1 = n and s_{T+1} = STOP as

p(s_1 = n, s_{T+1} = STOP) = Σ_{s_2} Σ_{s_3} ... Σ_{s_T} p(s_{T+1} = STOP | s_T) p(s_T | s_{T-1}) ... p(s_3 | s_2) p(s_2 | s_1 = n) p(s_1 = n).

On the face of it, this looks like a sum over a huge number of paths through the model. In the worst case this sum would require the calculation of S^{T-1} terms, where S is the size of the state space. However, by evaluating these sums in a particular order we can calculate the sum much more efficiently:

p(s_1 = n, s_{T+1} = STOP) = Σ_{s_T} p(s_{T+1} = STOP | s_T) Σ_{s_{T-1}} p(s_T | s_{T-1}) ... Σ_{s_2} p(s_3 | s_2) p(s_2 | s_1 = n) p(s_1 = n).

Working from the innermost sum outwards, we first compute p(s_1 = n, s_2 = j) = p(s_2 = j | s_1 = n) p(s_1 = n) for every state j, then p(s_1 = n, s_3 = j) = Σ_i p(s_3 = j | s_2 = i) p(s_1 = n, s_2 = i), and so on. Except for the first and last steps, each step involves a sum over S terms which has to be worked out for each of S states. Therefore, each of these steps requires a number of operations (multiplications and additions) roughly proportional to S^2. The number of these steps is T − 2, so the total number of operations is at least (T − 2) S^2. This is much better than an exponential scaling and allows the computation to be carried out efficiently for long sequences and quite large state spaces. For models where many states are not connected, such as typical phoneme models, the computation time is even less. The reason is that the sums above only need to include connected states, since p(s_t = i | s_{t-1} = j) = 0 for two unconnected states i and j (recall Table 2). In the 2nd year algorithms course you will see more examples of algorithms like this, and you will learn how to calculate their computational complexity, which determines how the computation time scales with the sequence length and the size of the state space.

We can write the above set of equations in a more compact form as a recursion relation,

p(s_1 = n, s_{t+1} = j) = Σ_{i ∈ S} p(s_{t+1} = j | s_t = i) p(s_1 = n, s_t = i),   (8)

where we have defined s_{T+1} = STOP.

3.9 Training

It is possible to fit a Markov chain model to training data in order to estimate the transition probabilities. The simplest approach is just to use frequency counts from the sequences in the training data.
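Before moving on to training, the ordered summation of section 3.8 can be sketched and checked against the exponential brute-force sum. The transition probabilities below are illustrative assumptions (only the connectivity matches Figure 12).

```python
from itertools import product

# Assumed transition probabilities for the yes/no phoneme model of Figure 12.
P = {
    "START": {"y": 0.5, "n": 0.5},
    "y":  {"y": 0.5, "eh": 0.5},
    "eh": {"eh": 0.5, "s": 0.5},
    "s":  {"s": 0.5, "STOP": 0.5},
    "n":  {"n": 0.4, "ow": 0.6},
    "ow": {"ow": 0.25, "STOP": 0.75},
}
states = ["y", "eh", "s", "n", "ow"]

def p_no_bruteforce(T):
    """Explicit sum over all S**(T-1) completions of s_1 = n (exponential in T)."""
    total = 0.0
    for rest in product(states, repeat=T - 1):
        seq = ("n",) + rest
        p = P["START"]["n"]
        for a, b in zip(seq, seq[1:]):
            p *= P[a].get(b, 0.0)
        total += p * P[seq[-1]].get("STOP", 0.0)
    return total

def p_no_recursion(T):
    """Equation (8): propagate p(s_1 = n, s_t = j) forward in O(T * S**2) time."""
    alpha = {j: (P["START"]["n"] if j == "n" else 0.0) for j in states}
    for _ in range(T - 1):
        alpha = {j: sum(P[i].get(j, 0.0) * alpha[i] for i in states)
                 for j in states}
    return sum(P[i].get("STOP", 0.0) * alpha[i] for i in states)

print(p_no_recursion(3), p_no_bruteforce(3))  # the two methods agree
```

The brute-force version becomes hopeless as T grows, while the recursion only ever keeps one vector of S numbers in memory.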
Let N_{ij} be the number of times that state j follows state i in the training data. We simply set the transition probabilities to be proportional to these counts, i.e.

p(s_t = j | s_{t-1} = i) = N_{ij} / Σ_k N_{ik}.   (9)
The denominator ensures that the probabilities are properly normalised, i.e. the sum of all transition probabilities leaving any state is one.

As an example, consider the two-state Markov model in Figure 10. Imagine that the transition probabilities are unknown but that we have observed the following three sequences: abbaab, ababaaa, abbbba. We are asked to estimate the transition probabilities for the model from this training data. First we compute the counts for each type of transition, which we can put in a table as shown below.

N_{i,j}    | j = a | j = b | j = STOP
i = a      |   3   |   5   |    2
i = b      |   4   |   4   |    1
i = START  |   3   |   0   |    0

Then equation (9) shows that we should normalise the counts by the sum over each respective row to obtain the transition probabilities.

p(s_t = j | s_{t-1} = i) | j = a | j = b | j = STOP
i = a                    | 3/10  | 5/10  |   2/10
i = b                    |  4/9  |  4/9  |   1/9
i = START                |   1   |   0   |    0

3.10 Higher order models

Figure 14: A 2nd order Markov process with state space S = {a, b} is equivalent to a 1st order Markov process with an expanded state space S' = {aa, ab, ba, bb}. Notice that some transitions are not allowed.

So far, we have only considered 1st order Markov processes. Higher order processes can capture even greater contextual information. In fact, all higher order Markov processes can be mapped onto a 1st order process with an extended alphabet. For example, if we have a fully connected 2nd order model with state space S = {a, b} then this is equivalent to a 1st order model with state space S' = {aa, ab, ba, bb} with the transitions shown in Figure 14. The extended model is not fully connected, because a state ending with a must be followed by a state starting with a, and similarly a state ending with b must be followed by a state starting with b. The reason is that these states overlap in the original sequence, e.g.

abbaaba  →  ab, bb, ba, aa, ab, ba.

Since we can map a higher order Markov process onto a 1st order model, all the theory carries over in a straightforward manner.
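The count-and-normalise estimate of equation (9), applied to the three training sequences in the worked example above, can be sketched in a few lines; the b row reproduces the 4/9, 4/9, 1/9 values given in the notes.

```python
from collections import defaultdict

sequences = ["abbaab", "ababaaa", "abbbba"]

# N[i][j] counts how often state j follows state i, with START and STOP added.
N = defaultdict(lambda: defaultdict(int))
for seq in sequences:
    path = ["START"] + list(seq) + ["STOP"]
    for i, j in zip(path, path[1:]):
        N[i][j] += 1

# Equation (9): divide each count by its row sum to get transition probabilities.
p = {i: {j: n / sum(row.values()) for j, n in row.items()}
     for i, row in N.items()}

print({i: dict(row) for i, row in N.items()})
print(p["b"])  # the b row: 4/9, 4/9, 1/9
```

Note that every observed transition appears in some row, so each row of p automatically sums to one.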
A triphone Markov chain model is an example of a 3rd order phoneme Markov process mapped onto a 1st order process with an extended state space (recall Figure 3). The additional context provided by a triphone model can greatly improve the performance of the hidden Markov model speech recognition methods that we discuss next, because the context of a phoneme can strongly influence the way that it is spoken.

4 Hidden Markov Models

Hidden Markov models (HMMs) combine aspects of the Markov chain model and the feature-based classifier that you have seen in the previous two lectures. Underlying an HMM is a Markov chain model. The difference is that the states of the Markov chain cannot be observed; they are hidden or latent variables. Instead of producing a sequence of states, like a Markov chain, each state of the HMM emits a feature vector according to an emission probability p(x_t | s_t). All we observe is a sequence of feature vectors, and we cannot be sure which states produced them. Like a Markov chain model, an HMM is a generative model of sequential data, but the generated sequence is now a sequence of feature vectors x_1, x_2, x_3, ..., x_t, x_{t+1}, ..., x_T. In the speech recognition example this would be a sequence of MFCC feature vectors.

Figure 15 shows an example of an HMM that can be used to model utterances of the words "yes" and "no" with silence before and after. The transition probabilities have exactly the same meaning as for a Markov chain. They show that the words are equally likely and that the silence before and after is of the same typical duration. As usual, the normalisation condition ensures that the transition probabilities leaving each state sum to one. The difference between this and a Markov chain model is that each state is now associated with an emission probability distribution,

p(x_t | s_t = SIL),   p(x_t | s_t = "yes"),   p(x_t | s_t = "no"),

which describes the distribution of features while each utterance is being spoken.
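Because an HMM is a generative model, a model like the one in Figure 15 can be sampled from directly. In this sketch the transition probabilities and the one-dimensional Gaussian emissions are invented stand-ins (a real system would emit MFCC feature vectors with learned densities).

```python
import random

# Hypothetical HMM in the spirit of Figure 15: silence, then "yes" or "no",
# then silence again. All numbers are assumptions chosen for illustration.
trans = {
    "START": {"SIL1": 1.0},
    "SIL1": {"SIL1": 0.8, "yes": 0.1, "no": 0.1},
    "yes":  {"yes": 0.7, "SIL2": 0.3},
    "no":   {"no": 0.7, "SIL2": 0.3},
    "SIL2": {"SIL2": 0.8, "STOP": 0.2},
}
# Emission distributions p(x_t | s_t): mean and standard deviation of a
# 1-D Gaussian standing in for an MFCC feature vector.
emit = {"SIL1": (0.0, 0.3), "yes": (2.0, 1.0), "no": (-2.0, 1.0), "SIL2": (0.0, 0.3)}

def generate(rng):
    """Sample a hidden state path and one emitted feature per time step."""
    state, path, feats = "START", [], []
    while True:
        state = rng.choices(list(trans[state]), weights=list(trans[state].values()))[0]
        if state == "STOP":
            return path, feats
        mu, sigma = emit[state]
        path.append(state)
        feats.append(rng.gauss(mu, sigma))

path, feats = generate(random.Random(1))
print(path)
print([round(x, 2) for x in feats])
```

Only the feature list would be visible to a recogniser; the state path that produced it is exactly what is hidden.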
Figure 15: An HMM for modelling "yes" and "no" utterances with silence before and after.

4.1 Probability of a sequence of features and states

It is straightforward to calculate the joint probability of a sequence of features and a sequence of states. Every time we make a transition to a new state, we have to multiply in the probability that the current feature vector was emitted by that state, i.e.
p(x_1, ..., x_T, s_1, ..., s_T) = ∏_{t=1}^{T+1} p(s_t | s_{t-1}) ∏_{t=1}^{T} p(x_t | s_t),   (10)

where s_0 = START and s_{T+1} = STOP. The problem is, we don't know what the state sequence s = s_1, s_2, ..., s_T is. If we knew the state sequence, then we would already have solved the problem, since we would know which word had been spoken. Our task is to make inferences using only the feature data. This task can be carried out by two approaches, described below, for which efficient algorithms exist.

4.2 Classification

One approach is to compute the probability of the observed sequence of features under models corresponding to different classes of data. For example, we can construct an HMM for utterances of the word "yes" as shown in Figure 16.

Figure 16: An HMM for modelling the word "yes" with silence before and after.

As before we will call "yes" class C_1 and "no" class C_2. We can use this model to work out p(x_1, x_2, ..., x_T | C_1). We can use a similar model of the word "no" to compute p(x_1, x_2, ..., x_T | C_2). Then Bayes' theorem allows us to classify a sequence,

p(C_1 | x_1, ..., x_T) = p(x_1, ..., x_T | C_1) p(C_1) / [ p(x_1, ..., x_T | C_1) p(C_1) + p(x_1, ..., x_T | C_2) p(C_2) ].   (11)

As usual, we need to assign the prior p(C_1) in order to use this classification rule. We also need a way of efficiently computing p(x_1, x_2, ..., x_T | C_i) for each model. We can use a recursion relation similar to the one for summing over paths in the Markov chain model, which was given by equation (8). For each model we calculate the probabilities p(x_1, ..., x_t, s_t = j) recursively:

p(x_1, s_1 = j) = p(x_1 | s_1 = j) p(s_1 = j | START),
p(x_1, ..., x_t, s_t = j) = p(x_t | s_t = j) Σ_{i ∈ S} p(s_t = j | s_{t-1} = i) p(x_1, ..., x_{t-1}, s_{t-1} = i)   for t = 2, ..., T,
p(x_1, ..., x_T) = Σ_{i ∈ S} p(s_{T+1} = STOP | s_T = i) p(x_1, ..., x_T, s_T = i).

This is known as the Forward Algorithm in the HMM literature, because it involves iterating forward along the sequence from t = 1 until terminating at t = T.

4.3 Decoding

A classification approach works well for distinguishing between a small number of possible words or phrases. However, if we want to recognise whole sentences then this approach is not feasible. We can hardly construct a different model for every possible sentence in the English language. Instead,
we have to use a decoding approach to the problem. We want to determine which state sequence s = s_1, s_2, ..., s_T corresponds to a particular sequence of feature vectors x_1, x_2, ..., x_T (recall Figure 3). A sensible choice is the sequence s* that is most likely given the sequence of feature vectors,

s* = argmax_{s_1, ..., s_T} p(s_1, ..., s_T | x_1, ..., x_T).

Here, argmax means that the quantity on the right is maximised with respect to the quantity written underneath. This is an example of an optimisation problem, since we are finding the state sequence that optimises a particular quantity. Optimisation is important in many areas of Artificial Intelligence. It turns out that there exists an efficient algorithm for working this out, the celebrated Viterbi Algorithm. It has a similar recursive form to the Forward Algorithm. The details are beyond the scope of this course, but if you are interested then there are some good HMM reviews available that discuss it. Rabiner's [2] is a classic and also discusses speech recognition (you can find it via the University website), while Durbin et al. [1] provide a nice introduction to HMMs in a different application domain. Both are at an advanced level.

Viterbi decoding was used to remove the silence from the data you saw in Lab 2. The optimal path through a model like the one in Figure 15 was obtained for all 165 examples in order to identify which parts of the speech were associated with the SIL states. All the feature vectors associated with the SIL states were then removed from the MFCC data files. For Lab 3 you will be working with the original uncropped data.

4.4 Training

The transition and emission probabilities can be estimated by fitting them to training data. If the training data sequences are labelled, so that the corresponding state sequences are known, then training is straightforward.
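Although the details of the Viterbi Algorithm are beyond the course, its shape is easy to sketch. The two-state model and discrete emission probabilities below are invented purely for illustration; real speech systems use models like Figure 15 with continuous emission densities.

```python
# Toy HMM: a silence state and a speech state with discrete observations.
# All states, probabilities and symbols are hypothetical.
trans = {"START":  {"SIL": 0.9, "SPEECH": 0.1},
         "SIL":    {"SIL": 0.8, "SPEECH": 0.2},
         "SPEECH": {"SIL": 0.2, "SPEECH": 0.8}}
emit = {"SIL":    {"quiet": 0.9, "loud": 0.1},
        "SPEECH": {"quiet": 0.2, "loud": 0.8}}
states = ["SIL", "SPEECH"]

def viterbi(xs):
    """Return the most probable state path s* = argmax_s p(s, x_1..x_T)."""
    # delta[j]: probability of the best path for x_1..x_t that ends in state j.
    delta = {j: trans["START"][j] * emit[j][xs[0]] for j in states}
    backptr = []
    for x in xs[1:]:
        prev = delta
        delta, ptr = {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * trans[i][j])
            delta[j] = prev[best] * trans[best][j] * emit[j][x]
            ptr[j] = best
        backptr.append(ptr)
    # Trace the best final state back to the start of the sequence.
    path = [max(states, key=lambda j: delta[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["quiet", "quiet", "loud", "loud", "quiet"]))
# ['SIL', 'SIL', 'SPEECH', 'SPEECH', 'SIL']
```

The key difference from the Forward Algorithm is the replacement of the sum over previous states by a max, plus the back-pointers needed to recover the winning path; this is exactly the structure used to strip the SIL frames for the labs.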
In this case the transition probabilities can be calculated using counts in the same way as for the Markov chain model, while the emission probability densities can be estimated in a similar fashion to the normal density models described earlier. However, hand-labelling of training data is difficult, time-consuming and error-prone. More usually the training data is not labelled, although the sentence or words being spoken in the training data will usually be known. In this case training can be carried out using the Baum-Welch algorithm. The details of this algorithm are beyond the scope of the course. It is a special case of the EM algorithm, which is also very useful in many other areas of Machine Learning.

Acknowledgements

Many thanks to Gwenn Englebienne for providing the figures for the signal processing examples and for help in processing the speech data.

References

[1] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.

[2] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
More informationBNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
More informationLinear Systems and Gaussian Elimination
Eivind Eriksen Linear Systems and Gaussian Elimination September 2, 2011 BI Norwegian Business School Contents 1 Linear Systems................................................ 1 1.1 Linear Equations...........................................
More informationEXPONENTS. To the applicant: KEY WORDS AND CONVERTING WORDS TO EQUATIONS
To the applicant: The following information will help you review math that is included in the Paraprofessional written examination for the Conejo Valley Unified School District. The Education Code requires
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationInfinite Algebra 1 supports the teaching of the Common Core State Standards listed below.
Infinite Algebra 1 Kuta Software LLC Common Core Alignment Software version 2.05 Last revised July 2015 Infinite Algebra 1 supports the teaching of the Common Core State Standards listed below. High School
More informationDiscrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 18. A Brief Introduction to Continuous Probability
CS 7 Discrete Mathematics and Probability Theory Fall 29 Satish Rao, David Tse Note 8 A Brief Introduction to Continuous Probability Up to now we have focused exclusively on discrete probability spaces
More informationThis unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
More informationOverview of Math Standards
Algebra 2 Welcome to math curriculum design maps for Manhattan Ogden USD 383, striving to produce learners who are: Effective Communicators who clearly express ideas and effectively communicate with diverse
More informationChapter 4. Probability and Probability Distributions
Chapter 4. robability and robability Distributions Importance of Knowing robability To know whether a sample is not identical to the population from which it was selected, it is necessary to assess the
More informationSection 1.1. Introduction to R n
The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to
More informationSpeech Recognition Basics
Sp eec h A nalysis and Interp retatio n L ab Speech Recognition Basics Signal Processing Signal Processing Pattern Matching One Template Dictionary Signal Acquisition Speech is captured by a microphone.
More informationMATHEMATICS (CLASSES XI XII)
MATHEMATICS (CLASSES XI XII) General Guidelines (i) All concepts/identities must be illustrated by situational examples. (ii) The language of word problems must be clear, simple and unambiguous. (iii)
More informationA Supervised Approach To Musical Chord Recognition
Pranav Rajpurkar Brad Girardeau Takatoki Migimatsu Stanford University, Stanford, CA 94305 USA pranavsr@stanford.edu bgirarde@stanford.edu takatoki@stanford.edu Abstract In this paper, we present a prototype
More informationAlgebra I Pacing Guide Days Units Notes 9 Chapter 1 ( , )
Algebra I Pacing Guide Days Units Notes 9 Chapter 1 (1.11.4, 1.61.7) Expressions, Equations and Functions Differentiate between and write expressions, equations and inequalities as well as applying order
More informationDETERMINANTS. b 2. x 2
DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form a 11 x 1 + a 12 x 2 = b 1 a 21 x 1 + a 22 x 2 = b 2 This can be written more concisely in
More informationThe School District of Palm Beach County ALGEBRA 1 REGULAR Section 1: Expressions
MAFS.912.AAPR.1.1 MAFS.912.ASSE.1.1 MAFS.912.ASSE.1.2 MAFS.912.NRN.1.1 MAFS.912.NRN.1.2 MAFS.912.NRN.2.3 ematics Florida August 16  September 2 Understand that polynomials form a system analogous
More informationIntroduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011
Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning
More information2. Describing Data. We consider 1. Graphical methods 2. Numerical methods 1 / 56
2. Describing Data We consider 1. Graphical methods 2. Numerical methods 1 / 56 General Use of Graphical and Numerical Methods Graphical methods can be used to visually and qualitatively present data and
More informationCHAPTER 2. Inequalities
CHAPTER 2 Inequalities In this section we add the axioms describe the behavior of inequalities (the order axioms) to the list of axioms begun in Chapter 1. A thorough mastery of this section is essential
More informationFinite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture  01
Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras Lecture  01 Welcome to the series of lectures, on finite element analysis. Before I start,
More informationDecember 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS
December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in twodimensional space (1) 2x y = 3 describes a line in twodimensional space The coefficients of x and y in the equation
More information