# Comp 14112 Fundamentals of Artificial Intelligence Lecture notes, 2015-16 Speech recognition

## Transcription

#### 1.2 Extracting useful features from sound waves

Sound is carried in the air by pressure waves. In Figure 1 we show how a sound wave is typically represented, as changes in air pressure at a fixed location.

Figure 1: A sound wave represents changes in air pressure over time at a fixed location. The vertical axis represents air pressure, increasing upwards. The horizontal axis represents time, increasing to the right.

This representation of a sound wave is quite nice to look at, but it is not very useful for building a speech recognition system. In order to extract useful features from this signal we therefore have to use some techniques from a subject called signal processing. We will only give a rough sketch of these methods, since the details of signal processing algorithms are quite technical; if you have any friends doing an Electrical Engineering degree then you can quiz them about the details. Hopefully you'll get the general idea. Our main interest in this course is what to do after the signal processing.

Figure 2 shows how we begin to extract more useful features from a sound wave. We first break up the signal into a sequence of shorter overlapping segments. It is assumed that the sound wave is sufficiently uniform within each segment that its main properties can be represented by a single vector of features. For sound waves corresponding to speech this assumption typically holds quite well. In the figure, $F_t$ refers to the vector of features at time $t$.

Figure 2: Extracting features from a sound wave.

After segmenting the signal, we need to find a good set of features to describe the signal within each segment. A popular choice is to use something called a Fourier transform to extract the dominant frequencies of the signal within each segment. The details of the Fourier transform are quite mathematical and are beyond the scope of this course (although you are welcome to investigate them in your own time if you are interested). However, a useful analogy is to think of a musical chord, which is a combination of notes at specific frequencies. The Fourier transform allows you to extract the individual notes from the chord and determine exactly how loud the contribution from each note is. The same principle can be applied to an arbitrary sound wave, except that in this case there may be contributions from a continuous range of frequencies.

After some processing, we obtain a set of Mel-frequency cepstrum coefficients (MFCCs) for each segment of the signal. In the labs you will be working with data representing 13 MFCCs. These are 13 numbers representing the contribution from different frequency bands obtained by a Fourier transform. There is a vector of 13 MFCC features for each segmented part of the speech signal. The Mel-frequency bands are spaced so that each band appears to the human ear to cover a similar range of frequencies.

#### 1.3 Representing words

After the signal processing stage, the sound is represented as a sequence of 13-dimensional vectors, each representing a segment of the original sound wave. The speech recognition task is to label these in some way, so that we can identify the meaning of the spoken word or sentence. One approach would be to directly label the segments as representing a particular word; that is the approach we will be following in the lab and it works reasonably well for short words like "yes" and "no". However, a better approach for longer words and sentences is to break them up into smaller and more elementary utterances called phonemes. Figure 3 shows the sequence of phonemes corresponding to the start of a sentence beginning "The Washington" (spoken with an East Coast US accent). A quick web search will locate a number of different phonetic alphabets in common use. Sometimes it is useful to combine phonemes, e.g. a sequence of three phonemes is known as a triphone.
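
Returning to the Fourier-transform step of Section 1.2, the chord analogy can be made concrete with a small sketch. It is not the MFCC pipeline used in the labs, only the segmentation and Fourier steps, and the sample rate, frame length, hop size and the two "notes" are invented for illustration.

```python
import cmath
import math

def split_into_frames(signal, frame_len, hop):
    """Cut the signal into overlapping segments (overlap = frame_len - hop)."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def dft_magnitudes(frame):
    """Return |X_k| for k = 0 .. N/2 using the textbook O(N^2) DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        mags.append(abs(total))
    return mags

# Hypothetical "chord": a loud 5 Hz note plus a quieter 12 Hz note,
# sampled at 64 Hz for 2 seconds.
SAMPLE_RATE = 64
signal = [1.0 * math.sin(2 * math.pi * 5 * i / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 12 * i / SAMPLE_RATE)
          for i in range(2 * SAMPLE_RATE)]

frames = split_into_frames(signal, frame_len=64, hop=32)
mags = dft_magnitudes(frames[0])
# With a 64-sample frame at 64 Hz, bin k corresponds to k Hz.
loudest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2]
print(len(frames), loudest)
```

The two largest magnitudes sit at bins 5 and 12, and the first is twice the second, which mirrors "how loud the contribution from each note is".
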

Figure 3: Representing a sentence as a sequence of phonemes. Row labels in the figure: Sound signal; Extracted features; Labelling corresponding phonemes; Path through HMM; Path through triphone HMM. The bottom rows correspond to phoneme and triphone HMMs, which we will discuss later.

#### 1.4 Language models

So far we have only considered the low-level processing of a sound wave in order to extract features that can be labelled as words or phonemes. Humans are also able to use their knowledge of language in order to improve their recognition of spoken words, e.g. if we miss a word, or don't hear it very clearly, then we may be able to guess what it was most likely to be. We have a sophisticated understanding of language and no artificial intelligence is yet able to understand natural language as well as we can. The problem of understanding natural languages is very challenging indeed, much harder than the problem of recognising individual spoken words. However, we can use simple statistical language models in order to help disambiguate cases that our speech recognition system finds difficult. The models can identify simple patterns in language, e.g. which words are more likely to follow a particular word in a given language. You may have a simple language model in your mobile phone's predictive text mode. Probabilistic methods provide an excellent approach for combining evidence from language models with evidence from speech signals.

#### 1.5 Training: learning by example

The models that we will use have to be trained on real examples of speech. Training is the process whereby the parameters of a model are adapted to a particular problem domain, i.e. we fit the parameters to some exemplar data known as training data. Training is a fundamental part of many Artificial Intelligence applications and is the main focus of Machine Learning, a field of research that lies on the interface between Artificial Intelligence and Statistics. Many of the most successful real-world applications of Artificial Intelligence involve some kind of training. In the current problem the speech models that we will develop will be parameterised based on real examples of speech.

#### 1.6 Back to the problem

Now we have the building blocks that a speech recognition system works with. In the pre-processing stage we use signal processing to convert a sound wave into a sequence of MFCC feature vectors. Our job is then to label each of these feature vectors with the word or phoneme being spoken at that time. In the next few lectures we will investigate a number of different approaches for solving this task.

### 2 Building a feature-based classifier

Before constructing an HMM for the speech recognition problem, we will first investigate a simpler approach based on analysis of a small set of features describing each recorded word. We will show how such a set of features can be used to construct a probabilistic classifier. A classifier is an agent which takes some input data, in our case features extracted from a speech signal, and predicts the class of the data. In our case the two possible classes are $C_1$ = "yes" or $C_2$ = "no".
Ideally, if the speech signal comes from a person saying "yes", then the classifier should predict $C_1$, while if the speech signal comes from a person saying "no" then the classifier should predict $C_2$. In practice, it is unusual to obtain perfect performance and most methods will make mistakes and misclassify some examples. Our goal when building a classifier is to make as few mistakes as possible.

As well as limiting the number of mistakes, it will also be useful for our classifier to represent its belief about the class of each example. Some examples are going to be easy and the agent will be quite certain about the classification. Some examples will be much harder to classify and the agent will be less certain. In an application it would be very useful for the classifier to be able to indicate that it is uncertain, since then the speaker could be prompted to repeat a word. As in the robot localisation example, we can use probabilities to represent the agent's belief in the classification.

#### 2.1 The data and features

In the lab you will be given 165 examples of recorded speech from different people saying "yes" and "no". We will use this data in order to train our classifier. The raw sound signals are in WAVE files (with a `.wav` extension) and the corresponding sequences of MFCCs are in files with a `.dat` extension. The speech signals have been cropped in order to exclude the silence before and after each word is spoken. You will see later how this cropping can be done using an HMM.

In order to extract a simple set of features for our classifier we will simply average each MFCC across all time segments. On the left of Figure 4 we show the average of the 1st MFCC for the 165 examples given in the lab exercise. It can be seen that there is quite a large difference between the typical results for the different words and for different speakers. This feature looks like it may be useful for constructing a classifier.

We see that all examples whose 1st MFCC is below -15 correspond to a "yes", while all examples whose 1st MFCC is above -5 correspond to a "no". When the 1st MFCC lies between these values the example may be either a "yes" or a "no".
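
The feature extraction and the thresholds read off the histogram can be sketched as follows. This is a minimal illustration, not the lab code: the function names, the choice of column 0 for the 1st MFCC, and the example values are all assumptions.

```python
def time_average(mfcc_frames, index=0):
    """Average one MFCC (column `index`) over all time segments."""
    values = [frame[index] for frame in mfcc_frames]
    return sum(values) / len(values)

def rough_label(avg_first_mfcc, yes_below=-15.0, no_above=-5.0):
    """Apply the thresholds read off the histogram in Figure 4."""
    if avg_first_mfcc < yes_below:
        return "yes"
    if avg_first_mfcc > no_above:
        return "no"
    return "uncertain"

# Hypothetical utterance: 4 time segments, 13 MFCCs each (values invented).
utterance = [[-18.0] + [0.0] * 12,
             [-22.0] + [0.0] * 12,
             [-16.0] + [0.0] * 12,
             [-20.0] + [0.0] * 12]

feature = time_average(utterance)
print(feature, rough_label(feature))
```

The "uncertain" band between the two thresholds is exactly the region where a probabilistic classifier is needed.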

Figure 4: On the left is a histogram showing the time-averaged 1st MFCC from 165 examples of different speakers saying "yes" and "no". On the right we show the signal duration for the same data (in seconds).

On the right of Figure 4 we show the duration of the signals for the same words. This feature looks much less useful for constructing a classifier.

#### 2.2 Building a classifier from one feature

Our general approach to building a classifier will be based on Bayes' theorem. We will take $C_1$ to mean the proposition "the class is $C_1$" and we will take $x$ to mean "the value of the feature is observed to be $x$". Then we can use Bayes' theorem to work out the probability that the class is $C_1$ given that we have observed the feature to be $x$,

$$p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)} \qquad (1)$$

Here, $p(C_1)$ is our prior belief that the word is in the class $C_1$. We have $p(C_2) = 1 - p(C_1)$ since we are assuming that the word must be in one class or the other. The prior will depend on the application. For example, if we were using the yes/no classifier for a cinema telephone booking system, then it could be reasonable to assume that the caller is not interested in most films and therefore set $p(C_1)$ to be quite small. In other applications the agent's prior belief will be equal for both classes and $p(C_1) = p(C_2) = 0.5$.

In order to calculate $p(C_1 \mid x)$ using the above formula, we have to know what $p(x \mid C_1)$ and $p(x \mid C_2)$ are. In the robot localisation example these were constructed by modelling the error made by the robot sensors as a normal distribution. In the present example we will use the same approach, but now we will take more care explaining exactly how to use data in order to estimate the parameters of the normal distribution. This is what we mean by training the model.

The features we will consider are continuous numbers. That means that $p(x \mid C_1)$ and $p(x \mid C_2)$ are actually probability density functions rather than probability mass functions.
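
As a quick numerical check of equation (1), the sketch below combines two likelihood values with a prior. The function name and the likelihood values are invented; how the likelihoods are actually obtained is the subject of the rest of this section.

```python
def posterior_c1(lik_c1, lik_c2, prior_c1=0.5):
    """p(C1 | x) from equation (1), given p(x | C1), p(x | C2) and p(C1)."""
    prior_c2 = 1.0 - prior_c1
    numerator = lik_c1 * prior_c1
    evidence = numerator + lik_c2 * prior_c2
    return numerator / evidence

# Invented likelihood values: x is four times more likely under C1.
print(posterior_c1(0.08, 0.02))                # equal priors
print(posterior_c1(0.08, 0.02, prior_c1=0.1))  # "yes" assumed rare
```

With equal priors the posterior for $C_1$ is 0.8; with $p(C_1) = 0.1$, as in the cinema booking example, the same evidence gives only about 0.31.
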
The normal distribution is a very important and useful example of a probability density function and we describe it in detail now.

##### Probability densities and the normal distribution

A probability density function $p(x)$ represents the probability $p(x)\,dx$ that the variable $x$ lies in the interval $[x, x + dx]$ in the limit $dx \to 0$. It is a non-negative function and the area under the function is one, i.e.

$$\int_{-\infty}^{\infty} p(x)\,dx = 1.$$

The mean $\mu$ and variance $\sigma^2$ are defined to be

$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx, \qquad \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx \qquad (2)$$

where $\sigma$ is known as the standard deviation and characterises the width, or spread, of the distribution. The normal distribution, also known as the Gaussian distribution, has the following probability density function,

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \qquad (3)$$

and the mean and variance are the two parameters of the distribution. Figure 5 shows two examples of normal density functions, with a different mean and variance in each case.

Figure 5: On the left we show the density function $p(x_1)$ for a normally distributed variable $x_1$ with mean 2 and variance 1. On the right we show the density function $p(x_2)$ for a normally distributed variable $x_2$ with mean 4 and variance 0.5. The arrows indicate a width of two standard deviations ($2\sigma$).

##### Training: fitting the normal distributions to data

We will approximate $p(x \mid C_1)$ and $p(x \mid C_2)$ by normal distributions fitted to some training data. It is possible to fit a normal distribution to data by setting its mean and variance equal to the empirical mean $m$ and variance $s^2$ of the data. Given a set of $N$ data points $\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $m$ and $s^2$ are defined as

$$m = \frac{1}{N} \sum_{n=1}^{N} x^{(n)} \quad \text{and} \quad s^2 = \frac{1}{N} \sum_{n=1}^{N} \left(x^{(n)} - m\right)^2 \qquad (4)$$

Note that the label $n$ is a superscript in these expressions, not a power.

On the left of Figure 6 we show two normal distributions fitted to the "yes" and "no" data histograms by setting $\mu = m$ and $\sigma^2 = s^2$ for each of the two classes of data. The histograms have now been normalised so that they represent frequencies in each class, and therefore they have area one. This provides a simple and useful way for us to estimate $p(x \mid C_1)$ and $p(x \mid C_2)$. This fit is not perfect, because it is unlikely that the data is exactly normally distributed and because we have quite a small
data set. However, the fit doesn't look too bad, and it provides us with a simple method to build a classifier.

Figure 6: On the left are the normal density functions fitted to "yes" and "no" data. The mean and variance of the normal densities were set equal to the empirical mean and variance of MFCC 1 in the training data. On the right is the resulting classification rule obtained by assuming equal prior probability for each class.

##### Building the classifier

Now that we have expressions for $p(x \mid C_1)$ and $p(x \mid C_2)$, learned from our training data, we can simply plug them into equation (1) and work out the required classification probability for any new example. The result is shown on the right of Figure 6, where we have assumed equal prior probability for each class, i.e. $p(C_1) = p(C_2) = 0.5$. The classifier rule is intuitively reasonable. For small values of MFCC 1 the classifier is confident that the word is "yes" and the probability is therefore close to one in this regime. For relatively large values of MFCC 1 the classifier is confident that the word is "no", so that the probability of "yes" is close to zero. In between these regimes the classifier is less confident about the correct classification.

#### 2.3 Building a classifier from multiple features

We can usually build a better classifier by considering more than one feature. Let bold $\mathbf{x}$ now represent a vector of features $\mathbf{x} = [x_1, x_2, \ldots, x_d]$. This vector can be represented as a point in a $d$-dimensional space; we therefore often refer to the examples in the data set as data points. As an example, Figure 7 shows the same data as previously but now plotting two MFCCs for each data point. We can generalise to higher dimensions by considering further MFCCs. In this case it is harder to visualise the data, but exactly the same methods that we develop below can be applied.

The probability of the feature vector $\mathbf{x}$ is interpreted to mean $p(\mathbf{x}) = p(x_1 \wedge x_2 \wedge \ldots \wedge x_d)$, where $x_i$ is taken to mean "the value of the $i$th feature is $x_i$". We often write $p(x_1 \wedge x_2 \wedge \ldots \wedge x_d)$ as $p(x_1, x_2, \ldots, x_d)$ so that the comma is interpreted as an "and". This is natural when dealing with vectors, since the probability can be thought of as a function with $d$ arguments.
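
Looking back at the single-feature classifier of Section 2.2, the three steps (fit a normal density per class, evaluate it at a new value, apply Bayes' theorem) can be sketched as below. The training values are invented, and dividing by $N$ when computing the variance is an assumption of this sketch.

```python
import math

def fit_normal(data):
    """Empirical mean and variance (dividing by N, an assumption here)."""
    n = len(data)
    m = sum(data) / n
    s2 = sum((x - m) ** 2 for x in data) / n
    return m, s2

def normal_pdf(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def p_yes(x, yes_params, no_params, prior_yes=0.5):
    """p(C1 = yes | x) via Bayes' theorem with two fitted normal densities."""
    a = normal_pdf(x, *yes_params) * prior_yes
    b = normal_pdf(x, *no_params) * (1.0 - prior_yes)
    return a / (a + b)

# Invented training values of the time-averaged 1st MFCC.
yes_data = [-20.0, -17.0, -14.0, -12.0, -9.0]
no_data = [-11.0, -7.0, -4.0, -2.0, 1.0]

yes_params = fit_normal(yes_data)
no_params = fit_normal(no_data)
for x in (-20.0, -10.0, 0.0):
    print(x, round(p_yes(x, yes_params, no_params), 3))
```

The printed probabilities fall from close to one at small feature values to close to zero at large ones, with an uncertain region in between.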

Figure 7: If two features are used to build the classifier then the data can be represented as points in two dimensions. Each point can be considered a vector $\mathbf{x} = [x_1, x_2]$, where $x_1$ is the distance from the origin along the horizontal axis and $x_2$ is the distance from the origin along the vertical axis.

We can easily generalise equation (1). The only difference is that the scalar quantity $x$ is now replaced by the vector quantity $\mathbf{x}$. By Bayes' theorem again,

$$p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_1)\,p(C_1) + p(\mathbf{x} \mid C_2)\,p(C_2)} \qquad (5)$$

In order to use this classification rule, we have to come up with expressions for $p(\mathbf{x} \mid C_1)$ and $p(\mathbf{x} \mid C_2)$. We will again approximate these by using normal distributions fitted to some training data. We will further assume that the features are conditionally independent given knowledge of the class. This means that we make the simplifying assumption that the probability density for the vector $\mathbf{x}$ can be written as a product of the densities for each feature, i.e.

$$p(\mathbf{x} \mid C_1) = p(x_1 \mid C_1)\,p(x_2 \mid C_1) \cdots p(x_d \mid C_1) = \prod_{i=1}^{d} p(x_i \mid C_1),$$

$$p(\mathbf{x} \mid C_2) = p(x_1 \mid C_2)\,p(x_2 \mid C_2) \cdots p(x_d \mid C_2) = \prod_{i=1}^{d} p(x_i \mid C_2).$$

In Figure 8 we show an example of a normal density in two dimensions obtained by making this assumption. Each feature $x_1$ and $x_2$ has a normal density as given in the previous example (recall Figure 5). In the bottom row we show the corresponding density function for the two-dimensional vector $\mathbf{x} = [x_1, x_2]$. Note that the volume under the density surface for a 2D vector is one, which is analogous to the condition that the area under the curve is one for the density of a scalar quantity.

Figure 8: Examples of normal density functions in one and two dimensions. Top left: Density function $p(x_1)$ for a normally distributed variable $x_1$ with mean 2 and variance 1. Top right: Density function $p(x_2)$ for a normally distributed variable $x_2$ with mean 4 and variance 0.5. Bottom left: Density function $p(x_1, x_2)$ obtained by making the simplifying assumption that $p(x_1 \wedge x_2) = p(x_1)\,p(x_2)$. Bottom right: Contours of equal probability for the same density function.

As before, we can estimate the mean and variance of $p(x_i \mid C_1)$ and $p(x_i \mid C_2)$ as the empirical mean and variance of each feature using the training data from each class. On the left of Figure 9 we show the contours of equal probability for normal densities fitted to the data from each class. Once we have approximated $p(\mathbf{x} \mid C_1)$ and $p(\mathbf{x} \mid C_2)$ in this way, we can then compute $p(C_1 \mid \mathbf{x})$ using equation (5) just as we did for the single feature example. We will again assume equal prior probabilities $p(C_1) = p(C_2) = 0.5$. The only difference from the single feature case is that now $p(C_1 \mid \mathbf{x})$ depends on a vector, which in our two-dimensional example has two components $\mathbf{x} = [x_1, x_2]$. In this case one can show the resulting classification probability as a surface, as shown on the right of Figure 9.

The type of classifier that has been developed here is an example of a naive Bayes classifier. It is called naive because of the assumption that features are conditionally independent given knowledge of the class. This greatly simplifies the classification problem, but it is quite a crude approximation for many problems. However, this remains a popular type of probabilistic classifier and works well in many applications. More advanced classifiers have been developed, and classification is a mature and highly developed research area in Machine Learning.
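
A naive Bayes classifier of this kind can be sketched in a few lines. The sketch below is illustrative only: the training points are invented, the function names are not from the lab code, and the variance is computed by dividing by the number of points, which is an assumption.

```python
import math

def fit_class(points):
    """Per-feature empirical mean and variance for one class."""
    n, d = len(points), len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(d)]
    variances = [sum((p[i] - means[i]) ** 2 for p in points) / n
                 for i in range(d)]
    return means, variances

def class_density(x, means, variances):
    """p(x | C) as a product of one-dimensional normal densities."""
    density = 1.0
    for xi, m, v in zip(x, means, variances):
        density *= math.exp(-(xi - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return density

def posterior_c1(x, model_c1, model_c2, prior_c1=0.5):
    """p(C1 | x) from Bayes' theorem with the naive independence assumption."""
    a = class_density(x, *model_c1) * prior_c1
    b = class_density(x, *model_c2) * (1.0 - prior_c1)
    return a / (a + b)

# Invented two-feature training points (e.g. two time-averaged MFCCs).
yes_points = [[-18.0, 3.0], [-15.0, 5.0], [-12.0, 4.0], [-16.0, 6.0]]
no_points = [[-6.0, -2.0], [-3.0, 0.0], [-8.0, -1.0], [-4.0, 1.0]]

model_yes = fit_class(yes_points)
model_no = fit_class(no_points)
print(posterior_c1([-16.0, 4.0], model_yes, model_no))
print(posterior_c1([-5.0, 0.0], model_yes, model_no))
```

Adding further features only lengthens the vectors; the code is unchanged, which reflects the remark that the same methods apply in higher dimensions.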

Figure 9: On the left we show the result of fitting two normal densities to the data points from each class. On the right is the resulting classification probability, assuming equal prior probability for each class.

### 3 Markov chains

A Markov chain, also known as a 1st order Markov process, is a probabilistic model of sequences. It can be thought of as a generative model, i.e. a model of how sequences of objects can be produced in a random or probabilistic fashion.

#### 3.1 Definitions

An $n$th order Markov process specifies a probability distribution over sequences. Let $s_t$ be the object at time $t$ in a temporal sequence of objects. The objects in the sequence belong to a finite set of states, $s_t \in S$, known as the state space. In our case these states will be either words, phonemes or triphones, depending on the type of model we are discussing. In the case of words, the state space would be the vocabulary. In the case of phonemes, the state space would be the phonetic alphabet being used or some subset of the phonetic alphabet.

Let $s$ denote the whole sequence from $s_1$ to $s_T$, where $T$ is the length of the sequence,

$$s = s_1 s_2 s_3 \ldots s_{t-1} s_t \ldots s_T.$$

For an $n$th order Markov process, the probability of $s_t$ being generated at time $t$ is only determined by the $n$ values before it in the sequence. In other words, the probability of the current element is conditionally independent of all previous elements given knowledge of the last $n$,

$$p(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_1) = p(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_{t-n}).$$

A zeroth order Markov process is an uncorrelated random sequence of objects. Throws of a die or coin tosses are examples of zeroth order Markov processes, since the previous event has no influence on the current event. This is not a good model for many real-world sequential processes, such as sequences of words or phonemes, where the context of an object has a large influence on its probability of appearing in a sequence.
We are therefore typically more interested in the case where the process order $n > 0$. In this course we will mainly be dealing with 1st order Markov processes, commonly referred to as Markov chains, in which case

$$p(s_t \mid s_{t-1}, s_{t-2}, \ldots, s_1) = p(s_t \mid s_{t-1})$$

and the probability of the current element in the sequence is only determined by the previous one. Later we will see that higher order Markov processes can always be mapped onto a 1st order Markov process with an expanded state space; therefore it is sufficient for us to consider only the 1st order case, and all the results carry over to the higher order case in a straightforward way. We will only consider homogeneous Markov processes, defined as those for which $p(s_t = i \mid s_{t-1} = j) = p(s_{t'} = i \mid s_{t'-1} = j)$ for all times $t$ and $t'$.

#### 3.2 Graphical representation

A Markov chain can be represented by a directed graph, which is a graph where the edges between pairs of nodes have a direction. An example is shown in Figure 10. Here, there are two states $a$ and $b$. The numbers associated with the edges are the conditional probabilities determining what the next state in the sequence will be, or whether the sequence will terminate. These probabilities are called transition probabilities. They remain fixed over time for a homogeneous Markov process. One can imagine generating sequences of $a$ and $b$ from this graph by starting at the START state and then following the arrows according to their associated transition probability until we reach the STOP state. Hence, we see that this is a generative model of sequences. The START and STOP states are special because they do not generate a symbol and therefore we do not include them in the state space $S$.

Figure 10: Graphical representation of a Markov chain model of sequences with state space $S = \{a, b\}$.

The information represented by Figure 10 is

$$p(s_1 = a) = 0.5, \quad p(s_1 = b) = 0.5,$$

$$p(s_t = a \mid s_{t-1} = a) = 0.4, \quad p(s_t = a \mid s_{t-1} = b) = 0.2,$$

$$p(s_t = b \mid s_{t-1} = b) = 0.7, \quad p(s_t = b \mid s_{t-1} = a) = 0.5,$$

$$p(\text{STOP} \mid s_t = a) = 0.1, \quad p(\text{STOP} \mid s_t = b) = 0.1.$$

#### 3.3 Calculating the probability of a sequence

We can use the transition probabilities in order to work out the probability of the model producing a particular sequence $s$,
$$p(s) = p(s_1)\,p(s_2 \mid s_1)\,p(s_3 \mid s_2) \cdots p(s_{T+1} \mid s_T) = p(s_1) \prod_{t=2}^{T+1} p(s_t \mid s_{t-1}) \qquad (6)$$

where we have defined $s_{T+1} = \text{STOP}$. For example,

$$p(aab) = p(s_1 = a)\,p(s_2 = a \mid s_1 = a)\,p(s_3 = b \mid s_2 = a)\,p(s_4 = \text{STOP} \mid s_3 = b),$$

which is $0.5 \times 0.4 \times 0.5 \times 0.1 = 0.01$.

#### 3.4 Normalisation condition

Notice that the probabilities on the arrows leaving each state sum to one. This is the normalisation condition for a Markov chain model and can be written

$$\sum_{s_t} p(s_t \mid s_{t-1}) = 1 \qquad (7)$$

where the sum runs over all states in $S$ together with STOP. This condition ensures that we make a transition at time $t$ with probability one.

#### 3.5 Transition matrix

An alternative way to describe a Markov chain model is to write the transition probabilities in a table. Table 1 shows the transition probabilities associated with the model in Figure 10. The matrix of numbers in this table (or sometimes its transpose) is referred to as the transition matrix of the model. Notice that the numbers in each row sum to one. This is the same as the normalisation condition, i.e. that the transitions from each state sum to one, and is equivalent to equation (7) above. A matrix with this property is called a stochastic matrix.

| $p(s_t \mid s_{t-1})$ | $s_t = a$ | $s_t = b$ | $s_t = \text{STOP}$ |
| --- | --- | --- | --- |
| $s_{t-1} = a$ | 0.4 | 0.5 | 0.1 |
| $s_{t-1} = b$ | 0.2 | 0.7 | 0.1 |
| $s_{t-1} = \text{START}$ | 0.5 | 0.5 | 0 |

Table 1: This table shows the transition probabilities $p(s_t \mid s_{t-1})$ between states for the Markov chain model given in Figure 10. Transitions are from the states on the left to the states along the top.

#### 3.6 Unfolding in time

We can display all of the possible sequences of a fixed length $T$ and their associated probabilities by unfolding a Markov chain model in time. In Figure 11 we show the model from Figure 10 unfolded for three time steps. This explicitly shows all of the possible sequences of length three that could be produced, and their probabilities can be obtained by simply multiplying the numbers as one passes along the edges from the START state to the STOP state. Notice that the transition probabilities leaving each state don't sum to one for this unfolded model because it does not represent all possible sequences. Paths leading to sequences that are shorter or longer than three are not included in this figure.
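
The model of Figure 10 is small enough to check these statements directly. The sketch below stores the transition probabilities listed for Figure 10 in a dictionary; the value 0.1 for the transition from $b$ to STOP is the one the normalisation condition requires, since $0.2 + 0.7 + 0.1 = 1$. The function and variable names are illustrative.

```python
from itertools import product

# Transition probabilities for the Figure 10 model; keys are the previous state.
TRANSITIONS = {
    "START": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.4, "b": 0.5, "STOP": 0.1},
    "b": {"a": 0.2, "b": 0.7, "STOP": 0.1},
}

def sequence_probability(seq):
    """p(s) = p(s_1) * prod_t p(s_t | s_{t-1}), ending with a move to STOP."""
    prob = 1.0
    previous = "START"
    for state in list(seq) + ["STOP"]:
        prob *= TRANSITIONS[previous].get(state, 0.0)
        previous = state
    return prob

# Normalisation condition: each row of the transition table sums to one.
for row in TRANSITIONS.values():
    assert abs(sum(row.values()) - 1.0) < 1e-12

print(sequence_probability("aab"))

# Unfolding for T = 3: enumerate all 2**3 sequences and add up their probabilities.
length_three = ["".join(s) for s in product("ab", repeat=3)]
total = sum(sequence_probability(s) for s in length_three)
print(len(length_three), total)
```

The eight probabilities add up to 0.081 rather than one, because sequences shorter or longer than three are left out of the unfolded model.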

Figure 11: Unfolding the model shown in Figure 10 to show all possible sequences of length three.

This unfolded model shows that there is an exponential increase in the number of possible sequences as the length of the sequence increases. In this example there are $2^3 = 8$ possible sequences of length 3. One can see this by observing that there are two choices of state at each of the times $t = 1$, 2 and 3. There are $2^L$ possible distinct sequences of length $L$. This number soon becomes huge as $L$ increases. We therefore have to be very careful to design efficient algorithms that don't explicitly work with this number of sequences. Luckily such algorithms exist, which is one of the reasons that Markov chain models and hidden Markov models are so useful.

#### 3.7 Phoneme state models

In Figure 12 we show a model where the states are the phonemes that combine to make the words "yes" and "no". Each state has a self-transition, which allows the phoneme to be repeated a number of times so that, for example, "yes" could be y-eh-s, y-eh-eh-s or y-y-eh-s-s etc. The reason for these self-transitions is that they can be used to model the duration of each phoneme in speech.

Figure 12: Markov chain model with a phoneme state space $S = \{y, eh, s, n, ow\}$. The model generates phoneme sequences corresponding to the words "yes" and "no". Each phoneme can be repeated a number of times to model its duration within a spoken word.

Table 2: This table shows the transition probabilities $p(s_t \mid s_{t-1})$ between states for the Markov chain model given in Figure 12 (rows: previous state y, eh, s, n, ow, START; columns: next state y, eh, s, n, ow, STOP). Most transition probabilities are zero, indicating that these states are not connected. A matrix like this with many zero entries is called a sparse matrix.

The transition probabilities for this model are given in Table 2. As before, the normalisation condition ensures that the rows sum to one. Notice that most of the entries in the table are zero, indicating that most of the transitions between states are not allowed.
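
A sparse model like this is naturally stored by listing only the allowed transitions. The sketch below generates phoneme sequences from such a table. The structure follows the description of Figure 12, but every probability is a hypothetical placeholder and the function names are invented.

```python
import random

# Sparse transition table for the "yes"/"no" phoneme model: only the
# allowed transitions are stored. The numbers are hypothetical placeholders.
PHONEME_MODEL = {
    "START": {"y": 0.5, "n": 0.5},
    "y": {"y": 0.5, "eh": 0.5},
    "eh": {"eh": 0.6, "s": 0.4},
    "s": {"s": 0.5, "STOP": 0.5},
    "n": {"n": 0.4, "ow": 0.6},
    "ow": {"ow": 0.25, "STOP": 0.75},
}

def generate(model, rng):
    """Follow the arrows from START until STOP; return the visited phonemes."""
    sequence = []
    state = "START"
    while True:
        next_states = list(model[state])
        weights = [model[state][s] for s in next_states]
        state = rng.choices(next_states, weights=weights)[0]
        if state == "STOP":
            return sequence
        sequence.append(state)

def collapse_repeats(sequence):
    """Merge repeated phonemes, e.g. y-eh-eh-s becomes y-eh-s."""
    return [p for i, p in enumerate(sequence) if i == 0 or p != sequence[i - 1]]

rng = random.Random(0)
for _ in range(5):
    phonemes = generate(PHONEME_MODEL, rng)
    print("-".join(phonemes), "->", "-".join(collapse_repeats(phonemes)))
```

Whatever the repetitions, collapsing them always leaves y-eh-s or n-ow, because the zero entries rule out every other path.
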
This sparse structure is typical of Markov chain models for phonemes corresponding to a limited number of specific words, since the number of possible paths is then highly restricted.

In Figure 13 the same model has been unfolded in time to show all possible sequences of exactly three phonemes. The probability of each sequence being produced by the model can easily be worked out by multiplying the transition probabilities along each path,

$$p(\text{y-eh-s}) = p(s_1 = y)\,p(eh \mid y)\,p(s \mid eh)\,p(\text{STOP} \mid s),$$

$$p(\text{n-n-ow}) = p(s_1 = n)\,p(n \mid n)\,p(ow \mid n)\,p(\text{STOP} \mid ow) = 0.09,$$

$$p(\text{n-ow-ow}) = p(s_1 = n)\,p(ow \mid n)\,p(ow \mid ow)\,p(\text{STOP} \mid ow).$$

Figure 13: The phoneme model from Figure 12 unfolded to show all possible sequences of length three.

Since there are two ways to produce "no", we can add them to obtain the probability that a sequence has length $T = 3$ and is "no",

$$p(\text{"no"} \wedge T = 3) = p(\text{n-n-ow}) + p(\text{n-ow-ow}).$$

We can also work out the probability that a sequence of length $T = 3$ is a "no" by using the rules of probability that you should know,

$$p(\text{"no"} \mid T = 3) = \frac{p(\text{"no"} \wedge T = 3)}{p(T = 3)} = \frac{p(\text{"no"} \wedge T = 3)}{p(\text{"no"} \wedge T = 3) + p(\text{"yes"} \wedge T = 3)}.$$

This calculation shows that a sequence of length 3 is more likely to be a "no" than a "yes" in this model. At first this may seem to contradict the model, since Figure 12 shows that sequences representing the words "yes" and "no" are generated with equal probability. This can be seen by observing that the initial transitions to y and n are equally likely, so sequences starting with y and n are equally likely. However, if we only consider short sequences then the model assigns greater probability to the word "no". This word has fewer states than "yes" and the self-transitions from these states have lower probability, resulting in shorter words on average. Therefore, once we condition on the length of the sequence, the two words are no longer equally likely.

#### 3.8 Efficient computation and recursion relations

The above calculation is simple for short sequences. However, the number of possible paths through the model grows exponentially with the sequence length and explicit computation becomes infeasible.
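
To make the contrast concrete, the sketch below computes $p(\text{"no"} \wedge T)$ in two ways: by summing over every path explicitly, and by the step-by-step recursion developed in this section. The transition probabilities are hypothetical, chosen only so that $p(\text{n-n-ow}) = 0.09$ as quoted above; the function names are invented.

```python
from itertools import product

STATES = ["y", "eh", "s", "n", "ow"]
# Hypothetical transition probabilities p(next | previous); missing entries are 0.
P = {
    "START": {"y": 0.5, "n": 0.5},
    "y": {"y": 0.5, "eh": 0.5},
    "eh": {"eh": 0.6, "s": 0.4},
    "s": {"s": 0.5, "STOP": 0.5},
    "n": {"n": 0.4, "ow": 0.6},
    "ow": {"ow": 0.25, "STOP": 0.75},
}

def trans(prev, nxt):
    """Transition probability, zero for unconnected states."""
    return P[prev].get(nxt, 0.0)

def brute_force(first, length):
    """Sum over every path s_1 = first, s_2, ..., s_T, STOP (exponential cost)."""
    total = 0.0
    for rest in product(STATES, repeat=length - 1):
        path = ["START", first, *rest, "STOP"]
        prob = 1.0
        for prev, nxt in zip(path, path[1:]):
            prob *= trans(prev, nxt)
        total += prob
    return total

def recursive(first, length):
    """Same quantity by recursion: cost grows roughly like T * |S|**2."""
    # joint[i] holds p(s_1 = first, s_t = i); start at t = 1.
    joint = {i: (trans("START", first) if i == first else 0.0) for i in STATES}
    for _ in range(length - 1):
        joint = {i: sum(trans(j, i) * joint[j] for j in STATES) for i in STATES}
    return sum(trans(j, "STOP") * joint[j] for j in STATES)

p_no = recursive("n", 3)
p_yes = recursive("y", 3)
print(p_no, brute_force("n", 3))
print(p_no / (p_no + p_yes))  # p("no" | T = 3)
```

Both methods agree, but the brute-force loop visits $|S|^{T-1}$ paths while the recursion only updates one table of $|S|$ numbers per time step.
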
Luckily, the rules of probability provide us with an efficient means to deal with this in Markov chain models. Consider the problem of determining the probability that a sequence is of length $T$ and a "no", as we did above for the special case where $T = 3$. We note that $p(\text{"no"} \wedge T)$ is equivalent to $p(s_1 = n, s_{T+1} = \text{STOP})$ for the model in Figure 12 (we will use a comma in place of $\wedge$ to make the notation below more compact). We can write the sum over all possible sequences between $s_1 = n$ and $s_{T+1} = \text{STOP}$ as

$$p(s_1 = n, s_{T+1} = \text{STOP}) = \sum_{s_2 \in S} \sum_{s_3 \in S} \cdots \sum_{s_T \in S} p(s_1 = n, s_2, s_3, \ldots, s_T, s_{T+1} = \text{STOP})$$

$$= \sum_{s_2 \in S} \sum_{s_3 \in S} \cdots \sum_{s_T \in S} p(s_1 = n)\,p(s_2 \mid s_1 = n)\,p(s_3 \mid s_2) \cdots p(s_{T+1} = \text{STOP} \mid s_T).$$

On the face of it, this looks like a sum over a huge number of paths through the model. In the worst case this sum would require the calculation of $|S|^{T-1}$ terms, where $|S|$ is the size of the state space. However, by evaluating these sums in a particular order, as shown below, we can calculate the sum much more efficiently:

$$p(s_1 = n, s_2) = p(s_2 \mid s_1 = n)\,p(s_1 = n)$$

$$p(s_1 = n, s_3) = \sum_{s_2 \in S} p(s_3 \mid s_2)\,p(s_1 = n, s_2)$$

$$p(s_1 = n, s_4) = \sum_{s_3 \in S} p(s_4 \mid s_3)\,p(s_1 = n, s_3)$$

$$\vdots$$

$$p(s_1 = n, s_T) = \sum_{s_{T-1} \in S} p(s_T \mid s_{T-1})\,p(s_1 = n, s_{T-1})$$

$$p(s_1 = n, s_{T+1} = \text{STOP}) = \sum_{s_T \in S} p(s_{T+1} = \text{STOP} \mid s_T)\,p(s_1 = n, s_T)$$

Except for the first and last lines, we see that each line involves a sum over $|S|$ terms which has to be worked out $|S|$ times. Therefore, each of these lines requires a number of operations (multiplications and additions) roughly proportional to $|S|^2$. The number of these lines is $T - 2$, so we see that the number of operations is at least $(T - 2)\,|S|^2$. This is much better than an exponential scaling and allows the computation to be carried out efficiently for long sequences and quite large state spaces. For models where many states are not connected, such as typical phoneme models, the computation time is even less. The reason for this is that the sums above only need to include connected states, since $p(s_t = i \mid s_{t-1} = j) = 0$ for two unconnected states $i$ and $j$ (recall Table 2). In the 2nd year algorithms course you will see more examples of algorithms like this, and you will learn how to calculate their computational complexity, which determines how the computation time scales with the sequence length and the size of the state space.

We can write the above set of equations in a more compact form as a recursion relation,

$$p(s_1 = n, s_t) = \sum_{s_{t-1} \in S} p(s_t \mid s_{t-1})\,p(s_1 = n, s_{t-1}) \quad \text{for } t > 2, \qquad (8)$$

starting from $p(s_1 = n, s_2) = p(s_2 \mid s_1 = n)\,p(s_1 = n)$, where we have defined $s_{T+1} = \text{STOP}$.

#### 3.9 Training

It is possible to fit a Markov chain model to training data in order to estimate the transition probabilities. The simplest approach is just to use frequency counts from the sequences in the training data.
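
The frequency-count approach can be sketched as follows: count how often each state follows each other state, then divide each count by its row total. The three training sequences are the ones used in this section's worked example (abbaab, ababaaa and abbbba); the function names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

def count_transitions(sequences):
    """N[i][j] = number of times state j follows state i (with START/STOP)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        path = ["START", *seq, "STOP"]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return counts

def normalise(counts):
    """Divide each count by its row total to get transition probabilities."""
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {nxt: Fraction(n, total) for nxt, n in row.items()}
    return probs

training = ["abbaab", "ababaaa", "abbbba"]
probs = normalise(count_transitions(training))
for prev in ("a", "b", "START"):
    print(prev, {nxt: str(p) for nxt, p in probs[prev].items()})
```

Using `Fraction` keeps the estimates exact, so the row for state $b$ comes out as 4/9, 4/9 and 1/9 rather than rounded decimals.
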
Let $N_{ij}$ be the number of times that state $j$ follows state $i$ in the training data. We simply set the transition probabilities to be proportional to these counts, i.e.

$$p(s_t = j \mid s_{t-1} = i) = \frac{N_{ij}}{\sum_{k} N_{ik}} \qquad (9)$$

where the sum over $k$ runs over all states in $S$ together with STOP. The denominator ensures that the probabilities are properly normalised, i.e. the sum of all transition probabilities leaving any state is one.

As an example consider the two-state Markov model in Figure 10. Imagine that the transition probabilities are unknown but that we have observed the following three sequences: abbaab, ababaaa, abbbba. We are asked to estimate the transition probabilities for the model from this training data. First we compute the counts for each type of transition, which we can put in a table as shown below.

| $N_{ij}$ | $j = a$ | $j = b$ | $j = \text{STOP}$ |
| --- | --- | --- | --- |
| $i = a$ | 3 | 5 | 2 |
| $i = b$ | 4 | 4 | 1 |
| $i = \text{START}$ | 3 | 0 | 0 |

Then equation (9) shows that we should normalise the counts by the sum over each respective row to obtain the transition probabilities.

| $p(s_t = j \mid s_{t-1} = i)$ | $j = a$ | $j = b$ | $j = \text{STOP}$ |
| --- | --- | --- | --- |
| $i = a$ | 3/10 | 5/10 | 2/10 |
| $i = b$ | 4/9 | 4/9 | 1/9 |
| $i = \text{START}$ | 1 | 0 | 0 |

#### 3.10 Higher order models

Figure 14: A 2nd order Markov process with state space $S = \{a, b\}$ is equivalent to a 1st order Markov process with an expanded state space $S = \{aa, ab, ba, bb\}$. Notice that some transitions are not allowed.

So far, we have only considered 1st order Markov processes. Higher order processes can capture even greater contextual information. In fact, all higher order Markov processes can be mapped onto a 1st order process with an extended alphabet. For example, if we have a fully connected 2nd order model with state space $S = \{a, b\}$ then this is equivalent to a 1st order model with state space $S = \{aa, ab, ba, bb\}$ with the transitions shown in Figure 14. The extended model is not fully connected, because there is a constraint that states ending with $a$ must be followed by states starting with $a$, and similarly states ending with $b$ must be followed by states starting with $b$. The reason for this is that these states overlap in the original sequence, e.g. abbaaba becomes ab bb ba aa ab ba. Since we can map a higher order Markov process onto a 1st order model, all the theory carries over in a straightforward manner.
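
The mapping onto overlapping pair states, and the constraint it imposes on allowed transitions, can be sketched in a few lines. The function names are illustrative.

```python
def to_pair_states(sequence):
    """Rewrite a sequence over {a, b} as overlapping pairs (expanded states)."""
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]

def is_allowed(prev_pair, next_pair):
    """A pair ending in x must be followed by a pair starting with x."""
    return prev_pair[1] == next_pair[0]

pairs = to_pair_states("abbaaba")
print(pairs)
print(all(is_allowed(p, q) for p, q in zip(pairs, pairs[1:])))
print(is_allowed("ab", "ab"))
```

Every consecutive pair produced from a real sequence satisfies the overlap constraint, while a transition such as ab to ab is ruled out, which is why the expanded model is not fully connected.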

A triphone Markov chain model is an example of a 3rd order phoneme Markov process mapped onto a 1st order process with an extended state space (recall Figure 3). The additional context provided by a triphone model can greatly improve the performance of the hidden Markov model speech recognition methods that we discuss next, because the context of a phoneme can strongly influence the way that it is spoken.

4 Hidden Markov Models

Hidden Markov models (HMMs) combine aspects of the Markov chain model and the feature-based classifier that you have seen in the previous two lectures. Underlying an HMM is a Markov chain model. The difference is that the states of the Markov chain cannot be observed; they are hidden or latent variables. Instead of producing a sequence of states, like a Markov chain, in the HMM each state emits a feature vector according to an emission probability $p(x_t \mid s_t)$. All we observe is a sequence of feature vectors, and we cannot be sure which states produced them. Like a Markov chain model, an HMM is a generative model of sequential data, but the generated sequence is now a sequence of feature vectors, $x_1, x_2, x_3, \ldots, x_t, x_{t+1}, \ldots, x_T$. In the speech recognition example this would be a sequence of MFCC feature vectors.

Figure 15 shows an example of an HMM that can be used to model utterances of the words "yes" and "no" with silence before and after. The transition probabilities have exactly the same meaning as for a Markov chain. They show that the words are equally likely and that the silence before and after is of the same typical duration. As usual, the normalisation condition ensures that the transition probabilities leaving each state must sum to one. The difference between this and a Markov chain model is that each state is now associated with an emission probability distribution, $p(x_t \mid s_t = \text{SIL})$, $p(x_t \mid s_t = \text{yes})$ or $p(x_t \mid s_t = \text{no})$, which describes the distribution of features when each utterance is being spoken.
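To make the generative view concrete, the sketch below samples a state sequence and a feature sequence from a small HMM with the same shape as the yes/no model described above. All numbers are hypothetical: the transition probabilities are not those of Figure 15, the two silence states are given the assumed names `SIL1` and `SIL2`, and a one-dimensional Gaussian stands in for the distribution over MFCC feature vectors.

```python
import random

# Hypothetical transition probabilities p(s_t | s_{t-1}); each row sums to one.
TRANS = {
    "START": {"SIL1": 1.0},
    "SIL1": {"SIL1": 0.8, "yes": 0.1, "no": 0.1},  # both words equally likely
    "yes": {"yes": 0.9, "SIL2": 0.1},
    "no": {"no": 0.9, "SIL2": 0.1},
    "SIL2": {"SIL2": 0.8, "STOP": 0.2},            # same self-loop as SIL1
}

# Hypothetical emission densities p(x_t | s_t): (mean, standard deviation).
EMIT = {
    "SIL1": (0.0, 0.5),
    "SIL2": (0.0, 0.5),
    "yes": (3.0, 1.0),
    "no": (-3.0, 1.0),
}


def sample_hmm(rng):
    """Generate (hidden states, observed features) until STOP is reached."""
    states, features = [], []
    state = "START"
    while True:
        row = TRANS[state]
        state = rng.choices(list(row), weights=list(row.values()))[0]
        if state == "STOP":
            return states, features
        mean, std = EMIT[state]
        states.append(state)
        features.append(rng.gauss(mean, std))  # the state emits one feature


states, features = sample_hmm(random.Random(0))
print(len(states), "time steps; word spoken:", "yes" if "yes" in states else "no")
```

In a recognition setting only `features` would be available; `states` is exactly the hidden information that has to be inferred.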
4.1 Probability of a sequence of features and states

It is straightforward to calculate the joint probability of a sequence of features and a sequence of states. Every time we make a transition to a new state, we have to multiply by the probability that the current feature vector was emitted by that state, i.e.

$$p(x_1, \ldots, x_T, s_1, \ldots, s_T) = p(s_1 \mid \text{START})\, p(x_1 \mid s_1) \left[ \prod_{t=2}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t) \right] p(s_{T+1} \mid s_T) \qquad (10)$$

where $s_{T+1} = \text{STOP}$.

Figure 15: An HMM for modelling "yes" and "no" utterances with silence before and after.

The problem is that we don't know what the state sequence $s = s_1, s_2, \ldots, s_T$ is. If we knew the state sequence, then we would already have solved the problem, since we would know which word has been spoken. Our task is to make inferences using only the feature data. This task can be carried out by two approaches, described below, for which efficient algorithms exist.

4.2 Classification

One approach is to compute the probability of the observed sequence of features for models corresponding to different classes of data. For example, we can construct an HMM for utterances of the word "yes" as shown in Figure 16.

Figure 16: An HMM for modelling the word "yes" with silence before and after.

As before we will call "yes" class $C_1$ and "no" class $C_2$. We can use this model to work out $p(x_1, x_2, \ldots, x_T \mid C_1)$. We can use a similar model of the word "no" to compute $p(x_1, x_2, \ldots, x_T \mid C_2)$. Then Bayes' theorem allows us to classify a sequence,

$$p(C_1 \mid x_1, \ldots, x_T) = \frac{p(x_1, \ldots, x_T \mid C_1)\, p(C_1)}{p(x_1, \ldots, x_T \mid C_1)\, p(C_1) + p(x_1, \ldots, x_T \mid C_2)\, p(C_2)}$$

As usual, we need to assign the prior $p(C_1)$ in order to use this classification rule. We also need a way of efficiently computing $p(x_1, x_2, \ldots, x_T \mid C_i)$ for each model. We can use a similar recursion relation to the one for summing over paths in the Markov chain model, which was given by equation (8). For each model we calculate

$$\begin{aligned}
p(x_1, s_1) &= p(s_1 \mid \text{START})\, p(x_1 \mid s_1), \\
p(x_1, \ldots, x_t, s_t) &= p(x_t \mid s_t) \sum_{s_{t-1}} p(s_t \mid s_{t-1})\, p(x_1, \ldots, x_{t-1}, s_{t-1}) \quad \text{for } t \geq 2, \\
p(x_1, \ldots, x_T) &= \sum_{s_T} p(\text{STOP} \mid s_T)\, p(x_1, \ldots, x_T, s_T).
\end{aligned} \qquad (11)$$

This is known as the Forward Algorithm in the HMM literature, because it involves iterating forward along the sequence from t = 1 until terminating at t = T.

4.3 Decoding

A classification approach works well for distinguishing between a small number of possible words or phrases. However, if we want to recognise whole sentences then this approach is not feasible. We can hardly consider a different model for every possible sentence in the English language. Instead,

we have to use a decoding approach to the problem. We want to determine which state sequence $s = s_1, s_2, \ldots, s_T$ corresponds to a particular sequence of feature vectors $x_1, x_2, \ldots, x_T$ (recall Figure 3). A sensible choice is the sequence $s^*$ that is most likely given the sequence of feature vectors,

$$s^* = \underset{s_1, \ldots, s_T}{\arg\max}\; p(s_1, \ldots, s_T \mid x_1, \ldots, x_T)$$

Here, argmax means that the quantity on the right is maximised with respect to the quantity written underneath. This is an example of an optimisation problem, since we are finding the state sequence that optimises a particular quantity. Optimisation is important in many areas of Artificial Intelligence.

It turns out that there exists an efficient algorithm for working this out, the celebrated Viterbi Algorithm. It has a similar recursive form to the Forward Algorithm. The details are beyond the scope of this course, but if you are interested then there are some good HMM reviews available that discuss it. Rabiner's [2] is a classic and also discusses speech recognition (you can find it via the University website), while Durbin et al. [1] provide a nice introduction to HMMs in a different application domain. These are both at an advanced level.

Viterbi decoding was used to remove the silence from the data you saw in Lab 2. The optimal path through a model like the one in Figure 15 was obtained for all 165 examples in order to identify which parts of the speech were associated with the SIL states. All the feature vectors associated with the SIL states were then removed from the MFCC data files. For Lab 3 you will be working with the original uncropped data.

4.4 Training

The transition and emission probabilities can be estimated by fitting them to training data. If the training data sequences are labelled, so that the corresponding state sequences are known, then training is straightforward.
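As a side-by-side illustration of the recursions referred to in Sections 4.2 and 4.3, the sketch below implements the joint probability of a feature and state sequence, the Forward Algorithm and the Viterbi Algorithm for a tiny HMM. It is a minimal sketch under stated assumptions, not the method used in the labs: the two states, the discrete "quiet"/"loud" features (standing in for continuous MFCC densities) and every probability value are hypothetical. The only difference between the two recursions is that the sum over the previous state is replaced by a maximum, with a record of which previous state achieved it.

```python
# Hypothetical two-state HMM with discrete emissions (illustrative numbers only).
STATES = ["SIL", "yes"]
INIT = {"SIL": 0.8, "yes": 0.2}                # p(s_1 | START)
TRANS = {"SIL": {"SIL": 0.6, "yes": 0.3},      # p(s_t | s_{t-1})
         "yes": {"SIL": 0.2, "yes": 0.7}}
STOP = {"SIL": 0.1, "yes": 0.1}                # p(STOP | s_T)
EMIT = {"SIL": {"quiet": 0.9, "loud": 0.1},    # p(x_t | s_t)
        "yes": {"quiet": 0.2, "loud": 0.8}}


def joint_probability(xs, ss):
    """Joint probability of features xs and states ss, including the final STOP."""
    p = INIT[ss[0]] * EMIT[ss[0]][xs[0]]
    for t in range(1, len(xs)):
        p *= TRANS[ss[t - 1]][ss[t]] * EMIT[ss[t]][xs[t]]
    return p * STOP[ss[-1]]


def forward(xs):
    """Forward Algorithm: p(x_1, ..., x_T), summed over all state sequences."""
    alpha = {s: INIT[s] * EMIT[s][xs[0]] for s in STATES}
    for x in xs[1:]:
        alpha = {s: EMIT[s][x] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha[s] * STOP[s] for s in STATES)


def viterbi(xs):
    """Viterbi Algorithm: most probable state sequence and its joint probability."""
    delta = {s: INIT[s] * EMIT[s][xs[0]] for s in STATES}
    backpointers = []
    for x in xs[1:]:
        # For each state, remember the best previous state (max instead of sum).
        best_prev = {s: max(STATES, key=lambda r: delta[r] * TRANS[r][s])
                     for s in STATES}
        delta = {s: EMIT[s][x] * delta[best_prev[s]] * TRANS[best_prev[s]][s]
                 for s in STATES}
        backpointers.append(best_prev)
    last = max(STATES, key=lambda s: delta[s] * STOP[s])
    best_prob = delta[last] * STOP[last]
    # Trace the stored choices backwards to recover the whole path.
    path = [last]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path, best_prob


observed = ["quiet", "loud", "loud", "quiet"]
print("p(x_1..x_T) =", forward(observed))
print("best path   =", viterbi(observed))
```

For long sequences these products underflow, so a practical implementation would work with log probabilities; that refinement is omitted here to keep the correspondence with the formulas visible. Maximising the joint probability over state sequences gives the same path as maximising the conditional probability given the features, because the two differ only by the factor $p(x_1, \ldots, x_T)$, which does not depend on the states.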
When the state sequences are known, the transition probabilities can be calculated using counts in the same way as for the Markov chain model, while the emission probability densities can be estimated in a similar fashion to the normal density models described earlier. However, hand-labelling of training data is difficult, time-consuming and error-prone. It is more usual that the training data is not labelled, although the sentence or words being spoken in the training data will usually be known. In this case training can be carried out using the Baum-Welch algorithm. The details of this algorithm are beyond the scope of the course. It is a special case of the EM algorithm, which is also very useful in many other areas of Machine Learning.

Acknowledgements

Many thanks to Gwenn Englebienne for providing the figures for the signal processing examples and for help in processing the speech data.

References

[1] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University Press.

[2] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77.

### Hidden Markov Models. and. Sequential Data

Hidden Markov Models and Sequential Data Sequential Data Often arise through measurement of time series Snowfall measurements on successive days in Buffalo Rainfall measurements in Chirrapunji Daily values

### Markov chains and Markov Random Fields (MRFs)

Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume

### Hidden Markov Models

Hidden Markov Models Phil Blunsom pcbl@cs.mu.oz.au August 9, 00 Abstract The Hidden Markov Model (HMM) is a popular statistical tool for modelling a wide range of time series data. In the context of natural

### Lecture 11: Graphical Models for Inference

Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference - the Bayesian network and the Join tree. These two both represent the same joint probability

### Tagging with Hidden Markov Models

Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,

### Lecture 10: Sequential Data Models

CSC2515 Fall 2007 Introduction to Machine Learning Lecture 10: Sequential Data Models 1 Example: sequential data Until now, considered data to be i.i.d. Turn attention to sequential data Time-series: stock

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### Signal Processing for Speech Recognition

Signal Processing for Speech Recognition Once a signal has been sampled, we have huge amounts of data, often 20,000 16 bit numbers a second! We need to find ways to concisely capture the properties of

### Senior Secondary Australian Curriculum

Senior Secondary Australian Curriculum Mathematical Methods Glossary Unit 1 Functions and graphs Asymptote A line is an asymptote to a curve if the distance between the line and the curve approaches zero

### Hidden Markov Models for biological systems

Hidden Markov Models for biological systems 1 1 1 1 0 2 2 2 2 N N KN N b o 1 o 2 o 3 o T SS 2005 Heermann - Universität Heidelberg Seite 1 We would like to identify stretches of sequences that are actually

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### Language Modeling. Chapter 1. 1.1 Introduction

Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

### Hidden Markov Model. Jia Li. Department of Statistics The Pennsylvania State University. Hidden Markov Model

Hidden Markov Model Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Hidden Markov Model Hidden Markov models have close connection with mixture models. A mixture model

### Hardware Implementation of Probabilistic State Machine for Word Recognition

IJECT Vo l. 4, Is s u e Sp l - 5, Ju l y - Se p t 2013 ISSN : 2230-7109 (Online) ISSN : 2230-9543 (Print) Hardware Implementation of Probabilistic State Machine for Word Recognition 1 Soorya Asokan, 2

### Machine Learning I Week 14: Sequence Learning Introduction

Machine Learning I Week 14: Sequence Learning Introduction Alex Graves Technische Universität München 29. January 2009 Literature Pattern Recognition and Machine Learning Chapter 13: Sequential Data Christopher

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### A frequency distribution is a table used to describe a data set. A frequency table lists intervals or ranges of data values called data classes

A frequency distribution is a table used to describe a data set. A frequency table lists intervals or ranges of data values called data classes together with the number of data values from the set that

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Hidden Markov Models Fundamentals

Hidden Markov Models Fundamentals Daniel Ramage CS229 Section Notes December, 2007 Abstract How can we apply machine learning to data that is represented as a sequence of observations over time? For instance,

### Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction Hanna M. Wallach February 24, 2004 1 Labeling Sequential Data The task of assigning label sequences to a set of observation sequences arises in many fields, including

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### PowerTeaching i3: Algebra I Mathematics

PowerTeaching i3: Algebra I Mathematics Alignment to the Common Core State Standards for Mathematics Standards for Mathematical Practice and Standards for Mathematical Content for Algebra I Key Ideas and

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### Sampling Distributions and the Central Limit Theorem

135 Part 2 / Basic Tools of Research: Sampling, Measurement, Distributions, and Descriptive Statistics Chapter 10 Sampling Distributions and the Central Limit Theorem In the previous chapter we explained

### Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0

Probability & Statistics Primer Gregory J. Hakim University of Washington 2 January 2009 v2.0 This primer provides an overview of basic concepts and definitions in probability and statistics. We shall

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

### a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

### An introduction to Hidden Markov Models

An introduction to Hidden Markov Models Christian Kohlschein Abstract Hidden Markov Models (HMM) are commonly defined as stochastic finite state machines. Formally a HMM can be described as a 5-tuple Ω

### Monte Carlo Method: Probability

John (ARC/ICAM) Virginia Tech... Math/CS 4414: The Monte Carlo Method: PROBABILITY http://people.sc.fsu.edu/ jburkardt/presentations/ monte carlo probability.pdf... ARC: Advanced Research Computing ICAM:

### AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS

AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS PIERRE LANCHANTIN, ANDREW C. MORRIS, XAVIER RODET, CHRISTOPHE VEAUX Very high quality text-to-speech synthesis can be achieved by unit selection

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS 1. SYSTEMS OF EQUATIONS AND MATRICES 1.1. Representation of a linear system. The general system of m equations in n unknowns can be written a 11 x 1 + a 12 x 2 +

### The Basics of Graphical Models

The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Cell Phone based Activity Detection using Markov Logic Network

Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart

### CONATION: English Command Input/Output System for Computers

CONATION: English Command Input/Output System for Computers Kamlesh Sharma* and Dr. T. V. Prasad** * Research Scholar, ** Professor & Head Dept. of Comp. Sc. & Engg., Lingaya s University, Faridabad, India

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### S = {1, 2,..., n}. P (1, 1) P (1, 2)... P (1, n) P (2, 1) P (2, 2)... P (2, n) P = . P (n, 1) P (n, 2)... P (n, n)

last revised: 26 January 2009 1 Markov Chains A Markov chain process is a simple type of stochastic process with many social science applications. We ll start with an abstract description before moving

### APPLICATION OF HIDDEN MARKOV CHAINS IN QUALITY CONTROL

INTERNATIONAL JOURNAL OF ELECTRONICS; MECHANICAL and MECHATRONICS ENGINEERING Vol.2 Num.4 pp.(353-360) APPLICATION OF HIDDEN MARKOV CHAINS IN QUALITY CONTROL Hanife DEMIRALP 1, Ehsan MOGHIMIHADJI 2 1 Department

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### Basics Inversion and related concepts Random vectors Matrix calculus. Matrix algebra. Patrick Breheny. January 20

Matrix algebra January 20 Introduction Basics The mathematics of multiple regression revolves around ordering and keeping track of large arrays of numbers and solving systems of equations The mathematical

### Markov Chains, part I

Markov Chains, part I December 8, 2010 1 Introduction A Markov Chain is a sequence of random variables X 0, X 1,, where each X i S, such that P(X i+1 = s i+1 X i = s i, X i 1 = s i 1,, X 0 = s 0 ) = P(X

### Interactive Math Glossary Terms and Definitions

Terms and Definitions Absolute Value the magnitude of a number, or the distance from 0 on a real number line Additive Property of Area the process of finding an the area of a shape by totaling the areas

### Chapter 4 Lecture Notes

Chapter 4 Lecture Notes Random Variables October 27, 2015 1 Section 4.1 Random Variables A random variable is typically a real-valued function defined on the sample space of some experiment. For instance,

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### Continued Fractions and the Euclidean Algorithm

Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction

### MUSICAL INSTRUMENT FAMILY CLASSIFICATION

MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.

### Mathematics Course 111: Algebra I Part IV: Vector Spaces

Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### PSS 27.2 The Electric Field of a Continuous Distribution of Charge

Chapter 27 Solutions PSS 27.2 The Electric Field of a Continuous Distribution of Charge Description: Knight Problem-Solving Strategy 27.2 The Electric Field of a Continuous Distribution of Charge is illustrated.

### MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a

### The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1].

Probability Theory Probability Spaces and Events Consider a random experiment with several possible outcomes. For example, we might roll a pair of dice, flip a coin three times, or choose a random real

### Lecture 12: An Overview of Speech Recognition

Lecture : An Overview of peech Recognition. Introduction We can classify speech recognition tasks and systems along a set of dimensions that produce various tradeoffs in applicability and robustness. Isolated

### Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations C. Wright, L. Ballard, S. Coull, F. Monrose, G. Masson Talk held by Goran Doychev Selected Topics in Information Security and

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

### The basic unit in matrix algebra is a matrix, generally expressed as: a 11 a 12. a 13 A = a 21 a 22 a 23

(copyright by Scott M Lynch, February 2003) Brief Matrix Algebra Review (Soc 504) Matrix algebra is a form of mathematics that allows compact notation for, and mathematical manipulation of, high-dimensional

### Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note 11

CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note Conditional Probability A pharmaceutical company is marketing a new test for a certain medical condition. According

### Discrete probability and the laws of chance

Chapter 8 Discrete probability and the laws of chance 8.1 Introduction In this chapter we lay the groundwork for calculations and rules governing simple discrete probabilities. These steps will be essential

### CHAPTER 3 Numbers and Numeral Systems

CHAPTER 3 Numbers and Numeral Systems Numbers play an important role in almost all areas of mathematics, not least in calculus. Virtually all calculus books contain a thorough description of the natural,

### Advanced 3G and 4G Wireless Communication Prof. Aditya K. Jagannatham Department of Electrical Engineering Indian Institute of Technology, Kanpur

Advanced 3G and 4G Wireless Communication Prof. Aditya K. Jagannatham Department of Electrical Engineering Indian Institute of Technology, Kanpur Lecture - 3 Rayleigh Fading and BER of Wired Communication

### Deterministic Problems

Chapter 2 Deterministic Problems 1 In this chapter, we focus on deterministic control problems (with perfect information), i.e, there is no disturbance w k in the dynamics equation. For such problems there

### Introduction. Independent Component Analysis

Independent Component Analysis 1 Introduction Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multidimensional) statistical data. What distinguishes

### Graphical Models. CS 343: Artificial Intelligence Bayesian Networks. Raymond J. Mooney. Conditional Probability Tables.

Graphical Models CS 343: Artificial Intelligence Bayesian Networks Raymond J. Mooney University of Texas at Austin 1 If no assumption of independence is made, then an exponential number of parameters must

### Hidden Markov Models in Bioinformatics with Application to Gene Finding in Human DNA Machine Learning Project

Hidden Markov Models in Bioinformatics with Application to Gene Finding in Human DNA 308-761 Machine Learning Project Kaleigh Smith January 17, 2002 The goal of this paper is to review the theory of Hidden

### High School Algebra 1 Common Core Standards & Learning Targets

High School Algebra 1 Common Core Standards & Learning Targets Unit 1: Relationships between Quantities and Reasoning with Equations CCS Standards: Quantities N-Q.1. Use units as a way to understand problems

MA 134 Lecture Notes August 20, 2012 Introduction The purpose of this lecture is to... Introduction The purpose of this lecture is to... Learn about different types of equations Introduction The purpose

### Introduction. The Aims & Objectives of the Mathematical Portion of the IBA Entry Test

Introduction The career world is competitive. The competition and the opportunities in the career world become a serious problem for students if they do not do well in Mathematics, because then they are

### 15 Markov Chains: Limiting Probabilities

MARKOV CHAINS: LIMITING PROBABILITIES 67 Markov Chains: Limiting Probabilities Example Assume that the transition matrix is given by 7 2 P = 6 Recall that the n-step transition probabilities are given

### Identification of Exploitation Conditions of the Automobile Tire while Car Driving by Means of Hidden Markov Models

Identification of Exploitation Conditions of the Automobile Tire while Car Driving by Means of Hidden Markov Models Denis Tananaev, Galina Shagrova, Victor Kozhevnikov North-Caucasus Federal University,

### MATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.

MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### 6.3 Conditional Probability and Independence

222 CHAPTER 6. PROBABILITY 6.3 Conditional Probability and Independence Conditional Probability Two cubical dice each have a triangle painted on one side, a circle painted on two sides and a square painted

### 2014 Assessment Report. Mathematics and Statistics. Calculus Level 3 Statistics Level 3

National Certificate of Educational Achievement 2014 Assessment Report Mathematics and Statistics Calculus Level 3 Statistics Level 3 91577 Apply the algebra of complex numbers in solving problems 91578

### Domain Essential Question Common Core Standards Resources

Middle School Math 2016-2017 Domain Essential Question Common Core Standards First Ratios and Proportional Relationships How can you use mathematics to describe change and model real world solutions? How

### Face Recognition using Principle Component Analysis

Face Recognition using Principle Component Analysis Kyungnam Kim Department of Computer Science University of Maryland, College Park MD 20742, USA Summary This is the summary of the basic idea about PCA

### Bracken County Schools Curriculum Guide Math

Unit 1: Expressions and Equations (Ch. 1-3) Suggested Length: Semester Course: 4 weeks Year Course: 8 weeks Program of Studies Core Content 1. How do you use basic skills and operands to create and solve

### BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION

BLIND SOURCE SEPARATION OF SPEECH AND BACKGROUND MUSIC FOR IMPROVED SPEECH RECOGNITION P. Vanroose Katholieke Universiteit Leuven, div. ESAT/PSI Kasteelpark Arenberg 10, B 3001 Heverlee, Belgium Peter.Vanroose@esat.kuleuven.ac.be

### Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March

### BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

### Linear Systems and Gaussian Elimination

Eivind Eriksen Linear Systems and Gaussian Elimination September 2, 2011 BI Norwegian Business School Contents 1 Linear Systems................................................ 1 1.1 Linear Equations...........................................

### EXPONENTS. To the applicant: KEY WORDS AND CONVERTING WORDS TO EQUATIONS

To the applicant: The following information will help you review math that is included in the Paraprofessional written examination for the Conejo Valley Unified School District. The Education Code requires

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Infinite Algebra 1 supports the teaching of the Common Core State Standards listed below.

Infinite Algebra 1 Kuta Software LLC Common Core Alignment Software version 2.05 Last revised July 2015 Infinite Algebra 1 supports the teaching of the Common Core State Standards listed below. High School

### Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 18. A Brief Introduction to Continuous Probability

CS 7 Discrete Mathematics and Probability Theory Fall 29 Satish Rao, David Tse Note 8 A Brief Introduction to Continuous Probability Up to now we have focused exclusively on discrete probability spaces

### This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

### Overview of Math Standards

Algebra 2 Welcome to math curriculum design maps for Manhattan- Ogden USD 383, striving to produce learners who are: Effective Communicators who clearly express ideas and effectively communicate with diverse

### Chapter 4. Probability and Probability Distributions

Chapter 4. robability and robability Distributions Importance of Knowing robability To know whether a sample is not identical to the population from which it was selected, it is necessary to assess the

### Section 1.1. Introduction to R n

The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to

### Speech Recognition Basics

Sp eec h A nalysis and Interp retatio n L ab Speech Recognition Basics Signal Processing Signal Processing Pattern Matching One Template Dictionary Signal Acquisition Speech is captured by a microphone.

### MATHEMATICS (CLASSES XI XII)

MATHEMATICS (CLASSES XI XII) General Guidelines (i) All concepts/identities must be illustrated by situational examples. (ii) The language of word problems must be clear, simple and unambiguous. (iii)

### A Supervised Approach To Musical Chord Recognition

Pranav Rajpurkar Brad Girardeau Takatoki Migimatsu Stanford University, Stanford, CA 94305 USA pranavsr@stanford.edu bgirarde@stanford.edu takatoki@stanford.edu Abstract In this paper, we present a prototype

### Algebra I Pacing Guide Days Units Notes 9 Chapter 1 (1.1-1.4, 1.6-1.7)

Algebra I Pacing Guide Days Units Notes 9 Chapter 1 (1.1-1.4, 1.6-1.7) Expressions, Equations and Functions Differentiate between and write expressions, equations and inequalities as well as applying order

### DETERMINANTS

DETERMINANTS 1 Systems of two equations in two unknowns A system of two equations in two unknowns has the form $a_{11} x_1 + a_{12} x_2 = b_1$, $a_{21} x_1 + a_{22} x_2 = b_2$. This can be written more concisely in

### The School District of Palm Beach County ALGEBRA 1 REGULAR Section 1: Expressions

MAFS.912.A-APR.1.1 MAFS.912.A-SSE.1.1 MAFS.912.A-SSE.1.2 MAFS.912.N-RN.1.1 MAFS.912.N-RN.1.2 MAFS.912.N-RN.2.3 Mathematics Florida August 16 - September 2 Understand that polynomials form a system analogous

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### 2. Describing Data. We consider 1. Graphical methods 2. Numerical methods 1 / 56

2. Describing Data We consider 1. Graphical methods 2. Numerical methods 1 / 56 General Use of Graphical and Numerical Methods Graphical methods can be used to visually and qualitatively present data and

### CHAPTER 2. Inequalities

CHAPTER 2 Inequalities In this section we add the axioms that describe the behavior of inequalities (the order axioms) to the list of axioms begun in Chapter 1. A thorough mastery of this section is essential