Advanced Matlab: Exploratory Data Analysis and Computational Statistics. Mark Steyvers


Advanced Matlab: Exploratory Data Analysis and Computational Statistics
Mark Steyvers
January 14, 2015

Contents

Part I: Exploratory Data Analysis

1 Basic Data Analysis
  1.1 Organizing and Summarizing Data
  1.2 Visualizing Data

2 Dimensionality Reduction
  2.1 Independent Component Analysis
      Applications of ICA
      How does ICA work?
      Matlab examples

Part II: Probabilistic Modeling

3 Sampling from Random Variables
  3.1 Standard distributions
  3.2 Sampling from non-standard distributions
      Inverse transform sampling with discrete variables
      Inverse transform sampling with continuous variables
      Rejection sampling

4 Markov Chain Monte Carlo
  Monte Carlo integration
  Markov Chains
  Putting it together: Markov chain Monte Carlo
  Metropolis Sampling
  Metropolis-Hastings Sampling
  Metropolis-Hastings for Multivariate Distributions
      Blockwise updating
      Componentwise updating
  Gibbs Sampling

5 Basic concepts in Bayesian Data Analysis
  Parameter Estimation Approaches
      Maximum Likelihood
      Maximum a posteriori
      Posterior Sampling
  Example: Estimating a Weibull distribution

6 Directed Graphical Models
  A Short Review of Probability Theory
  The Burglar Alarm Example
      Conditional probability tables
      Explaining away
      Joint distributions and independence relationships
  Graphical Model Notation
  Example: Consensus Modeling with Gaussian variables

7 Sequential Monte Carlo
  Hidden Markov Models
      Example HMM with discrete outcomes and states
      Viterbi Algorithm
  Bayesian Filtering
  Particle Filters
      Sampling Importance Resampling (SIR)
      Direct Simulation

Note to Students

Exercises

This course book contains a number of exercises in which you are asked to run Matlab code, produce new code, and produce graphical illustrations and answers to questions. The exercises marked with ** are optional exercises that can be skipped when time is limited.

Matlab documentation

It will probably happen many times that you need to find the name of a Matlab function or a description of the input and output variables for a given Matlab function. It is strongly recommended to keep the Matlab documentation running in a separate window for quick consultation. You can access the Matlab documentation by typing doc in the command window. For specific help on a given Matlab function, such as the function fprintf, you can type doc fprintf to get a help screen in the Matlab documentation window, or help fprintf to get a description in the Matlab command window.

Organizing answers to exercises

It is helpful to maintain a document that organizes all the material related to the exercises. Matlab can facilitate part of this organization using the publish option. For example, if you have a Matlab script that produces a figure, you can publish the code as well as the figure produced by the code to a single external document such as a pdf file. You can find the publishing option in the Matlab editor under the publish menu. You can change the publish configuration (look under the file menu of the editor window) to produce pdfs by changing the output file format under the edit configurations window.
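As a small illustration of the publish workflow, publishing can also be invoked directly from the command window; the script name myscript.m below is a placeholder for one of your own exercise scripts:

```matlab
% Publish a script (placeholder file name 'myscript.m') to a PDF document.
% By default, the published output is written to an 'html' subfolder
% of the folder containing the script.
publish( 'myscript.m', 'pdf' );
```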

Part I

Exploratory Data Analysis

Chapter 1

Basic Data Analysis

1.1 Organizing and Summarizing Data

When analyzing any kind of data, it is important to make the data available in an intuitive representation that is easy to change. It is also useful when sharing data with other researchers to package all experimental or modeling data into a single container that can be easily shared and documented. Matlab has two ways to package data into a single variable that can contain many different types of variables. One method is to use structures. Another method that was recently introduced in Matlab is based on tables.

In tables, standard indexing can be used to access individual elements or sets of elements. Suppose that t is a table. The element in the fourth row, second column can be accessed by t(4,2). The result could be a scalar value, a string, a cell array, or whatever type of variable is stored in that element of the table. To access the second row of the table, you can use t(2,:). Similarly, to access the second column of the table, you can use t(:,2). Note that the last two operations return another table as a result. Another way to access tables is by using the names of the columns. Suppose the name of the second column is Gender. To access all the values from this column, you can use t.Gender(:).

Below is some Matlab code to create identical data representations using either a structure or a table. The gray text shows the resulting output produced in the command window. From the example code, it might not be apparent what the relative advantages and disadvantages of structures and tables are. The following exercises will hopefully make it clearer why using the new table format might be advantageous. It will be useful to read the Matlab documentation on structures and tables under /LanguageFundamentals/DataTypes/Structures and /LanguageFundamentals/DataTypes/Tables.

% Create a structure "d" with some example data
d.age = [ ];
d.gender = { 'Male', 'Female', 'Female', 'Male' };
d.id = [ ];

% Show d
disp( d )

% Create a Table "t" with the same information
t = table( [ ], ...
    { 'Male', 'Female', 'Female', 'Male' }, ...
    [ ], ...
    'VariableNames', { 'Age', 'Gender', 'ID' } );

% Show this table
disp( t )

% Copy the Age values to a variable x
x = t.Age

% Extract the second row of the table
row = t( 2, : )

Age: [4x1 double]
Gender: {4x1 cell}
ID: [4x1 double]

Age    Gender    ID
32     Male
       Female
       Female
       Male      445

x =

row =

Age    Gender    ID

24     Female    433

Exercises

1. In this exercise, we will load some sample data into Matlab and represent the data internally in a structure. In Matlab, execute the following command to create the structure d, which contains some sample data about patients:

d = table2struct( readtable('patients.dat'), 'ToScalar', true );

Show in a single Matlab script how to a) calculate the mean age of the males, b) delete the data entries that correspond to smokers, and c) sort the entries according to age.

2. Let's repeat the last exercise but now represent the data internally with a table. In Matlab, execute the following command to create the table t, which contains the same data about patients:

t = readtable('patients.dat');

Show in a single Matlab script how to a) extract the first row of the table, b) extract the numeric values of age from the Age column, c) calculate the mean age of the males, d) delete the data entries that correspond to smokers, and e) sort the entries according to age. What is the advantage of using the table representation?

3. With the table representation of the patient data, use the tabulate function to calculate the frequency distribution of locations. What percentage of patients are located at the VA hospital?

4. Use the crosstab function to calculate the contingency table of Gender by Smoker. How many female smokers are there in the sample?

5. Use the prctile function to calculate the 25th and 75th percentiles of the weights in the sample.

1.2 Visualizing Data

Exercises

For these exercises, we will use data from the Human Connectome Project (HCP). This data is accessible in Excel format from data/hcpdata1.xlsx. In the subset of the HCP data set that we will look at, there are 500 subjects for which the gender, age, height and weight of each individual subject is recorded. Save the Excel file to a local directory that you can access with Matlab.
You can load the data into Matlab using t = readtable('hcpdata1.xlsx');. For these exercises, it will be helpful to read the documentation of the histogram, scatter, and normpdf functions.
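As a hedged starting point for these exercises, the sketch below shows how the histogram options control normalization and bin width, and how a Normal density can be overlaid; the heights variable is simulated stand-in data (drawn from the population distribution mentioned in the exercises), not the actual HCP values:

```matlab
% Stand-in data: 500 simulated heights (inches), for illustration only
heights = normrnd( 66.6, 4.7, 500, 1 );

% Probability-normalized histogram with a fixed bin width
figure;
histogram( heights, 'BinWidth', 5, 'Normalization', 'probability' );
xlabel( 'Height (inches)' ); ylabel( 'Probability' );

% Density-scaled histogram with a theoretical Normal overlay
figure; hold on;
histogram( heights, 'Normalization', 'pdf' );
xs = linspace( 50, 85, 100 );
plot( xs, normpdf( xs, 66.6, 4.7 ), 'k-' );
xlabel( 'Height (inches)' ); ylabel( 'Density' );
```

The same pattern, with the simulated vector replaced by a column extracted from the HCP table, is one way to approach Figures 1.1 and 1.2.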

Figure 1.1: Example visualization of an empirical distribution of two different samples

Figure 1.2: Example visualization of an empirical distribution ("Distribution of Heights", HCP sample, density vs. height in inches) and a theoretical population distribution

Figure 1.3: Example boxplot visualization

1. Recreate Figure 1.1 as best as possible. This figure shows the histogram of weights for males and females. Note that the vertical axis shows probabilities, not counts. The width of the bins is 5 lbs.

2. Recreate Figure 1.2 as best as possible. This figure shows the histogram of heights (regardless of gender). Note that the vertical axis shows density. The figure also has an overlay of the population distribution of adult height. For this population distribution, you can use a Normal distribution with a mean and standard deviation of 66.6 and 4.7 inches, respectively.

3. Recreate Figure 1.3 as best as possible using the boxplot function.

4. Recreate Figure 1.4, upper panel, as best as possible. This figure shows a scatter plot of the heights and weights for each gender.

5. Find some data that you find interesting and visualize it with a custom Matlab figure where you change some of the default parameters.

**6. Recreate Figure 1.4, bottom panel, as best as possible. Note that this figure is a better visualization of the relationship between two variables in case there are several (x,y) pairs that are identical or visually very similar. In this plot, also known as a bubble plot, the bubble size indicates the frequency of encountering an (x,y) pair in a particular region of the space. One way to approach this problem is to use the scatter function and scale the sizes of the markers with the observation frequency. In this particular visualization, the markers were made transparent to help visualize the bubbles for multiple groups.

Figure 1.4: Example visualizations ("Heights and Weights of HCP Subjects", weight vs. height by gender) with scatter plots using the standard scatter function (top) and a custom-made bubble plot function with transparent patch objects (bottom)

Chapter 2

Dimensionality Reduction

An important part of exploratory data analysis is to get an understanding of the structure of the data, especially when a large number of variables or measurements are involved. Modern data sets are often high-dimensional. For example, in neuroimaging studies involving EEG, brain signals are measured at many (often 100+) electrodes on the scalp. In fMRI studies, the BOLD signal is typically measured for over 100K voxels. When analyzing text documents, the raw data might consist of counts of words in different documents, which can lead to extremely large matrices (e.g., how many times is word X mentioned in document Y?). With this many measurements, it is challenging to visualize and understand the raw data. In the absence of any specific theory to analyze the data, it can be very useful to apply dimensionality-reduction techniques. Specifically, the goal might be to find a low-dimensional ("simpler") description of the original high-dimensional data. There are a number of dimensionality reduction techniques. We will discuss two standard approaches: independent component analysis (ICA) and principal component analysis (PCA).

2.1 Independent Component Analysis

Note: the material in this section is based on a tutorial on ICA by Hyvarinen and Oja (2000), which can be found online, and on material from the Computational Statistics Handbook with Matlab by Martinez and Martinez.

The easiest way to understand ICA is to think about the blind-source separation problem. In this context, a number of signals (i.e., measurements or variables) are observed, and each signal is believed to be a linear combination of some unobserved source signals. The goal in blind-source separation is to infer the unobserved source signals without the aid of information about the nature of the signals. Let's give a concrete example through the cocktail-party problem (see Figure 2.1). Suppose you are attending a party with several simultaneous conversations.
Several microphones, located at different places in the room, are simultaneously recording the conversations. Each microphone recording can be considered a linear mixture of each independent conversation. How can we infer what the original conversations were at each location from the observed mixtures?

Figure 2.1: Illustration of the blind-source separation problem

Applications of ICA

There are a number of applications of ICA that are related to blind-source separation. First, ICA has been applied to natural images in order to understand the statistical properties of real-world natural scenes. The independent components extracted from natural scenes turn out to be qualitatively similar to the visual receptive fields found in early visual cortex (see Figure 2.2), suggesting that the visual cortex might be operating in a manner similar to ICA to find the independent structure in visual scenes.

Another application of ICA is in the analysis of resting-state data in fMRI analysis. For example, when subjects are not performing any particular task (they are "resting") while in the scanner, the correlations in the hemodynamic response pattern across brain regions suggest the presence of functional networks that might support a variety of cognitive functions. ICA has been applied to the BOLD activation to find the independent components of brain activation (see Figure 2.3), where each independent component is hypothesized to represent a functional network.

Finally, another application of ICA is to remove artifacts in neuroimaging data. For example, when recording EEG data, eye blinks, muscle activity, and heart noise can significantly contaminate the EEG signal. Figure 2.4 shows a 3-second portion of EEG data across 20 locations. The bottom two time series in the left panel show the electrooculographic (EOG) signal that measures eye blinks. Note that an eye blink occurs around 1.8 seconds. Some of the derived ICA components show a sensitivity to the occurrence of these eye blinks (for example, IC1 and IC2). The EEG signals can be corrected by removing the signals

Figure 2.2: Independent components extracted from natural images

Figure 2.3: Independent components extracted from resting state fMRI (from Storti et al. (2013), Frontiers in Neuroscience)

corresponding to the bad independent components. Therefore, ICA can serve as a preprocessing tool that removes undesirable artifacts from the data.

Figure 2.4: Recorded EEG time series and its ICA component activations, the scalp topographies of four selected components, and the artifact-corrected EEG signals obtained by removing four selected EOG and muscle noise components from the data. From Jung et al. (2000), Psychophysiology, 37.

How does ICA work?

In this section, we will illustrate the ideas behind ICA without any in-depth mathematical treatment. To go back to the problem of blind-source separation, suppose there are two source signals, denoted by s1(t) and s2(t), and two microphones that record the mixture signals x1(t) and x2(t). Let's assume that each of the recorded signals is a linear weighted combination of the source signals. We can then express the relationship between x and s as a linear equation:

x1(t) = a11 s1(t) + a12 s2(t)
x2(t) = a21 s1(t) + a22 s2(t)        (2.1)

Note that a11, a12, a21, and a22 are weighting parameters that determine how the original signals are mixed. The problem now is to estimate the original signals s from just the

observed mixtures x, without knowing the mixing parameters a and having very little knowledge about the nature of the original signals. Independent component analysis (ICA) is one technique to approach this problem. Note that ICA is formulated as a linear model: the original source signals are combined in a linear fashion to obtain the observed signals. We can rewrite the model such that the observed signals are the inputs, and we use weights w11, w12, w21, and w22 to create new outputs y1(t) and y2(t):

y1(t) = w11 x1(t) + w12 x2(t)
y2(t) = w21 x1(t) + w22 x2(t)        (2.2)

We can also write the previous two equations in matrix form:

Y = W X
X = A S        (2.3)

Through linear algebra [1], it can be shown that if we set the weight matrix W to the inverse of the original weight matrix A, then Y = S. This means that there exists a set of weights W that will transform the observed mixtures X into the source signals S. The question now is how to find the appropriate set of weights W. In the examples below, we will first show some demonstrations of ICA with a function that will automatically produce the independent components. We will then give some example code that will illustrate the method behind ICA.

Matlab examples

There are many different toolboxes for ICA. We will use the fastica toolbox because it is simple and fast. Note that when you need to apply ICA to specialized tasks such as EEG artifact removal or resting-state analysis, there are more suitable Matlab packages for these tasks.

To get started, let's give a few demonstrations of ICA in the context of blind source separation. Listing 2.1 shows Matlab code that produces the visual output shown in Figure 2.5. In this example, there are two input signals corresponding to sine waves with different frequencies. These two signals are weighted to produce two mixtures x1 and x2.
These are provided as inputs to the function fastica, which produces the independent components y1 and y2. Note how the reconstructed signals are not quite the same as the original signals. The first independent component (y1) is similar to the second source signal (s2), and the second independent component (y2) is similar to the first source signal (s1), but in both cases the amplitudes are different. This highlights two properties of ICA: the original ordering and variances of the source signals cannot be recovered.

[1] Note that if W = A^(-1), we can write Y = W X = A^(-1) A S = S.
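To make the footnote concrete, here is a small numeric sketch showing that, if the true mixing matrix A were known, applying W = inv(A) to the mixtures recovers the sources exactly; the mixing coefficients below are arbitrary illustrative values, chosen only to make A invertible:

```matlab
% Two source signals stacked as rows of S
tgrid = linspace( 0, 10, 1000 );
S = [ sin(tgrid); cos(3*tgrid) ];

% Arbitrary invertible mixing matrix (illustrative values)
A = [ 1.0 -2.0; 1.7 3.4 ];
X = A * S; % observed mixtures

% With the true mixing matrix known, its inverse unmixes exactly
W = inv( A );
Y = W * X;

disp( max( abs( Y(:) - S(:) ) ) ) % essentially zero, up to floating-point error
```

The whole difficulty of ICA is that A is unknown, so W must be estimated from the statistics of X alone.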

Listing 2.1: Demonstration of ICA applied to mixed sine wave signals.

%% ICA Demo 1: Mix two sine waves and unmix them with fastica
% (file = icademo1.m)

% Create two source signals (sine waves with different frequencies)
s1 = sin(linspace(0,10,1000));
s2 = sin(linspace(0,17,1000)+5);

rng( 1 ); % set the random seed for replicability

% plot the source signals
figure(1); clf;
subplot(2,3,1); plot(s1,'r'); ylabel( 's_1' );
title( 'Original Signals' );
subplot(2,3,4); plot(s2,'r'); ylabel( 's_2' );

% Mix these sources to create two observed signals
x1 = 1.00*s1 - 2.00*s2; % mixed signal 1
x2 = 1.73*s1 - 3.41*s2; % mixed signal 2
subplot(2,3,2); plot(x1); ylabel( 'x_1' ); % plot observed mixed signal 1
title( 'Mixed Signals' );
subplot(2,3,5); plot(x2); ylabel( 'x_2' ); % plot observed mixed signal 2

% Apply ICA using the fastica function
y = fastica([x1;x2]);

% plot the unmixed (reconstructed) signals
subplot(2,3,3); plot(y(1,:),'g'); ylabel( 'y_1' );
title( 'Reconstructed Signals' );
subplot(2,3,6); plot(y(2,:),'g'); ylabel( 'y_2' );

Figure 2.5: Output of Matlab code of ICA applied to sine-wave signals

To motivate the method behind ICA, look at the Matlab code in Listing 2.2, which produces the visual output shown in Figure 2.6 and Figure 2.7. In this case, we take uniformly distributed random number sequences as independent source signals s1 and s2, and we linearly combine them to produce the mixtures x1 and x2. Figure 2.6 shows the marginal and joint distribution of the source signals. This is exactly as you would expect: the marginal distribution is approximately uniform (because we used uniform distributions), and the joint distribution shows no dependence between s1 and s2, because we independently generated the two source signals.

The marginal and joint distribution of the mixtures look very different, as shown in Figure 2.7. In this case, the joint distribution reveals a dependence between the two x-values: learning about one x value gives us information about possible values of the other x value. The key observation here is about the marginal distributions of x1 and x2. Note that these distributions are more Gaussian shaped than the original source distribution. There are good theoretical reasons to expect this result: the Central Limit Theorem, a classical result in probability theory, tells us that the distribution of a sum of independent random variables tends toward a Gaussian distribution (under certain conditions).

That brings us to the key idea behind ICA: "nongaussian is independent." If we assume that the input signals are non-Gaussian (a reasonable assumption for many natural signals), and that mixtures of non-Gaussians will look more Gaussian, then we can try to find linear combinations of the mixture signals that make the resulting outputs (Y) look more non-Gaussian. In other words, by finding weights W that make Y look less Gaussian, we can find the original source signals S. How do we achieve this? There are a number of ways to assess how non-Gaussian a signal is. One is by measuring the kurtosis of a signal.
The kurtosis of a random variable is a measure of the peakedness of its probability distribution. Gaussian random variables are symmetric and have zero (excess) kurtosis. Non-Gaussian random variables (generally) have non-zero kurtosis. The methods behind ICA find the weights W that maximize the absolute kurtosis in order to find one of the independent components. This component is then subtracted from the remaining mixture, and the process of optimizing the weights is repeated to find the remaining independent components.

Exercises

For some of the exercises, you will need the fastica toolbox. Unzip the files into some folder accessible to you and add the folder to the Matlab path.

1. Create some example code to demonstrate that sums of non-Gaussian random variables become more Gaussian. For example, create some random numbers from an exponential distribution (or some other non-Gaussian distribution). Plot a histogram of these random numbers. Now create new random numbers where each random number is the mean of (say) 5 random numbers. Plot a histogram of these new random numbers. Do they look more Gaussian? Check that the kurtosis of the new random numbers is lower.

Listing 2.2: Matlab code to visualize the joint and marginal distributions of source signals and mixed signals.

%% ICA Demo 2
% (file = icademo3.m)
% Plot the marginal distributions of the original and the mixture

% Create two source signals (uniform distributions)
s1 = unifrnd( 0, 1, 1, 1000 );
s2 = unifrnd( 0, 1, 1, 1000 );

% make the signals of equal length
minsize = min( [ length( s1 ) length( s2 ) ] );
s1 = s1( 1:minsize ); s2 = s2( 1:minsize );

% normalize the variance
s1 = s1 / std( s1 ); s2 = s2 / std( s2 );

% Mix these sources to create two observed signals
x1 = 1.00*s1 - 2.00*s2; % mixed signal 1
x2 = 1.73*s1 - 3.41*s2; % mixed signal 2

figure(3); clf; % plot the source signals
scatterhist( s1, s2, 'Marker', '.' );
title( 'Joint and marginal distribution of s1 and s2' );

figure(4); clf; % plot the mixed signals
scatterhist( x1, x2, 'Marker', '.' );
title( 'Joint and marginal distribution of x1 and x2' );

Figure 2.6: Joint and marginal distribution of two uniformly distributed source signals

Figure 2.7: Joint and marginal distribution of linear combinations of two uniformly distributed source signals

2. Adapt the example code in Listing 2.1 to work with sound files. Matlab has some example audio files available (although you should feel free to use your own). For example, the code

s1 = load( 'chirp' ); s1 = s1.y';
s2 = load( 'handel' ); s2 = s2.y';

will load a chirp signal and part of the Hallelujah chorus from Handel. These signals are not of equal length yet, so you'll need to truncate the longer signal to make them of equal length. Check that the reconstructed audio signals sound like the original ones. You can use the soundsc function to play vectors as sounds.

3. Adapt the example code in Listing 2.2 to work with sound files. Check that the marginal distributions of the mixtures are more Gaussian than the original source signals by computing the kurtosis.

4. Is the ICA procedure sensitive to the ordering of the measurements? For example, suppose we randomly scramble the temporal ordering of the sound files in the last exercise, but we apply the same scramble to each sound signal (e.g., the amplitude at time 1 might become the amplitude at time 65). Obviously, the derived independent components will be different, but suppose we unscramble the independent components (e.g., the amplitude at time 65 now becomes the amplitude at time 1): are the results the same? What is the implication of this result?

5. Find some neuroimaging data, such as EEG data from multiple electrodes or fMRI data, and apply ICA to find the independent components. Alternatively, find a set of (grayscale) natural images and apply ICA to the set of images (you will have to convert the two-dimensional image values to a one-dimensional vector of gray-scale values).

Part II

Probabilistic Modeling

Chapter 3

Sampling from Random Variables

Probabilistic models proposed by researchers are often too complicated for analytic approaches. Increasingly, researchers rely on computational, numerical methods when dealing with complex probabilistic models. By using a computational approach, the researcher is freed from making unrealistic assumptions required for some analytic techniques (e.g., normality and independence). The key to most approximation approaches is the ability to sample from distributions. Sampling is needed to predict how a particular model will behave under some set of circumstances, and to find appropriate values for the latent variables ("parameters") when applying models to experimental data. Most computational sampling approaches turn the problem of sampling from complex distributions into subproblems involving simpler sampling distributions. In this chapter, we will illustrate two sampling approaches: the inverse transform method and rejection sampling. These approaches are appropriate mostly for the univariate case, where we are dealing with single-valued outcomes. In the next chapter, we discuss Markov chain Monte Carlo approaches that can operate efficiently with multivariate distributions.

3.1 Standard distributions

Some distributions are used so often that they have become part of a standard set of distributions supported by Matlab. The Matlab Statistics Toolbox supports a large number of probability distributions. Using Matlab, it becomes quite easy to calculate the probability density and cumulative density of these distributions, and to sample random values from them. Table 3.1 lists some of the standard distributions supported by Matlab. The Matlab documentation lists many more distributions that can be simulated with Matlab. Using online resources, it is often easy to find support for a number of other common distributions.
To illustrate how we can use some of these functions, Listing 3.1 shows Matlab code that visualizes the Normal(µ, σ) distribution where µ = 100 and σ = 15. To make things concrete, imagine that this distribution represents the observed variability of IQ coefficients in some population. The code shows how to display the probability density and the cumulative

density. It also shows how to draw random values from this distribution and how to visualize the distribution of these random samples using the hist function. The code produces the output shown in Figure 3.1. Similarly, Figure 3.2 visualizes the discrete Binomial(N, θ) distribution where N = 10 and θ = 0.7. The binomial arises in situations where a researcher counts the number of successes out of a given number of trials. For example, the Binomial(10, 0.7) distribution represents a situation where we have 10 total trials and the probability of success at each trial, θ, equals 0.7.

Table 3.1: Examples of Matlab functions for evaluating probability density, cumulative density and drawing random numbers

Distribution           PDF        CDF        Random Number Generation
Normal                 normpdf    normcdf    normrnd
Uniform (continuous)   unifpdf    unifcdf    unifrnd
Beta                   betapdf    betacdf    betarnd
Exponential            exppdf     expcdf     exprnd
Uniform (discrete)     unidpdf    unidcdf    unidrnd
Binomial               binopdf    binocdf    binornd
Multinomial            mnpdf                 mnrnd
Poisson                poisspdf   poisscdf   poissrnd

Listing 3.1: Matlab code to visualize the Normal distribution.

%% Explore the Normal distribution N( mu, sigma )
mu = 100;    % the mean
sigma = 15;  % the standard deviation
xmin = 70;   % minimum x value for pdf and cdf plot
xmax = 130;  % maximum x value for pdf and cdf plot
n = 100;     % number of points on pdf and cdf plot
k = 10000;   % number of random draws for histogram

% create a set of values ranging from xmin to xmax
x = linspace( xmin, xmax, n );
p = normpdf( x, mu, sigma ); % calculate the pdf
c = normcdf( x, mu, sigma ); % calculate the cdf

figure( 1 ); clf; % create a new figure and clear the contents

subplot( 1,3,1 );
plot( x, p, 'k-' );
xlabel( 'x' ); ylabel( 'pdf' );
title( 'Probability Density Function' );

subplot( 1,3,2 );
plot( x, c, 'k-' );
xlabel( 'x' ); ylabel( 'cdf' );
title( 'Cumulative Density Function' );

% draw k random numbers from a N( mu, sigma ) distribution
y = normrnd( mu, sigma, k, 1 );

subplot( 1,3,3 );
hist( y, 20 );
xlabel( 'x' ); ylabel( 'frequency' );
title( 'Histogram of random values' );

Figure 3.1: Illustration of the Normal(µ, σ) distribution where µ = 100 and σ = 15.

Figure 3.2: Illustration of the Binomial(N, θ) distribution where N = 10 and θ = 0.7.

Exercises

1. Adapt the Matlab program in Listing 3.1 to illustrate the Beta(α, β) distribution where α = 2 and β = 3. Similarly, illustrate the Exponential(λ) distribution.

2. Adapt the Matlab program above to illustrate the Binomial(N, θ) distribution where N = 10 and θ = 0.7. Produce an illustration that looks similar to Figure 3.2.

3. Write a demonstration program to sample 10 values from a Bernoulli(θ) distribution with θ = 0.3. Note that the Bernoulli distribution is one of the simplest discrete distributions to simulate. There are only two possible outcomes, 0 and 1. With probability θ, the outcome is 1, and with probability 1 - θ, the outcome is 0. In other words, p(x = 1) = θ, and p(x = 0) = 1 - θ. This distribution can be used to simulate outcomes in a number of situations, such as head or tail outcomes from a weighted coin, correct/incorrect outcomes from true/false questions, etc. In Matlab, you can simulate the Bernoulli distribution using the binomial distribution with N = 1. However, for the purpose of this exercise, please write the code needed to sample Bernoulli distributed values without making use of the built-in binomial distribution.

4. It is often useful in simulations to ensure that each replication of the simulation gives the exact same result. In Matlab, when drawing random values from distributions, the values are different every time you restart the code. There is a simple way to seed the random number generators to ensure that they produce the same sequence. Write a Matlab script that samples two sets of 10 random values drawn from a uniform distribution between [0,1]. Use the seeding function between the two sampling steps to demonstrate that the two sets of random values are identical. Your Matlab code could use the following line: seed = 1; rng( seed );

5. Suppose we know from previous research that in a given population, IQ coefficients are Normally distributed with a mean of 100 and a standard deviation of 15. Calculate the probability that a randomly drawn person from this population has an IQ greater than 110 but smaller than 130. You can achieve this using one line of Matlab code. What does this look like?

**6. The Dirichlet distribution is currently not supported by Matlab. Can you find a Matlab function, using online resources, that implements sampling from a Dirichlet distribution?
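On the Dirichlet question, one well-known construction (not given in this chapter) draws independent Gamma variables and normalizes them; the helper name drchrnd and its file drchrnd.m below are hypothetical, shown only as a sketch of that construction:

```matlab
% Hypothetical helper (save as drchrnd.m): sample n draws from a
% Dirichlet distribution with parameter vector alpha, using the fact
% that normalized independent Gamma(alpha_i, 1) draws are Dirichlet.
function r = drchrnd( alpha, n )
    k = length( alpha );
    g = gamrnd( repmat( alpha(:)', n, 1 ), 1 ); % independent Gamma(alpha_i,1) draws
    r = g ./ repmat( sum( g, 2 ), 1, k );       % normalize each row to sum to one
end
```

Each row of r = drchrnd( [1 1 1], 5 ) then lies on the probability simplex, i.e., its entries are nonnegative and sum to one.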
3.2 Sampling from non-standard distributions Suppose we wish to sample from a distribution that is not one of the standard distributions that is supported by Matlab. In modeling situations, this situation frequently arises, because a researcher can propose new noise processes or combinations of existing distributions. Computational methods for solving complex sampling problems often rely on sampling distributions that we do know how to sample from efficiently. The random values from these simple distributions can then be transformed or compared to the target distribution. In fact, some of the techniques discussed in this section are used by Matlab internally to sample from distributions such as the Normal and Exponential distributions.

3.2.1 Inverse transform sampling with discrete variables

Inverse transform sampling (also known as the inverse transform method) is a method for generating random numbers from any probability distribution given the inverse of its cumulative distribution function. The idea is to sample uniformly distributed random numbers (between 0 and 1) and then transform these values using the inverse cumulative distribution function. The simplicity of this procedure lies in the fact that the underlying sampling is just based on transformed uniform deviates. This procedure can be used to sample many different kinds of distributions. In fact, this is how Matlab implements many of its random number generators.

It is easiest to illustrate this approach with a discrete distribution where we know the probability of each individual outcome. In this case, the inverse transform method just requires a simple table lookup. To give an example of a non-standard discrete distribution, we use some data from experiments that have looked at how well humans can produce uniform random numbers (e.g., Treisman and Faulkner, 1987). In these experiments, subjects produce a large number of random digits (0, ..., 9) and investigators tabulate the relative frequencies of each random digit produced. As you might suspect, subjects do not always produce uniform distributions. Table 3.2 shows some typical data. Some of the low and the high numbers are underrepresented while some specific digits (e.g., 4) are overrepresented. For some reason, the digits 0 and 9 were never generated by the subject (perhaps because the subject misinterpreted the instructions). In any case, this data is fairly typical and demonstrates that humans are not very good at producing uniformly distributed random numbers.

Table 3.2: Probability of digits observed in a human random digit generation experiment.
The generated digit is represented by X; p(X) and F(X) are the probability mass and cumulative probabilities, respectively. The data were estimated from subject 6, session 1, in the experiment by Treisman and Faulkner (1987).

X   p(X)   F(X)

Suppose we now want to mimic this process and write an algorithm that samples digits

according to the probabilities shown in Table 3.2. Therefore, the program should produce a 4 with probability .2, a 5 with probability .175, etc. For example, the code in Listing 3.2 implements this process using the built-in Matlab function randsample. The code produces the illustration shown in Figure 3.3.

Instead of using built-in functions such as randsample or mnrnd, it is helpful to consider how to implement the underlying sampling algorithm using the inverse transform method. We first need to calculate the cumulative probability distribution. In other words, we need to know the probability that we observe an outcome equal to or smaller than some particular value. If F(X) represents the cumulative function, we need to calculate F(X = x) = p(X <= x). For discrete distributions, this can be done using simple summation. The cumulative probabilities of our example are shown in the right column of Table 3.2.

In the inverse transform algorithm, the idea is to sample uniform random deviates (i.e., random numbers between 0 and 1) and to compare each random number against the table of cumulative probabilities. The first outcome for which the random deviate is smaller than (or is equal to) the associated cumulative probability corresponds to the sampled outcome. Figure 3.4 shows an example with a uniform random deviate of U = 0.8 that leads to a sampled outcome X = 6. This process of repeated sampling of uniform deviates and comparing these to the cumulative distribution forms the basis of the inverse transform method for discrete variables. Note that we are applying an inverse function, because we are doing an inverse table lookup.

Listing 3.2: Matlab code to simulate sampling of random digits.

% Simulate the distribution observed in the
% human random digit generation task

% probabilities for each digit (see Table 3.2)
theta = [ 0.000; ... % digit 0
          ...        % digits 1 through 8, with values as given in Table 3.2
          0.000 ];   % digit 9

% fix the random number generator
seed = 1; rand( 'state', seed );

% let's say we draw K random values
K = 10000;
digitset = 0:9;
Y = randsample( digitset, K, true, theta );

% create a new figure
figure( 1 ); clf;

% Show the histogram of the simulated draws
counts = hist( Y, digitset );
bar( digitset, counts, 'k' );
xlim( [ ] );
xlabel( 'Digit' );
ylabel( 'Frequency' );
title( 'Distribution of simulated draws of human digit generator' );

[Figure 3.3: Histogram of the simulated draws of the human digit generator, as produced by Listing 3.2.]

Exercises

1. Create a Matlab program that implements the inverse transform method for discrete variables. Use it to sample random digits with the probabilities shown in Table 3.2. In order to show that the algorithm is working, sample a large number of random digits and create a histogram. Your program should never sample digits 0 and 9, as they are given zero probability in the table.

** 2 One solution to the previous exercise that does not require any loops is to use the multinomial random number generator mnrnd. Show how to use this function to sample digits according to the probabilities shown in Table 3.2.
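For comparison with Listing 3.2, the table-lookup idea described above can be sketched as follows (a sketch, not the book's own listing; theta is assumed to hold the probabilities from Table 3.2):

```matlab
% Sketch: inverse transform sampling for a discrete distribution over digits 0..9
F = cumsum( theta );   % cumulative probabilities F(X)
K = 10000;             % number of random draws
Y = zeros( K, 1 );
for i = 1:K
    u = rand;  % uniform random deviate between 0 and 1
    % first outcome whose cumulative probability is >= u;
    % subtract 1 because the digits start at 0
    Y(i) = find( u <= F, 1, 'first' ) - 1;
end
```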

[Figure 3.4: Illustration of the inverse transform procedure for generating discrete random variables. The cumulative probabilities F(X) are plotted for each outcome X. If we sample a uniform random number of U = 0.8, then this yields a random value of X = 6.]

** 3 Explain why the algorithm as described above might be inefficient when dealing with skewed probability distributions. [Hint: imagine a situation where the first N − 1 outcomes have zero probability and the last outcome has probability one.] Can you think of a simple change to the algorithm to improve its efficiency?

3.2.2 Inverse transform sampling with continuous variables

The inverse transform sampling approach can also be applied to continuous distributions. Generally, the idea is to draw uniform random deviates and to apply the inverse of the cumulative distribution function to each random deviate. In the following, let F(X) be the cumulative distribution function (CDF) of our target variable X and F⁻¹(X) the inverse of this function, assuming that we can actually calculate this inverse. We wish to draw random values for X. This can be done with the following procedure:

1. Draw U ∼ Uniform(0, 1)
2. Set X = F⁻¹(U)
3. Repeat

Let's illustrate this approach with a simple example. Suppose we want to sample random numbers from the exponential distribution. When λ > 0, the cumulative distribution function is F(x|λ) = 1 − exp(−x/λ). Using some simple algebra, one can find the inverse of this function, which is F⁻¹(u|λ) = −log(1 − u)λ. This leads to the following procedure for sampling random numbers from an Exponential(λ) distribution:

1. Draw U ∼ Uniform(0, 1)
2. Set X = −log(1 − U)λ
3. Repeat

Exercises

1. Implement the inverse transform sampling method for the exponential distribution. Sample a large number of values from this distribution, and show the distribution of these values. Compare the distribution you obtain against the exact distribution as obtained from the PDF of the exponential distribution (use the command exppdf).

** 2 Matlab implements some of its own functions using Matlab code. For example, when you call the exponential random number generator exprnd, Matlab executes a function that is stored in its own internal directories. Please locate the Matlab function exprnd and inspect its contents. How does Matlab implement the sampling from the exponential distribution? Does it use the inverse transform method? Note that the path to this Matlab function will depend on your particular Matlab installation, but it probably looks something like C:\Program Files\MATLAB\R2009B\toolbox\stats\exprnd.m

3.2.3 Rejection sampling

In many cases, it is not possible to apply the inverse transform sampling method because it is difficult to compute the cumulative distribution or its inverse. In this case, there are other options available, such as rejection sampling, and methods using Markov chain Monte Carlo approaches that we will discuss in the next chapter. The main advantage of the rejection sampling method is that it does not require any burn-in period. Instead, all samples obtained during sampling can immediately be used as samples from the target distribution.

One way to illustrate the general idea of rejection sampling (also commonly called the "accept-reject algorithm") is with Figure 3.5. Suppose we wish to draw points uniformly within a circle centered at (0, 0) and with radius 1. At first, it seems quite complicated to directly sample points within this circle in uniform fashion.
However, we can apply rejection sampling by first drawing (x, y) values uniformly from within the square surrounding the circle, and rejecting any samples for which x² + y² > 1. Note that in this procedure we used a very simple proposal distribution, the uniform distribution, as a basis for sampling from a much more complicated distribution. Rejection sampling allows us to generate observations from a distribution that is difficult to sample from but for which we can evaluate the probability of any particular sample. In other words, suppose we have a distribution p(θ), and it is difficult to sample from this distribution directly, but we can evaluate the probability density or mass p(θ) for a particular value of θ.
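The circle example can be sketched in a few lines of Matlab (my own sketch, under the assumption that we want a fixed number n of accepted points):

```matlab
% Sketch: draw n points uniformly from the unit circle by rejection
n = 500;                 % number of accepted points wanted
pts = zeros( n, 2 );
count = 0;
while count < n
    xy = 2 * rand( 1, 2 ) - 1;   % proposal: uniform on the square [-1,1] x [-1,1]
    if sum( xy.^2 ) <= 1         % accept only points that fall inside the circle
        count = count + 1;
        pts( count, : ) = xy;
    end
end
```

Since the circle covers π/4 of the square's area, roughly 79% of the proposals are accepted.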

[Figure 3.5: Sampling points uniformly from the unit circle using rejection sampling (panels A and B).]

The first choice that the researcher needs to make is the proposal distribution. The proposal distribution is a simple distribution q(θ) that we can directly sample from. The idea is to evaluate the probability of the proposed samples under both the proposal distribution and the target distribution, and to reject samples that are unlikely under the target distribution relative to the proposal distribution. Figure 3.6 illustrates the procedure. We first need to find a constant c such that cq(θ) ≥ p(θ) for all possible samples θ. The proposal function q(θ) multiplied by the constant c is known as the comparison distribution and will always lie on top of our target distribution. Finding the constant c might be non-trivial, but let's assume for now that we can do this using some calculus. We now draw a number u from a uniform distribution between [0, cq(θ)]. In other words, this is some point on the line segment between 0 and the height of the comparison distribution evaluated at our proposal θ. We reject the proposal if u > p(θ) and accept it otherwise. If we accept the proposal, the sampled value θ is a draw from the target distribution p(θ). Here is a summary of the computational procedure:

1. Choose a density q(θ) that is easy to sample from
2. Find a constant c such that cq(θ) ≥ p(θ) for all θ
3. Sample a proposal θ from the proposal distribution q(θ)
4. Sample a uniform deviate u from the interval [0, cq(θ)]
5. Reject the proposal if u > p(θ), accept otherwise
6. Repeat steps 3, 4, and 5 until the desired number of samples is reached; each accepted sample is a draw from p(θ)

The key to efficient operation of this algorithm is to have as many samples accepted as possible. This depends crucially on the choice of the proposal distribution. A proposal

[Figure 3.6: Illustration of rejection sampling. The comparison distribution cq(θ) lies above the target distribution p(θ); the particular sample shown in the figure will be rejected.]

distribution that is dissimilar to the target distribution will lead to many rejected samples, slowing the procedure down.

Exercises

1. Suppose we want to sample from a Beta(α, β) distribution where α = 2 and β = 1. This gives the probability density p(x) = 2x for 0 < x < 1. Assume that, for whatever reason, we do not have access to the Beta random number generator that Matlab provides. Instead, the goal is to implement a rejection sampling algorithm in Matlab that samples from this distribution. For this exercise, use a simple uniform proposal distribution (even though this is not a good choice as a proposal distribution). The constant c should be 2 in this case. Visualize the histogram of sampled values and verify that the distribution matches the histogram obtained by using Matlab's betarnd sampling function. What is the percentage of accepted samples? How might we improve the rejection sampler?

** 2 The procedure shown in Figure 3.5 forms the basis for the Box-Muller method for generating Gaussian distributed random variables. We first generate uniform coordinates (x, y) from the unit circle using the rejection sampling procedure that rejects any (x, y) pair with x² + y² > 1. Then, for each pair (x, y), we evaluate the quantities z₁ = x(−2 ln(x² + y²)/(x² + y²))^(1/2) and z₂ = y(−2 ln(x² + y²)/(x² + y²))^(1/2). The values z₁ and z₂ are each Gaussian distributed with zero mean and unit variance. Write a Matlab program that implements this Box-Muller method and verify that the sampled values are Gaussian distributed.
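As one possible sketch of the six-step procedure applied to exercise 1 above (Beta(2, 1) target, uniform proposal, c = 2; the variable names are my own):

```matlab
% Sketch: rejection sampling from p(x) = 2x on (0,1) with a Uniform(0,1) proposal
c = 2;                    % cq(x) = 2 >= p(x) = 2x for all x in (0,1)
K = 10000;                % number of accepted samples wanted
samples = zeros( K, 1 );
accepted = 0;
while accepted < K
    x = rand;             % step 3: proposal from q = Uniform(0,1)
    u = rand * c;         % step 4: uniform deviate on [0, c*q(x)], where q(x) = 1
    if u <= 2 * x         % step 5: accept if u <= p(x)
        accepted = accepted + 1;
        samples( accepted ) = x;
    end
end
```

With this uniform proposal the expected acceptance rate is 1/c, so about half of the proposals are accepted; a proposal closer in shape to p(x) would allow a smaller c and fewer rejections.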

Chapter 4

Markov Chain Monte Carlo

The application of probabilistic models to data often leads to inference problems that require the integration of complex, high-dimensional distributions. Markov chain Monte Carlo (MCMC) is a general computational approach that replaces analytic integration by summation over samples generated from iterative algorithms. Problems that are intractable using analytic approaches often become solvable using some form of MCMC, even with high-dimensional problems. The development of MCMC is arguably the biggest advance in the computational approach to statistics. While MCMC is very much an active research area, there are now some standardized techniques that are widely used. In this chapter, we will discuss two forms of MCMC: Metropolis-Hastings and Gibbs sampling. Before we go into these techniques though, we first need to understand the two main ideas underlying MCMC: Monte Carlo integration, and Markov chains.

4.1 Monte Carlo integration

Many problems in probabilistic inference require the calculation of complex integrals or summations over very large outcome spaces. For example, a frequent problem is to calculate the expectation of a function g(x) for the random variable x (for simplicity, we assume x is a univariate random variable). If x is continuous, the expectation is defined as:

E[g(x)] = ∫ g(x)p(x)dx    (4.1)

In the case of discrete variables, the integral is replaced by summation:

E[g(x)] = Σₓ g(x)p(x)    (4.2)

These expectations arise in many situations where we want to calculate some statistic of a distribution, such as the mean or variance. For example, with g(x) = x, we are calculating the mean of a distribution. Integration or summation using analytic techniques can become quite challenging for certain distributions. For example, the density p(x) might have a


More information

VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA

VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA Csilla Csendes University of Miskolc, Hungary Department of Applied Mathematics ICAM 2010 Probability density functions A random variable X has density

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

Analysis of System Performance IN2072 Chapter M Matlab Tutorial

Analysis of System Performance IN2072 Chapter M Matlab Tutorial Chair for Network Architectures and Services Prof. Carle Department of Computer Science TU München Analysis of System Performance IN2072 Chapter M Matlab Tutorial Dr. Alexander Klein Prof. Dr.-Ing. Georg

More information

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab Monte Carlo Simulation: IEOR E4703 Fall 2004 c 2004 by Martin Haugh Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab 1 Overview of Monte Carlo Simulation 1.1 Why use simulation?

More information

How To Write A Data Analysis

How To Write A Data Analysis Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction

More information

Important Probability Distributions OPRE 6301

Important Probability Distributions OPRE 6301 Important Probability Distributions OPRE 6301 Important Distributions... Certain probability distributions occur with such regularity in real-life applications that they have been given their own names.

More information

Chi Square Tests. Chapter 10. 10.1 Introduction

Chi Square Tests. Chapter 10. 10.1 Introduction Contents 10 Chi Square Tests 703 10.1 Introduction............................ 703 10.2 The Chi Square Distribution.................. 704 10.3 Goodness of Fit Test....................... 709 10.4 Chi Square

More information

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction CA200 Quantitative Analysis for Business Decisions File name: CA200_Section_04A_StatisticsIntroduction Table of Contents 4. Introduction to Statistics... 1 4.1 Overview... 3 4.2 Discrete or continuous

More information

YASAIw.xla A modified version of an open source add in for Excel to provide additional functions for Monte Carlo simulation.

YASAIw.xla A modified version of an open source add in for Excel to provide additional functions for Monte Carlo simulation. YASAIw.xla A modified version of an open source add in for Excel to provide additional functions for Monte Carlo simulation. By Greg Pelletier, Department of Ecology, P.O. Box 47710, Olympia, WA 98504

More information

Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking

A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking 174 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 2, FEBRUARY 2002 A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking M. Sanjeev Arulampalam, Simon Maskell, Neil

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Chapter 9 Monté Carlo Simulation

Chapter 9 Monté Carlo Simulation MGS 3100 Business Analysis Chapter 9 Monté Carlo What Is? A model/process used to duplicate or mimic the real system Types of Models Physical simulation Computer simulation When to Use (Computer) Models?

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010 Simulation Methods Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Simulation Methods 15.450, Fall 2010 1 / 35 Outline 1 Generating Random Numbers 2 Variance Reduction 3 Quasi-Monte

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships

More information

Data Visualization. Christopher Simpkins chris.simpkins@gatech.edu

Data Visualization. Christopher Simpkins chris.simpkins@gatech.edu Data Visualization Christopher Simpkins chris.simpkins@gatech.edu Data Visualization Data visualization is an activity in the exploratory data analysis process in which we try to figure out what story

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

More information

Finite Mathematics Using Microsoft Excel

Finite Mathematics Using Microsoft Excel Overview and examples from Finite Mathematics Using Microsoft Excel Revathi Narasimhan Saint Peter's College An electronic supplement to Finite Mathematics and Its Applications, 6th Ed., by Goldstein,

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

More details on the inputs, functionality, and output can be found below.

More details on the inputs, functionality, and output can be found below. Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

Solutions to Exam in Speech Signal Processing EN2300

Solutions to Exam in Speech Signal Processing EN2300 Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Chapter 4 Lecture Notes

Chapter 4 Lecture Notes Chapter 4 Lecture Notes Random Variables October 27, 2015 1 Section 4.1 Random Variables A random variable is typically a real-valued function defined on the sample space of some experiment. For instance,

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Descriptive Statistics Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Statistics as a Tool for LIS Research Importance of statistics in research

More information

General Sampling Methods

General Sampling Methods General Sampling Methods Reference: Glasserman, 2.2 and 2.3 Claudio Pacati academic year 2016 17 1 Inverse Transform Method Assume U U(0, 1) and let F be the cumulative distribution function of a distribution

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

More information

The Image Deblurring Problem

The Image Deblurring Problem page 1 Chapter 1 The Image Deblurring Problem You cannot depend on your eyes when your imagination is out of focus. Mark Twain When we use a camera, we want the recorded image to be a faithful representation

More information

Beginner s Matlab Tutorial

Beginner s Matlab Tutorial Christopher Lum lum@u.washington.edu Introduction Beginner s Matlab Tutorial This document is designed to act as a tutorial for an individual who has had no prior experience with Matlab. For any questions

More information

Continuous Random Variables

Continuous Random Variables Chapter 5 Continuous Random Variables 5.1 Continuous Random Variables 1 5.1.1 Student Learning Objectives By the end of this chapter, the student should be able to: Recognize and understand continuous

More information

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences. 1 Commands in JMP and Statcrunch Below are a set of commands in JMP and Statcrunch which facilitate a basic statistical analysis. The first part concerns commands in JMP, the second part is for analysis

More information

Math 58. Rumbos Fall 2008 1. Solutions to Review Problems for Exam 2

Math 58. Rumbos Fall 2008 1. Solutions to Review Problems for Exam 2 Math 58. Rumbos Fall 2008 1 Solutions to Review Problems for Exam 2 1. For each of the following scenarios, determine whether the binomial distribution is the appropriate distribution for the random variable

More information

Week 1. Exploratory Data Analysis

Week 1. Exploratory Data Analysis Week 1 Exploratory Data Analysis Practicalities This course ST903 has students from both the MSc in Financial Mathematics and the MSc in Statistics. Two lectures and one seminar/tutorial per week. Exam

More information

Petrel TIPS&TRICKS from SCM

Petrel TIPS&TRICKS from SCM Petrel TIPS&TRICKS from SCM Knowledge Worth Sharing Histograms and SGS Modeling Histograms are used daily for interpretation, quality control, and modeling in Petrel. This TIPS&TRICKS document briefly

More information

How To Test For Significance On A Data Set

How To Test For Significance On A Data Set Non-Parametric Univariate Tests: 1 Sample Sign Test 1 1 SAMPLE SIGN TEST A non-parametric equivalent of the 1 SAMPLE T-TEST. ASSUMPTIONS: Data is non-normally distributed, even after log transforming.

More information