Advanced Matlab: Exploratory Data Analysis and Computational Statistics. Mark Steyvers


Advanced Matlab: Exploratory Data Analysis and Computational Statistics
Mark Steyvers
January 14, 2015

Contents

Part I: Exploratory Data Analysis

1 Basic Data Analysis
  1.1 Organizing and Summarizing Data
  1.2 Visualizing Data

2 Dimensionality Reduction
  2.1 Independent Component Analysis
      Applications of ICA
      How does ICA work?
      Matlab examples

Part II: Probabilistic Modeling

3 Sampling from Random Variables
  3.1 Standard distributions
  3.2 Sampling from non-standard distributions
      Inverse transform sampling with discrete variables
      Inverse transform sampling with continuous variables
      Rejection sampling

4 Markov Chain Monte Carlo
  Monte Carlo integration
  Markov Chains
  Putting it together: Markov chain Monte Carlo
  Metropolis Sampling
  Metropolis-Hastings Sampling
  Metropolis-Hastings for Multivariate Distributions
      Blockwise updating
      Componentwise updating
  Gibbs Sampling

5 Basic concepts in Bayesian Data Analysis
  Parameter Estimation Approaches
      Maximum Likelihood
      Maximum a posteriori
      Posterior Sampling
  Example: Estimating a Weibull distribution

6 Directed Graphical Models
  A Short Review of Probability Theory
  The Burglar Alarm Example
      Conditional probability tables
      Explaining away
      Joint distributions and independence relationships
  Graphical Model Notation
  Example: Consensus Modeling with Gaussian variables

7 Sequential Monte Carlo
  Hidden Markov Models
      Example HMM with discrete outcomes and states
      Viterbi Algorithm
  Bayesian Filtering
  Particle Filters
      Sampling Importance Resampling (SIR)
      Direct Simulation

Note to Students

Exercises

This course book contains a number of exercises in which you are asked to run Matlab code, produce new code, and produce graphical illustrations and answers to questions. The exercises marked with ** are optional exercises that can be skipped when time is limited.

Matlab documentation

It will probably happen many times that you need to find the name of a Matlab function or a description of the input and output variables for a given Matlab function. It is strongly recommended to keep the Matlab documentation running in a separate window for quick consultation. You can access the Matlab documentation by typing doc in the command window. For specific help on a given Matlab function, such as the function fprintf, you can type doc fprintf to get a help screen in the Matlab documentation window, or help fprintf to get a description in the Matlab command window.

Organizing answers to exercises

It is helpful to maintain a document that organizes all the material related to the exercises. Matlab can facilitate part of this organization using the publish option. For example, if you have a Matlab script that produces a figure, you can publish the code as well as the figure produced by the code to a single external document such as a pdf file. You can find the publishing option in the Matlab editor under the publish menu. You can change the publish configuration (look under the file menu of the editor window) to produce pdfs by changing the output file format under the edit configurations window.
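As a small illustration of the publish workflow, publishing can also be invoked directly from the command window; the script name myscript.m below is a placeholder for one of your own exercise scripts:

```matlab
% Publish a script (placeholder file name 'myscript.m') to a PDF document.
% By default, the published output is written to an 'html' subfolder
% of the folder containing the script.
publish( 'myscript.m', 'pdf' );
```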

Part I

Exploratory Data Analysis

Chapter 1

Basic Data Analysis

1.1 Organizing and Summarizing Data

When analyzing any kind of data, it is important to make the data available in an intuitive representation that is easy to change. It is also useful when sharing data with other researchers to package all experimental or modeling data into a single container that can be easily shared and documented. Matlab has two ways to package data into a single variable that can contain many different types of variables. One method is to use structures. Another method that was recently introduced in Matlab is based on tables.

In tables, standard indexing can be used to access individual elements or sets of elements. Suppose that t is a table. The element in the fourth row, second column can be accessed by t(4,2). The result could be a scalar value, a string, a cell array, or whatever type of variable is stored in that element of the table. To access the second row of the table, you can use t(2,:). Similarly, to access the second column of the table, you can use t(:,2). Note that the last two operations return another table as a result. Another way to access tables is by using the names of the columns. Suppose the name of the second column is Gender. To access all the values from this column, you can use t.Gender(:).

Below is some Matlab code to create identical data representations using either a structure or a table. The gray text shows the resulting output produced in the command window. From the example code, it might not be apparent what the relative advantages and disadvantages of structures and tables are. The following exercises will hopefully make it clearer why using the new table format might be advantageous. It will be useful to read the Matlab documentation on structures and tables under /LanguageFundamentals/DataTypes/Structures and /LanguageFundamentals/DataTypes/Tables.

% Create a structure "d" with some example data
d.age = [ ];
d.gender = { 'Male', 'Female', 'Female', 'Male' };
d.id = [ ];

% Show d
disp( d )

% Create a Table "t" with the same information
t = table( [ ], ...
    { 'Male', 'Female', 'Female', 'Male' }, ...
    [ ], ...
    'VariableNames', { 'Age', 'Gender', 'ID' } );

% Show this table
disp( t )

% Copy the Age values to a variable x
x = t.Age

% Extract the second row of the table
row = t( 2, : )

Age: [4x1 double]
Gender: {4x1 cell}
ID: [4x1 double]

Age    Gender    ID
32     Male
       Female
       Female
       Male      445

x =

row =

Age    Gender    ID

24     Female    433

Exercises

1. In this exercise, we will load some sample data into Matlab and represent the data internally in a structure. In Matlab, execute the following command to create the structure d, which contains some sample data about patients:

d = table2struct( readtable('patients.dat'), 'ToScalar', true );

Show in a single Matlab script how to a) calculate the mean age of the males, b) delete the data entries that correspond to smokers, and c) sort the entries according to age.

2. Let's repeat the last exercise but now represent the data internally with a table. In Matlab, execute the following command to create the table t, which contains the same data about patients:

t = readtable('patients.dat');

Show in a single Matlab script how to a) extract the first row of the table, b) extract the numeric values of age from the Age column, c) calculate the mean age of the males, d) delete the data entries that correspond to smokers, and e) sort the entries according to age. What is the advantage of using the table representation?

3. With the table representation of the patient data, use the tabulate function to calculate the frequency distribution of locations. What percentage of patients are located at the VA hospital?

4. Use the crosstab function to calculate the contingency table of Gender by Smoker. How many female smokers are there in the sample?

5. Use the prctile function to calculate the 25th and 75th percentiles of the weights in the sample.

1.2 Visualizing Data

Exercises

For these exercises, we will use data from the Human Connectome Project (HCP). This data is accessible in Excel format from data/hcpdata1.xlsx. In the subset of the HCP data set that we will look at, there are 500 subjects for which the gender, age, height and weight of each individual subject is recorded. Save the Excel file to a local directory that you can access with Matlab.
You can load the data into Matlab using t = readtable('hcpdata1.xlsx');. For these exercises, it will be helpful to read the documentation of the histogram, scatter, and normpdf functions.
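As a hedged starting point for these exercises, the sketch below shows how the histogram options control normalization and bin width, and how a Normal density can be overlaid; the heights variable is simulated stand-in data (drawn from the population distribution mentioned in the exercises), not the actual HCP values:

```matlab
% Stand-in data: 500 simulated heights (inches), for illustration only
heights = normrnd( 66.6, 4.7, 500, 1 );

% Probability-normalized histogram with a fixed bin width
figure;
histogram( heights, 'BinWidth', 5, 'Normalization', 'probability' );
xlabel( 'Height (inches)' ); ylabel( 'Probability' );

% Density-scaled histogram with a theoretical Normal overlay
figure; hold on;
histogram( heights, 'Normalization', 'pdf' );
xs = linspace( 50, 85, 100 );
plot( xs, normpdf( xs, 66.6, 4.7 ), 'k-' );
xlabel( 'Height (inches)' ); ylabel( 'Density' );
```

The same pattern, with the simulated vector replaced by a column extracted from the HCP table, is one way to approach Figures 1.1 and 1.2.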

Figure 1.1: Example visualization of an empirical distribution of two different samples

Figure 1.2: Example visualization of an empirical distribution ("Distribution of Heights", HCP sample, density vs. height in inches) and a theoretical population distribution

Figure 1.3: Example boxplot visualization

1. Recreate Figure 1.1 as best as possible. This figure shows the histogram of weights for males and females. Note that the vertical axis shows probabilities, not counts. The width of the bins is 5 lbs.

2. Recreate Figure 1.2 as best as possible. This figure shows the histogram of heights (regardless of gender). Note that the vertical axis shows density. The figure also has an overlay of the population distribution of adult height. For this population distribution, you can use a Normal distribution with a mean and standard deviation of 66.6 and 4.7 inches, respectively.

3. Recreate Figure 1.3 as best as possible using the boxplot function.

4. Recreate Figure 1.4, upper panel, as best as possible. This figure shows a scatter plot of the heights and weights for each gender.

5. Find some data that you find interesting and visualize it with a custom Matlab figure where you change some of the default parameters.

**6. Recreate Figure 1.4, bottom panel, as best as possible. Note that this figure is a better visualization of the relationship between two variables in case there are several (x,y) pairs that are identical or visually very similar. In this plot, also known as a bubble plot, the bubble size indicates the frequency of encountering an (x,y) pair in a particular region of the space. One way to approach this problem is to use the scatter function and scale the sizes of the markers with the observation frequency. In this particular visualization, the markers were made transparent to help visualize the bubbles for multiple groups.

Figure 1.4: Example visualizations ("Heights and Weights of HCP Subjects", weight vs. height by gender) with scatter plots using the standard scatter function (top) and a custom-made bubble plot function with transparent patch objects (bottom)

Chapter 2

Dimensionality Reduction

An important part of exploratory data analysis is to get an understanding of the structure of the data, especially when a large number of variables or measurements are involved. Modern data sets are often high-dimensional. For example, in neuroimaging studies involving EEG, brain signals are measured at many (often 100+) electrodes on the scalp. In fMRI studies, the BOLD signal is typically measured for over 100K voxels. When analyzing text documents, the raw data might consist of counts of words in different documents, which can lead to extremely large matrices (e.g., how many times is word X mentioned in document Y?). With this many measurements, it is challenging to visualize and understand the raw data. In the absence of any specific theory to analyze the data, it can be very useful to apply dimensionality-reduction techniques. Specifically, the goal might be to find a low-dimensional ("simpler") description of the original high-dimensional data. There are a number of dimensionality reduction techniques. We will discuss two standard approaches: independent component analysis (ICA) and principal component analysis (PCA).

2.1 Independent Component Analysis

Note: the material in this section is based on a tutorial on ICA by Hyvarinen and Oja (2000), which can be found online, and on material from the Computational Statistics Handbook with Matlab by Martinez and Martinez.

The easiest way to understand ICA is to think about the blind-source separation problem. In this context, a number of signals (i.e., measurements or variables) are observed, and each signal is believed to be a linear combination of some unobserved source signals. The goal in blind-source separation is to infer the unobserved source signals without the aid of information about the nature of the signals. Let's give a concrete example through the cocktail-party problem (see Figure 2.1). Suppose you are attending a party with several simultaneous conversations.
Several microphones, located at different places in the room, are simultaneously recording the conversations. Each microphone recording can be considered a linear mixture of each independent conversation. How can we infer what the original conversations were at each location from the observed mixtures?

Figure 2.1: Illustration of the blind-source separation problem

Applications of ICA

There are a number of applications of ICA that are related to blind-source separation. First, ICA has been applied to natural images in order to understand the statistical properties of real-world natural scenes. The independent components extracted from natural scenes turn out to be qualitatively similar to the visual receptive fields found in early visual cortex (see Figure 2.2), suggesting that the visual cortex might be operating in a manner similar to ICA to find the independent structure in visual scenes.

Another application of ICA is in the analysis of resting-state data in fMRI analysis. For example, when subjects are not performing any particular task (they are "resting") while in the scanner, the correlations in the hemodynamic response pattern across brain regions suggest the presence of functional networks that might support a variety of cognitive functions. ICA has been applied to the BOLD activation to find the independent components of brain activation (see Figure 2.3), where each independent component is hypothesized to represent a functional network.

Finally, another application of ICA is to remove artifacts in neuroimaging data. For example, when recording EEG data, eye blinks, muscle activity, and heart noise can significantly contaminate the EEG signal. Figure 2.4 shows a 3-second portion of EEG data across 20 locations. The bottom two time series in the left panel show the electrooculographic (EOG) signal that measures eye blinks. Note that an eye blink occurs around 1.8 seconds. Some of the derived ICA components show a sensitivity to the occurrence of these eye blinks (for example, IC1 and IC2). The EEG signals can be corrected by removing the signals

Figure 2.2: Independent components extracted from natural images

Figure 2.3: Independent components extracted from resting state fMRI (from Storti et al. (2013), Frontiers in Neuroscience)

corresponding to the bad independent components. Therefore, ICA can serve as a preprocessing tool that removes undesirable artifacts from the data.

Figure 2.4: Recorded EEG time series and its ICA component activations, the scalp topographies of four selected components, and the artifact-corrected EEG signals obtained by removing four selected EOG and muscle noise components from the data. From Jung et al. (2000), Psychophysiology, 37.

How does ICA work?

In this section, we will illustrate the ideas behind ICA without any in-depth mathematical treatment. To go back to the problem of blind-source separation, suppose there are two source signals, denoted by s1(t) and s2(t), and two microphones that record the mixture signals x1(t) and x2(t). Let's assume that each of the recorded signals is a linear weighted combination of the source signals. We can then express the relationship between x and s as a linear equation:

x1(t) = a11 s1(t) + a12 s2(t)
x2(t) = a21 s1(t) + a22 s2(t)        (2.1)

Note that a11, a12, a21, and a22 are weighting parameters that determine how the original signals are mixed. The problem now is to estimate the original signals s from just the

observed mixtures x, without knowing the mixing parameters a and having very little knowledge about the nature of the original signals. Independent component analysis (ICA) is one technique to approach this problem. Note that ICA is formulated as a linear model: the original source signals are combined in a linear fashion to obtain the observed signals. We can rewrite the model such that the observed signals are the inputs, and we use weights w11, w12, w21, and w22 to create new outputs y1(t) and y2(t):

y1(t) = w11 x1(t) + w12 x2(t)
y2(t) = w21 x1(t) + w22 x2(t)        (2.2)

We can also write the previous two equations in matrix form:

Y = W X
X = A S        (2.3)

Through linear algebra [1], it can be shown that if we set the weight matrix W to the inverse of the original weight matrix A, then Y = S. This means that there exists a set of weights W that will transform the observed mixtures X into the source signals S. The question now is how to find the appropriate set of weights W. In the examples below, we will first show some demonstrations of ICA with a function that will automatically produce the independent components. We will then give some example code that will illustrate the method behind ICA.

Matlab examples

There are many different toolboxes for ICA. We will use the fastica toolbox because it is simple and fast. Note that when you need to apply ICA to specialized tasks such as EEG artifact removal or resting-state analysis, there are more suitable Matlab packages for these tasks.

To get started, let's give a few demonstrations of ICA in the context of blind source separation. Listing 2.1 shows Matlab code that produces the visual output shown in Figure 2.5. In this example, there are two input signals corresponding to sine waves with different frequencies. These two signals are weighted to produce two mixtures x1 and x2.
These are provided as inputs to the function fastica, which produces the independent components y1 and y2. Note how the reconstructed signals are not quite the same as the original signals. The first independent component (y1) is similar to the second source signal (s2), and the second independent component (y2) is similar to the first source signal (s1), but in both cases the amplitudes are different. This highlights two properties of ICA: the original ordering and variances of the source signals cannot be recovered.

[1] Note that if W = A^(-1), we can write Y = W X = A^(-1) A S = S.
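To make the footnote concrete, here is a small numeric sketch showing that, if the true mixing matrix A were known, applying W = inv(A) to the mixtures recovers the sources exactly; the mixing coefficients below are arbitrary illustrative values, chosen only to make A invertible:

```matlab
% Two source signals stacked as rows of S
tgrid = linspace( 0, 10, 1000 );
S = [ sin(tgrid); cos(3*tgrid) ];

% Arbitrary invertible mixing matrix (illustrative values)
A = [ 1.0 -2.0; 1.7 3.4 ];
X = A * S; % observed mixtures

% With the true mixing matrix known, its inverse unmixes exactly
W = inv( A );
Y = W * X;

disp( max( abs( Y(:) - S(:) ) ) ) % essentially zero, up to floating-point error
```

The whole difficulty of ICA is that A is unknown, so W must be estimated from the statistics of X alone.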

Listing 2.1: Demonstration of ICA applied to mixed sine wave signals.

%% ICA Demo 1: Mix two sine waves and unmix them with fastica
% (file = icademo1.m)

% Create two source signals (sine waves with different frequencies)
s1 = sin(linspace(0,10,1000));
s2 = sin(linspace(0,17,1000)+5);

rng( 1 ); % set the random seed for replicability

% plot the source signals
figure(1); clf;
subplot(2,3,1); plot(s1,'r'); ylabel( 's_1' );
title( 'Original Signals' );
subplot(2,3,4); plot(s2,'r'); ylabel( 's_2' );

% Mix these sources to create two observed signals
x1 = 1.00*s1 - 2.00*s2; % mixed signal 1
x2 = 1.73*s1 - 3.41*s2; % mixed signal 2
subplot(2,3,2); plot(x1); ylabel( 'x_1' ); % plot observed mixed signal 1
title( 'Mixed Signals' );
subplot(2,3,5); plot(x2); ylabel( 'x_2' ); % plot observed mixed signal 2

% Apply ICA using the fastica function
y = fastica([x1;x2]);

% plot the unmixed (reconstructed) signals
subplot(2,3,3); plot(y(1,:),'g'); ylabel( 'y_1' );
title( 'Reconstructed Signals' );
subplot(2,3,6); plot(y(2,:),'g'); ylabel( 'y_2' );

Figure 2.5: Output of Matlab code of ICA applied to sine-wave signals

To motivate the method behind ICA, look at the Matlab code in Listing 2.2, which produces the visual output shown in Figure 2.6 and Figure 2.7. In this case, we take uniformly distributed random number sequences as independent source signals s1 and s2, and we linearly combine them to produce the mixtures x1 and x2. Figure 2.6 shows the marginal and joint distribution of the source signals. This is exactly as you would expect: the marginal distribution is approximately uniform (because we used uniform distributions), and the joint distribution shows no dependence between s1 and s2, because we independently generated the two source signals.

The marginal and joint distribution of the mixtures look very different, as shown in Figure 2.7. In this case, the joint distribution reveals a dependence between the two x-values: learning about one x value gives us information about possible values of the other x value. The key observation here is about the marginal distributions of x1 and x2. Note that these distributions are more Gaussian shaped than the original source distribution. There are good theoretical reasons to expect this result: the Central Limit Theorem, a classical result in probability theory, tells us that the distribution of a sum of independent random variables tends toward a Gaussian distribution (under certain conditions).

That brings us to the key idea behind ICA: "nongaussian is independent." If we assume that the input signals are non-Gaussian (a reasonable assumption for many natural signals), and that mixtures of non-Gaussians will look more Gaussian, then we can try to find linear combinations of the mixture signals that make the resulting outputs (Y) look more non-Gaussian. In other words, by finding weights W that make Y look less Gaussian, we can find the original source signals S. How do we achieve this? There are a number of ways to assess how non-Gaussian a signal is. One is by measuring the kurtosis of a signal.
The kurtosis of a random variable is a measure of the peakedness of its probability distribution. Gaussian random variables are symmetric and have zero (excess) kurtosis. Non-Gaussian random variables (generally) have non-zero kurtosis. The methods behind ICA find the weights W that maximize the absolute kurtosis in order to find one of the independent components. This component is then subtracted from the remaining mixture, and the process of optimizing the weights is repeated to find the remaining independent components.

Exercises

For some of the exercises, you will need the fastica toolbox. Unzip the files into some folder accessible to you and add the folder to the Matlab path.

1. Create some example code to demonstrate that sums of non-Gaussian random variables become more Gaussian. For example, create some random numbers from an exponential distribution (or some other non-Gaussian distribution). Plot a histogram of these random numbers. Now create new random numbers where each random number is the mean of (say) 5 random numbers. Plot a histogram of these new random numbers. Do they look more Gaussian? Check that the kurtosis of the new random numbers is lower.

Listing 2.2: Matlab code to visualize the joint and marginal distributions of source signals and mixed signals.

%% ICA Demo 2
% (file = icademo3.m)
% Plot the marginal distributions of the original and the mixture

% Create two source signals (uniform distributions)
s1 = unifrnd( 0, 1, 1, 1000 );
s2 = unifrnd( 0, 1, 1, 1000 );

% make the signals of equal length
minsize = min( [ length( s1 ) length( s2 ) ] );
s1 = s1( 1:minsize ); s2 = s2( 1:minsize );

% normalize the variance
s1 = s1 / std( s1 ); s2 = s2 / std( s2 );

% Mix these sources to create two observed signals
x1 = 1.00*s1 - 2.00*s2; % mixed signal 1
x2 = 1.73*s1 - 3.41*s2; % mixed signal 2

figure(3); clf; % plot the source signals
scatterhist( s1, s2, 'Marker', '.' );
title( 'Joint and marginal distribution of s1 and s2' );

figure(4); clf; % plot the mixed signals
scatterhist( x1, x2, 'Marker', '.' );
title( 'Joint and marginal distribution of x1 and x2' );

Figure 2.6: Joint and marginal distribution of two uniformly distributed source signals

Figure 2.7: Joint and marginal distribution of linear combinations of two uniformly distributed source signals

2. Adapt the example code in Listing 2.1 to work with sound files. Matlab has some example audio files available (although you should feel free to use your own). For example, the code

s1 = load( 'chirp' ); s1 = s1.y';
s2 = load( 'handel' ); s2 = s2.y';

will load a chirp signal and part of the Hallelujah chorus from Handel. These signals are not of equal length yet, so you'll need to truncate the longer signal to make them of equal length. Check that the reconstructed audio signals sound like the original ones. You can use the soundsc function to play vectors as sounds.

3. Adapt the example code in Listing 2.2 to work with sound files. Check that the marginal distributions of the mixtures are more Gaussian than the original source signals by computing the kurtosis.

4. Is the ICA procedure sensitive to the ordering of the measurements? For example, suppose we randomly scramble the temporal ordering of the sound files in the last exercise, but we apply the same scramble to each sound signal (e.g., the amplitude at time 1 might become the amplitude at time 65). Obviously, the derived independent components will be different, but suppose we unscramble the independent components (e.g., the amplitude at time 65 now becomes the amplitude at time 1): are the results the same? What is the implication of this result?

5. Find some neuroimaging data, such as EEG data from multiple electrodes or fMRI data, and apply ICA to find the independent components. Alternatively, find a set of (grayscale) natural images and apply ICA to the set of images (you will have to convert the two-dimensional image values to a one-dimensional vector of gray-scale values).

Part II

Probabilistic Modeling

Chapter 3

Sampling from Random Variables

Probabilistic models proposed by researchers are often too complicated for analytic approaches. Increasingly, researchers rely on computational, numerical methods when dealing with complex probabilistic models. By using a computational approach, the researcher is freed from making unrealistic assumptions required for some analytic techniques (e.g., normality and independence). The key to most approximation approaches is the ability to sample from distributions. Sampling is needed to predict how a particular model will behave under some set of circumstances, and to find appropriate values for the latent variables ("parameters") when applying models to experimental data. Most computational sampling approaches turn the problem of sampling from complex distributions into subproblems involving simpler sampling distributions. In this chapter, we will illustrate two sampling approaches: the inverse transform method and rejection sampling. These approaches are appropriate mostly for the univariate case, where we are dealing with single-valued outcomes. In the next chapter, we discuss Markov chain Monte Carlo approaches that can operate efficiently with multivariate distributions.

3.1 Standard distributions

Some distributions are used so often that they have become part of a standard set of distributions supported by Matlab. The Matlab Statistics Toolbox supports a large number of probability distributions. Using Matlab, it becomes quite easy to calculate the probability density and cumulative density of these distributions, and to sample random values from them. Table 3.1 lists some of the standard distributions supported by Matlab. The Matlab documentation lists many more distributions that can be simulated with Matlab. Using online resources, it is often easy to find support for a number of other common distributions.
To illustrate how we can use some of these functions, Listing 3.1 shows Matlab code that visualizes the Normal(µ, σ) distribution where µ = 100 and σ = 15. To make things concrete, imagine that this distribution represents the observed variability of IQ coefficients in some population. The code shows how to display the probability density and the cumulative

density. It also shows how to draw random values from this distribution and how to visualize the distribution of these random samples using the hist function. The code produces the output shown in Figure 3.1. Similarly, Figure 3.2 visualizes the discrete Binomial(N, θ) distribution where N = 10 and θ = 0.7. The binomial arises in situations where a researcher counts the number of successes out of a given number of trials. For example, the Binomial(10, 0.7) distribution represents a situation where we have 10 total trials and the probability of success at each trial, θ, equals 0.7.

Table 3.1: Examples of Matlab functions for evaluating probability density, cumulative density and drawing random numbers

Distribution           PDF        CDF        Random Number Generation
Normal                 normpdf    normcdf    normrnd
Uniform (continuous)   unifpdf    unifcdf    unifrnd
Beta                   betapdf    betacdf    betarnd
Exponential            exppdf     expcdf     exprnd
Uniform (discrete)     unidpdf    unidcdf    unidrnd
Binomial               binopdf    binocdf    binornd
Multinomial            mnpdf                 mnrnd
Poisson                poisspdf   poisscdf   poissrnd

Listing 3.1: Matlab code to visualize the Normal distribution.

%% Explore the Normal distribution N( mu, sigma )
mu = 100;    % the mean
sigma = 15;  % the standard deviation
xmin = 70;   % minimum x value for pdf and cdf plot
xmax = 130;  % maximum x value for pdf and cdf plot
n = 100;     % number of points on pdf and cdf plot
k = 10000;   % number of random draws for histogram

% create a set of values ranging from xmin to xmax
x = linspace( xmin, xmax, n );
p = normpdf( x, mu, sigma ); % calculate the pdf
c = normcdf( x, mu, sigma ); % calculate the cdf

figure( 1 ); clf; % create a new figure and clear the contents

subplot( 1,3,1 );
plot( x, p, 'k-' );
xlabel( 'x' ); ylabel( 'pdf' );
title( 'Probability Density Function' );

subplot( 1,3,2 );
plot( x, c, 'k-' );
xlabel( 'x' ); ylabel( 'cdf' );
title( 'Cumulative Density Function' );

% draw k random numbers from a N( mu, sigma ) distribution
y = normrnd( mu, sigma, k, 1 );

subplot( 1,3,3 );
hist( y, 20 );
xlabel( 'x' ); ylabel( 'frequency' );
title( 'Histogram of random values' );

Figure 3.1: Illustration of the Normal(µ, σ) distribution where µ = 100 and σ = 15.

Figure 3.2: Illustration of the Binomial(N, θ) distribution where N = 10 and θ = 0.7.

Exercises

1. Adapt the Matlab program in Listing 3.1 to illustrate the Beta(α, β) distribution where α = 2 and β = 3. Similarly, illustrate the Exponential(λ) distribution.

2. Adapt the Matlab program above to illustrate the Binomial(N, θ) distribution where N = 10 and θ = 0.7. Produce an illustration that looks similar to Figure 3.2.

3. Write a demonstration program to sample 10 values from a Bernoulli(θ) distribution with θ = 0.3. Note that the Bernoulli distribution is one of the simplest discrete distributions to simulate. There are only two possible outcomes, 0 and 1. With probability θ, the outcome is 1, and with probability 1 - θ, the outcome is 0. In other words, p(x = 1) = θ, and p(x = 0) = 1 - θ. This distribution can be used to simulate outcomes in a number of situations, such as head or tail outcomes from a weighted coin, correct/incorrect outcomes from true/false questions, etc. In Matlab, you can simulate the Bernoulli distribution using the binomial distribution with N = 1. However, for the purpose of this exercise, please write the code needed to sample Bernoulli distributed values without making use of the built-in binomial distribution.

4. It is often useful in simulations to ensure that each replication of the simulation gives the exact same result. In Matlab, when drawing random values from distributions, the values are different every time you restart the code. There is a simple way to seed the random number generators to ensure that they produce the same sequence. Write a Matlab script that samples two sets of 10 random values drawn from a uniform distribution between [0,1]. Use the seeding function between the two sampling steps to demonstrate that the two sets of random values are identical. Your Matlab code could use the following line: seed = 1; rng( seed );

5. Suppose we know from previous research that in a given population, IQ coefficients are Normally distributed with a mean of 100 and a standard deviation of 15. Calculate the probability that a randomly drawn person from this population has an IQ greater than 110 but smaller than 130. You can achieve this using one line of Matlab code. What does this look like?

**6. The Dirichlet distribution is currently not supported by Matlab. Can you find a Matlab function, using online resources, that implements sampling from a Dirichlet distribution?
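On the Dirichlet question, one well-known construction (not given in this chapter) draws independent Gamma variables and normalizes them; the helper name drchrnd and its file drchrnd.m below are hypothetical, shown only as a sketch of that construction:

```matlab
% Hypothetical helper (save as drchrnd.m): sample n draws from a
% Dirichlet distribution with parameter vector alpha, using the fact
% that normalized independent Gamma(alpha_i, 1) draws are Dirichlet.
function r = drchrnd( alpha, n )
    k = length( alpha );
    g = gamrnd( repmat( alpha(:)', n, 1 ), 1 ); % independent Gamma(alpha_i,1) draws
    r = g ./ repmat( sum( g, 2 ), 1, k );       % normalize each row to sum to one
end
```

Each row of r = drchrnd( [1 1 1], 5 ) then lies on the probability simplex, i.e., its entries are nonnegative and sum to one.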
3.2 Sampling from non-standard distributions Suppose we wish to sample from a distribution that is not one of the standard distributions that is supported by Matlab. In modeling situations, this situation frequently arises, because a researcher can propose new noise processes or combinations of existing distributions. Computational methods for solving complex sampling problems often rely on sampling distributions that we do know how to sample from efficiently. The random values from these simple distributions can then be transformed or compared to the target distribution. In fact, some of the techniques discussed in this section are used by Matlab internally to sample from distributions such as the Normal and Exponential distributions.

3.2.1 Inverse transform sampling with discrete variables

Inverse transform sampling (also known as the inverse transform method) is a method for generating random numbers from any probability distribution given the inverse of its cumulative distribution function. The idea is to sample uniformly distributed random numbers (between 0 and 1) and then transform these values using the inverse cumulative distribution function. The simplicity of this procedure lies in the fact that the underlying sampling is just based on transformed uniform deviates. This procedure can be used to sample many different kinds of distributions. In fact, this is how Matlab implements many of its random number generators.

It is easiest to illustrate this approach with a discrete distribution where we know the probability of each individual outcome. In this case, the inverse transform method just requires a simple table lookup. To give an example of a non-standard discrete distribution, we use some data from experiments that have looked at how well humans can produce uniform random numbers (e.g., Treisman and Faulkner, 1987). In these experiments, subjects produce a large number of random digits (0, ..., 9) and investigators tabulate the relative frequencies of each random digit produced. As you might suspect, subjects do not always produce uniform distributions. Table 3.2 shows some typical data. Some of the low and the high numbers are underrepresented while some specific digits (e.g., 4) are overrepresented. For some reason, the digits 0 and 9 were never generated by the subject (perhaps because the subject misinterpreted the instructions). In any case, this data is fairly typical and demonstrates that humans are not very good at producing uniformly distributed random numbers.

Table 3.2: Probability of digits observed in a human random digit generation experiment.
The generated digit is represented by X; p(X) and F(X) are the probability mass and cumulative probabilities, respectively. The data were estimated from subject 6, session 1, in the experiment by Treisman and Faulkner (1987).

X   p(X)   F(X)

Suppose we now want to mimic this process and write an algorithm that samples digits

according to the probabilities shown in Table 3.2. Therefore, the program should produce a 4 with probability .2, a 5 with probability .175, etc. For example, the code in Listing 3.2 implements this process using the built-in Matlab function randsample. The code produces the illustration shown in Figure 3.3.

Instead of using built-in functions such as randsample or mnrnd, it is helpful to consider how to implement the underlying sampling algorithm using the inverse transform method. We first need to calculate the cumulative probability distribution. In other words, we need to know the probability that we observe an outcome equal to or smaller than some particular value. If F(X) represents the cumulative function, we need to calculate F(X = x) = p(X <= x). For discrete distributions, this can be done using simple summation. The cumulative probabilities of our example are shown in the right column of Table 3.2.

In the inverse transform algorithm, the idea is to sample uniform random deviates (i.e., random numbers between 0 and 1) and to compare each random number against the table of cumulative probabilities. The first outcome for which the random deviate is smaller than (or is equal to) the associated cumulative probability corresponds to the sampled outcome. Figure 3.4 shows an example with a uniform random deviate of U = 0.8 that leads to a sampled outcome X = 6. This process of repeated sampling of uniform deviates and comparing these to the cumulative distribution forms the basis of the inverse transform method for discrete variables. Note that we are applying an inverse function, because we are doing an inverse table lookup.

Listing 3.2: Matlab code to simulate sampling of random digits.

% Simulate the distribution observed in the
% human random digit generation task

% probabilities for each digit (see Table 3.2)
theta = [ 0.000; ... % digit 0
          ...        % digits 1 through 8, with values as given in Table 3.2
          0.000 ];   % digit 9

% fix the random number generator
seed = 1; rand( 'state', seed );

% let's say we draw K random values
K = 10000;
digitset = 0:9;
Y = randsample( digitset, K, true, theta );

% create a new figure
figure( 1 ); clf;

% Show the histogram of the simulated draws
counts = hist( Y, digitset );
bar( digitset, counts, 'k' );
xlim( [ ] );
xlabel( 'Digit' );
ylabel( 'Frequency' );
title( 'Distribution of simulated draws of human digit generator' );

[Figure 3.3: Histogram of the simulated draws of the human digit generator, as produced by Listing 3.2.]

Exercises

1. Create a Matlab program that implements the inverse transform method for discrete variables. Use it to sample random digits with the probabilities shown in Table 3.2. In order to show that the algorithm is working, sample a large number of random digits and create a histogram. Your program should never sample digits 0 and 9, as they are given zero probability in the table.

** 2 One solution to the previous exercise that does not require any loops is to use the multinomial random number generator mnrnd. Show how to use this function to sample digits according to the probabilities shown in Table 3.2.
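For comparison with Listing 3.2, the table-lookup idea described above can be sketched as follows (a sketch, not the book's own listing; theta is assumed to hold the probabilities from Table 3.2):

```matlab
% Sketch: inverse transform sampling for a discrete distribution over digits 0..9
F = cumsum( theta );   % cumulative probabilities F(X)
K = 10000;             % number of random draws
Y = zeros( K, 1 );
for i = 1:K
    u = rand;  % uniform random deviate between 0 and 1
    % first outcome whose cumulative probability is >= u;
    % subtract 1 because the digits start at 0
    Y(i) = find( u <= F, 1, 'first' ) - 1;
end
```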

[Figure 3.4: Illustration of the inverse transform procedure for generating discrete random variables. The cumulative probabilities F(X) are plotted for each outcome X. If we sample a uniform random number of U = 0.8, then this yields a random value of X = 6.]

** 3 Explain why the algorithm as described above might be inefficient when dealing with skewed probability distributions. [Hint: imagine a situation where the first N − 1 outcomes have zero probability and the last outcome has probability one.] Can you think of a simple change to the algorithm to improve its efficiency?

3.2.2 Inverse transform sampling with continuous variables

The inverse transform sampling approach can also be applied to continuous distributions. Generally, the idea is to draw uniform random deviates and to apply the inverse of the cumulative distribution function to each random deviate. In the following, let F(X) be the cumulative distribution function (CDF) of our target variable X and F⁻¹(X) the inverse of this function, assuming that we can actually calculate this inverse. We wish to draw random values for X. This can be done with the following procedure:

1. Draw U ∼ Uniform(0, 1)
2. Set X = F⁻¹(U)
3. Repeat

Let's illustrate this approach with a simple example. Suppose we want to sample random numbers from the exponential distribution. When λ > 0, the cumulative distribution function is F(x|λ) = 1 − exp(−x/λ). Using some simple algebra, one can find the inverse of this function, which is F⁻¹(u|λ) = −log(1 − u)λ. This leads to the following procedure for sampling random numbers from an Exponential(λ) distribution:

1. Draw U ∼ Uniform(0, 1)
2. Set X = −log(1 − U)λ
3. Repeat

Exercises

1. Implement the inverse transform sampling method for the exponential distribution. Sample a large number of values from this distribution, and show the distribution of these values. Compare the distribution you obtain against the exact distribution as obtained from the PDF of the exponential distribution (use the command exppdf).

** 2 Matlab implements some of its own functions using Matlab code. For example, when you call the exponential random number generator exprnd, Matlab executes a function that is stored in its own internal directories. Please locate the Matlab function exprnd and inspect its contents. How does Matlab implement the sampling from the exponential distribution? Does it use the inverse transform method? Note that the path to this Matlab function will depend on your particular Matlab installation, but it probably looks something like C:\Program Files\MATLAB\R2009B\toolbox\stats\exprnd.m

3.2.3 Rejection sampling

In many cases, it is not possible to apply the inverse transform sampling method because it is difficult to compute the cumulative distribution or its inverse. In this case, there are other options available, such as rejection sampling, and methods using Markov chain Monte Carlo approaches that we will discuss in the next chapter. The main advantage of the rejection sampling method is that it does not require any burn-in period. Instead, all samples obtained during sampling can immediately be used as samples from the target distribution.

One way to illustrate the general idea of rejection sampling (also commonly called the "accept-reject algorithm") is with Figure 3.5. Suppose we wish to draw points uniformly within a circle centered at (0, 0) and with radius 1. At first, it seems quite complicated to directly sample points within this circle in uniform fashion.
However, we can apply rejection sampling by first drawing (x, y) values uniformly from within the square surrounding the circle, and rejecting any samples for which x² + y² > 1. Note that in this procedure we used a very simple proposal distribution, the uniform distribution, as a basis for sampling from a much more complicated distribution. Rejection sampling allows us to generate observations from a distribution that is difficult to sample from but for which we can evaluate the probability of any particular sample. In other words, suppose we have a distribution p(θ), and it is difficult to sample from this distribution directly, but we can evaluate the probability density or mass p(θ) for a particular value of θ.
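The circle example can be sketched in a few lines of Matlab (my own sketch, under the assumption that we want a fixed number n of accepted points):

```matlab
% Sketch: draw n points uniformly from the unit circle by rejection
n = 500;                 % number of accepted points wanted
pts = zeros( n, 2 );
count = 0;
while count < n
    xy = 2 * rand( 1, 2 ) - 1;   % proposal: uniform on the square [-1,1] x [-1,1]
    if sum( xy.^2 ) <= 1         % accept only points that fall inside the circle
        count = count + 1;
        pts( count, : ) = xy;
    end
end
```

Since the circle covers π/4 of the square's area, roughly 79% of the proposals are accepted.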

[Figure 3.5: Sampling points uniformly from the unit circle using rejection sampling (panels A and B).]

The first choice that the researcher needs to make is the proposal distribution. The proposal distribution is a simple distribution q(θ) that we can directly sample from. The idea is to evaluate the probability of the proposed samples under both the proposal distribution and the target distribution, and to reject samples that are unlikely under the target distribution relative to the proposal distribution. Figure 3.6 illustrates the procedure. We first need to find a constant c such that cq(θ) ≥ p(θ) for all possible samples θ. The proposal function q(θ) multiplied by the constant c is known as the comparison distribution and will always lie on top of our target distribution. Finding the constant c might be non-trivial, but let's assume for now that we can do this using some calculus. We now draw a number u from a uniform distribution between [0, cq(θ)]. In other words, this is some point on the line segment between 0 and the height of the comparison distribution evaluated at our proposal θ. We reject the proposal if u > p(θ) and accept it otherwise. If we accept the proposal, the sampled value θ is a draw from the target distribution p(θ). Here is a summary of the computational procedure:

1. Choose a density q(θ) that is easy to sample from
2. Find a constant c such that cq(θ) ≥ p(θ) for all θ
3. Sample a proposal θ from the proposal distribution q(θ)
4. Sample a uniform deviate u from the interval [0, cq(θ)]
5. Reject the proposal if u > p(θ), accept otherwise
6. Repeat steps 3, 4, and 5 until the desired number of samples is reached; each accepted sample is a draw from p(θ)

The key to efficient operation of this algorithm is to have as many samples accepted as possible. This depends crucially on the choice of the proposal distribution. A proposal

[Figure 3.6: Illustration of rejection sampling. The comparison distribution cq(θ) lies above the target distribution p(θ); the particular sample shown in the figure will be rejected.]

distribution that is dissimilar to the target distribution will lead to many rejected samples, slowing the procedure down.

Exercises

1. Suppose we want to sample from a Beta(α, β) distribution where α = 2 and β = 1. This gives the probability density p(x) = 2x for 0 < x < 1. Assume that, for whatever reason, we do not have access to the Beta random number generator that Matlab provides. Instead, the goal is to implement a rejection sampling algorithm in Matlab that samples from this distribution. For this exercise, use a simple uniform proposal distribution (even though this is not a good choice as a proposal distribution). The constant c should be 2 in this case. Visualize the histogram of sampled values and verify that the distribution matches the histogram obtained by using Matlab's betarnd sampling function. What is the percentage of accepted samples? How might we improve the rejection sampler?

** 2 The procedure shown in Figure 3.5 forms the basis for the Box-Muller method for generating Gaussian distributed random variables. We first generate uniform coordinates (x, y) from the unit circle using the rejection sampling procedure that rejects any (x, y) pair with x² + y² > 1. Then, for each pair (x, y), we evaluate the quantities z₁ = x(−2 ln(x² + y²)/(x² + y²))^(1/2) and z₂ = y(−2 ln(x² + y²)/(x² + y²))^(1/2). The values z₁ and z₂ are each Gaussian distributed with zero mean and unit variance. Write a Matlab program that implements this Box-Muller method and verify that the sampled values are Gaussian distributed.
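As one possible sketch of the six-step procedure applied to exercise 1 above (Beta(2, 1) target, uniform proposal, c = 2; the variable names are my own):

```matlab
% Sketch: rejection sampling from p(x) = 2x on (0,1) with a Uniform(0,1) proposal
c = 2;                    % cq(x) = 2 >= p(x) = 2x for all x in (0,1)
K = 10000;                % number of accepted samples wanted
samples = zeros( K, 1 );
accepted = 0;
while accepted < K
    x = rand;             % step 3: proposal from q = Uniform(0,1)
    u = rand * c;         % step 4: uniform deviate on [0, c*q(x)], where q(x) = 1
    if u <= 2 * x         % step 5: accept if u <= p(x)
        accepted = accepted + 1;
        samples( accepted ) = x;
    end
end
```

With this uniform proposal the expected acceptance rate is 1/c, so about half of the proposals are accepted; a proposal closer in shape to p(x) would allow a smaller c and fewer rejections.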

Chapter 4

Markov Chain Monte Carlo

The application of probabilistic models to data often leads to inference problems that require the integration of complex, high-dimensional distributions. Markov chain Monte Carlo (MCMC) is a general computational approach that replaces analytic integration by summation over samples generated from iterative algorithms. Problems that are intractable using analytic approaches often become solvable using some form of MCMC, even with high-dimensional problems. The development of MCMC is arguably the biggest advance in the computational approach to statistics. While MCMC is very much an active research area, there are now some standardized techniques that are widely used. In this chapter, we will discuss two forms of MCMC: Metropolis-Hastings and Gibbs sampling. Before we go into these techniques though, we first need to understand the two main ideas underlying MCMC: Monte Carlo integration, and Markov chains.

4.1 Monte Carlo integration

Many problems in probabilistic inference require the calculation of complex integrals or summations over very large outcome spaces. For example, a frequent problem is to calculate the expectation of a function g(x) for the random variable x (for simplicity, we assume x is a univariate random variable). If x is continuous, the expectation is defined as:

E[g(x)] = ∫ g(x)p(x)dx    (4.1)

In the case of discrete variables, the integral is replaced by summation:

E[g(x)] = Σₓ g(x)p(x)    (4.2)

These expectations arise in many situations where we want to calculate some statistic of a distribution, such as the mean or variance. For example, with g(x) = x, we are calculating the mean of a distribution. Integration or summation using analytic techniques can become quite challenging for certain distributions. For example, the density p(x) might have a


More information

VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA

VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA Csilla Csendes University of Miskolc, Hungary Department of Applied Mathematics ICAM 2010 Probability density functions A random variable X has density

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

Analysis of System Performance IN2072 Chapter M Matlab Tutorial

Analysis of System Performance IN2072 Chapter M Matlab Tutorial Chair for Network Architectures and Services Prof. Carle Department of Computer Science TU München Analysis of System Performance IN2072 Chapter M Matlab Tutorial Dr. Alexander Klein Prof. Dr.-Ing. Georg

More information

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab Monte Carlo Simulation: IEOR E4703 Fall 2004 c 2004 by Martin Haugh Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab 1 Overview of Monte Carlo Simulation 1.1 Why use simulation?

More information

How To Write A Data Analysis

How To Write A Data Analysis Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction

More information

Important Probability Distributions OPRE 6301

Important Probability Distributions OPRE 6301 Important Probability Distributions OPRE 6301 Important Distributions... Certain probability distributions occur with such regularity in real-life applications that they have been given their own names.

More information

Chi Square Tests. Chapter 10. 10.1 Introduction

Chi Square Tests. Chapter 10. 10.1 Introduction Contents 10 Chi Square Tests 703 10.1 Introduction............................ 703 10.2 The Chi Square Distribution.................. 704 10.3 Goodness of Fit Test....................... 709 10.4 Chi Square

More information

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction CA200 Quantitative Analysis for Business Decisions File name: CA200_Section_04A_StatisticsIntroduction Table of Contents 4. Introduction to Statistics... 1 4.1 Overview... 3 4.2 Discrete or continuous

More information

YASAIw.xla A modified version of an open source add in for Excel to provide additional functions for Monte Carlo simulation.

YASAIw.xla A modified version of an open source add in for Excel to provide additional functions for Monte Carlo simulation. YASAIw.xla A modified version of an open source add in for Excel to provide additional functions for Monte Carlo simulation. By Greg Pelletier, Department of Ecology, P.O. Box 47710, Olympia, WA 98504

More information

Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking

A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking 174 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 2, FEBRUARY 2002 A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking M. Sanjeev Arulampalam, Simon Maskell, Neil

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

More information

Chapter 9 Monté Carlo Simulation

Chapter 9 Monté Carlo Simulation MGS 3100 Business Analysis Chapter 9 Monté Carlo What Is? A model/process used to duplicate or mimic the real system Types of Models Physical simulation Computer simulation When to Use (Computer) Models?

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010 Simulation Methods Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Simulation Methods 15.450, Fall 2010 1 / 35 Outline 1 Generating Random Numbers 2 Variance Reduction 3 Quasi-Monte

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Examples: Regression And Path Analysis CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS Regression analysis with univariate or multivariate dependent variables is a standard procedure for modeling relationships

More information

Data Visualization. Christopher Simpkins chris.simpkins@gatech.edu

Data Visualization. Christopher Simpkins chris.simpkins@gatech.edu Data Visualization Christopher Simpkins chris.simpkins@gatech.edu Data Visualization Data visualization is an activity in the exploratory data analysis process in which we try to figure out what story

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA We Can Early Learning Curriculum PreK Grades 8 12 INSIDE ALGEBRA, GRADES 8 12 CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA April 2016 www.voyagersopris.com Mathematical

More information

Finite Mathematics Using Microsoft Excel

Finite Mathematics Using Microsoft Excel Overview and examples from Finite Mathematics Using Microsoft Excel Revathi Narasimhan Saint Peter's College An electronic supplement to Finite Mathematics and Its Applications, 6th Ed., by Goldstein,

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

More details on the inputs, functionality, and output can be found below.

More details on the inputs, functionality, and output can be found below. Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

Solutions to Exam in Speech Signal Processing EN2300

Solutions to Exam in Speech Signal Processing EN2300 Solutions to Exam in Speech Signal Processing EN23 Date: Thursday, Dec 2, 8: 3: Place: Allowed: Grades: Language: Solutions: Q34, Q36 Beta Math Handbook (or corresponding), calculator with empty memory.

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Chapter 4 Lecture Notes

Chapter 4 Lecture Notes Chapter 4 Lecture Notes Random Variables October 27, 2015 1 Section 4.1 Random Variables A random variable is typically a real-valued function defined on the sample space of some experiment. For instance,

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Descriptive Statistics Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Statistics as a Tool for LIS Research Importance of statistics in research

More information

General Sampling Methods

General Sampling Methods General Sampling Methods Reference: Glasserman, 2.2 and 2.3 Claudio Pacati academic year 2016 17 1 Inverse Transform Method Assume U U(0, 1) and let F be the cumulative distribution function of a distribution

More information

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

More information

Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

More information

The Image Deblurring Problem

The Image Deblurring Problem page 1 Chapter 1 The Image Deblurring Problem You cannot depend on your eyes when your imagination is out of focus. Mark Twain When we use a camera, we want the recorded image to be a faithful representation

More information

Beginner s Matlab Tutorial

Beginner s Matlab Tutorial Christopher Lum lum@u.washington.edu Introduction Beginner s Matlab Tutorial This document is designed to act as a tutorial for an individual who has had no prior experience with Matlab. For any questions

More information

Continuous Random Variables

Continuous Random Variables Chapter 5 Continuous Random Variables 5.1 Continuous Random Variables 1 5.1.1 Student Learning Objectives By the end of this chapter, the student should be able to: Recognize and understand continuous

More information

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences. 1 Commands in JMP and Statcrunch Below are a set of commands in JMP and Statcrunch which facilitate a basic statistical analysis. The first part concerns commands in JMP, the second part is for analysis

More information

Math 58. Rumbos Fall 2008 1. Solutions to Review Problems for Exam 2

Math 58. Rumbos Fall 2008 1. Solutions to Review Problems for Exam 2 Math 58. Rumbos Fall 2008 1 Solutions to Review Problems for Exam 2 1. For each of the following scenarios, determine whether the binomial distribution is the appropriate distribution for the random variable

More information

Week 1. Exploratory Data Analysis

Week 1. Exploratory Data Analysis Week 1 Exploratory Data Analysis Practicalities This course ST903 has students from both the MSc in Financial Mathematics and the MSc in Statistics. Two lectures and one seminar/tutorial per week. Exam

More information

Petrel TIPS&TRICKS from SCM

Petrel TIPS&TRICKS from SCM Petrel TIPS&TRICKS from SCM Knowledge Worth Sharing Histograms and SGS Modeling Histograms are used daily for interpretation, quality control, and modeling in Petrel. This TIPS&TRICKS document briefly

More information

How To Test For Significance On A Data Set

How To Test For Significance On A Data Set Non-Parametric Univariate Tests: 1 Sample Sign Test 1 1 SAMPLE SIGN TEST A non-parametric equivalent of the 1 SAMPLE T-TEST. ASSUMPTIONS: Data is non-normally distributed, even after log transforming.

More information