Bayesian Dark Knowledge
Anoop Korattikara, Vivek Rathod, Kevin Murphy (Google Research) and Max Welling (University of Amsterdam)

Abstract

We consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data and/or where we need accurate posterior predictive densities $p(y \mid x, D)$, e.g., for applications involving bandits or active learning. One simple approach is to use an online Monte Carlo method such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such a method needs to store many copies of the parameters (which wastes memory) and needs to make predictions using many versions of the model (which wastes time). We describe a method for distilling a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [HLA15] and an approach based on variational Bayes [BCKW15]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.

1 Introduction

Deep neural networks (DNNs) have recently been achieving state-of-the-art results in many fields. However, their predictions are often overconfident, which is a problem in applications such as active learning, reinforcement learning (including bandits), and classifier fusion, all of which rely on good estimates of uncertainty.

A principled way to tackle this problem is to use Bayesian inference. Specifically, we first compute the posterior distribution over the model parameters, $p(\theta \mid D_N) \propto p(\theta) \prod_{i=1}^{N} p(y_i \mid x_i, \theta)$, where $D_N = \{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathcal{X}^D$ is the $i$-th input (where $D$ is the number of features), and $y_i \in \mathcal{Y}$ is the $i$-th output. Then we compute the posterior predictive distribution, $p(y \mid x, D_N) = \int p(y \mid x, \theta)\, p(\theta \mid D_N)\, d\theta$, for each test point $x$.

For reasons of computational speed, it is common to approximate the posterior distribution by a point estimate such as the MAP estimate, $\hat{\theta} = \operatorname{argmax}_{\theta}\, p(\theta \mid D_N)$. When $N$ is large, we often use stochastic gradient descent (SGD) to compute $\hat{\theta}$. Finally, we make a plug-in approximation to the predictive distribution: $p(y \mid x, D_N) \approx p(y \mid x, \hat{\theta})$. Unfortunately, this loses most of the benefits of the Bayesian approach, since uncertainty in the parameters (which induces uncertainty in the predictions) is ignored.

Various ways of approximating $p(\theta \mid D_N)$ (and hence $p(y \mid x, D_N)$) more accurately have been developed. Recently, [HLA15] proposed a method called probabilistic backpropagation (PBP) based on an online version of expectation propagation (EP), i.e., using repeated assumed density filtering (ADF), where the posterior is approximated as a product of univariate Gaussians, one per parameter: $p(\theta \mid D_N) \approx q(\theta) = \prod_i \mathcal{N}(\theta_i \mid m_i, v_i)$.

An alternative to EP is variational Bayes (VB), where we optimize a lower bound on the marginal likelihood. [Gra11] presented a (biased) Monte Carlo estimate of this lower bound and applied his method, called variational inference (VI), to inferring the neural network weights.
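The difference between the plug-in approximation and the Monte Carlo approximation of the posterior predictive can be made concrete with a short sketch. This is an illustration only; `model_predict` and the collection `thetas` of posterior samples are hypothetical stand-ins for whatever network and sampler are actually used.

```python
import numpy as np

def plugin_predictive(model_predict, theta_hat, x):
    # Plug-in approximation: p(y | x, D) ~= p(y | x, theta_hat).
    # All parameter uncertainty is ignored.
    return model_predict(x, theta_hat)

def monte_carlo_predictive(model_predict, thetas, x):
    # Monte Carlo approximation: p(y | x, D) ~= (1/S) * sum_s p(y | x, theta_s).
    # This requires storing and evaluating all S parameter samples.
    probs = np.stack([model_predict(x, theta) for theta in thetas])
    return probs.mean(axis=0)
```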
More recently, [BCKW15] proposed an approach called Bayes by Backprop (BBB), which extends the VI method with an unbiased MC estimate of the lower bound based on the reparameterization trick of [KW14, RMW14]. In both [Gra11] and [BCKW15], the posterior is approximated by a product of univariate Gaussians.

Although EP and VB scale well with data size (since they use online learning), there are several problems with these methods: (1) they can give poor approximations when the posterior $p(\theta \mid D_N)$ does not factorize, or when it has multi-modality or skew; (2) at test time, computing the predictive density $p(y \mid x, D_N)$ can be much slower than using the plug-in approximation, because of the need to integrate out the parameters; (3) they need double the memory of a standard plug-in method (to store the mean and variance of each parameter), which can be problematic in memory-limited settings such as mobile phones; (4) they can be quite complicated to derive and implement.

A common alternative to EP and VB is to use MCMC methods to approximate $p(\theta \mid D_N)$. Traditional MCMC methods are batch algorithms that scale poorly with dataset size. However, a recently devised method called stochastic gradient Langevin dynamics (SGLD) [WT11] can draw samples approximately from the posterior in an online fashion, just as SGD updates a point estimate of the parameters online. Various extensions of SGLD have been proposed, including stochastic gradient Hamiltonian Monte Carlo (SGHMC) [CFG14], the stochastic gradient Nosé-Hoover thermostat (SG-NHT) [DFB+14] (which improves upon SGHMC), stochastic gradient Fisher scoring (SGFS) [AKW12] (which uses second-order information), stochastic gradient Riemannian Langevin dynamics [PT13], distributed SGLD [ASW14], etc. However, in this paper we just use vanilla SGLD [WT11] (see Footnote 1).

All of these MCMC methods (whether batch or online) produce a Monte Carlo approximation to the posterior, $q(\theta) = \frac{1}{S} \sum_{s=1}^{S} \delta(\theta - \theta^s)$, where $S$ is the number of samples. Such an approximation can be more accurate than that produced by EP or VB, and the method is much easier to implement (for SGLD, you essentially just add Gaussian noise to your SGD updates). However, at test time, things are $S$ times slower than using a plug-in estimate, since we need to compute $q(y \mid x) = \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, \theta^s)$, and the memory requirements are $S$ times bigger, since we need to store the $\theta^s$. (For our largest experiment, our DNN has 500k parameters, so we can only afford to store a single sample.)

In this paper, we propose to train a parametric model $S(y \mid x, w)$ to approximate the Monte Carlo posterior predictive distribution $q(y \mid x)$, in order to gain the benefits of the Bayesian approach while paying only the run-time cost of the plug-in method. Following [HVD14], we call $q(y \mid x)$ the teacher and $S(y \mid x, w)$ the student. We use SGLD (see Footnote 2) to estimate $q(\theta)$, and hence $q(y \mid x)$, online; we simultaneously train the student online to minimize $\mathrm{KL}(q(y \mid x) \,\|\, S(y \mid x, w))$. We give the details in Section 2.

Similar ideas have been proposed in the past. In particular, [SG05] also trained a parametric student model to approximate a Monte Carlo teacher. However, they used batch training, and they used mixture models for the student. By contrast, we use online training (and can thus handle larger datasets), and we use deep neural networks for the student.
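To make the SGLD update concrete: it is an SGD step on the (minibatch-rescaled) log posterior with Gaussian noise of variance equal to the step size added in. The sketch below is a minimal illustration, not the implementation used in this paper; `grad_log_prior` and `grad_log_lik` are assumed, user-supplied functions returning the gradient of the log prior and of the per-example log likelihood.

```python
import numpy as np

def sgld_step(theta, minibatch, eta, N, grad_log_prior, grad_log_lik):
    """One SGLD update: a noisy SGD step on the log posterior.

    theta     : current parameter vector (1-d numpy array)
    minibatch : list of (x, y) pairs of size M
    eta       : step size
    N         : total number of training points, used to rescale the minibatch gradient
    """
    M = len(minibatch)
    grad = grad_log_prior(theta)
    grad = grad + (N / M) * sum(grad_log_lik(theta, x, y) for x, y in minibatch)
    # Injected Gaussian noise with variance eta turns SGD into approximate posterior sampling.
    noise = np.random.normal(0.0, np.sqrt(eta), size=theta.shape)
    return theta + 0.5 * eta * grad + noise
```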
[HVD14] also trained a student neural network to emulate the predictions of a (larger) teacher network (a process they call distillation), extending earlier work of [BCNM06], which approximated an ensemble of classifiers by a single one. The key difference from our work is that our teacher is generated using MCMC, and our goal is not just to improve classification accuracy, but also to get reliable probabilistic predictions, especially away from the training data. [HVD14] coined the term "dark knowledge" for the information that is hidden inside the teacher network and that can then be distilled into the student. We therefore call our approach Bayesian dark knowledge.

Footnote 1: We did some preliminary experiments with SG-NHT for fitting an MLP to MNIST data, but the results were not much better than SGLD.

Footnote 2: Note that SGLD is an approximate sampling algorithm and introduces a slight bias in the predictions of the teacher and student networks. If required, we can replace SGLD with an exact MCMC method (e.g., HMC) to get more accurate results at the expense of more training time.
In summary, our contributions are as follows. First, we show how to combine online MCMC methods with model distillation to obtain a simple, scalable approach to Bayesian inference for the parameters of neural networks (and other kinds of models). Second, we show that our probabilistic predictions lead to improved log-likelihood scores on the test set compared to SGD and to the recently proposed EP and VB approaches.

2 Methods

Our goal is to train a student neural network (SNN) to approximate the Bayesian predictive distribution of the teacher, which is a Monte Carlo ensemble of teacher neural networks (TNN). If we denote the predictions of the teacher by $p(y \mid x, D_N)$ and the parameters of the student network by $w$, our objective becomes

$$\mathcal{L}(w \mid x) = \mathrm{KL}\big(p(y \mid x, D_N) \,\|\, S(y \mid x, w)\big) = -\mathbb{E}_{p(y \mid x, D_N)}\big[\log S(y \mid x, w)\big] + \mathrm{const}$$
$$= -\int \left[ \int p(y \mid x, \theta)\, p(\theta \mid D_N)\, d\theta \right] \log S(y \mid x, w)\, dy = -\int p(\theta \mid D_N)\, \mathbb{E}_{p(y \mid x, \theta)}\big[\log S(y \mid x, w)\big]\, d\theta \qquad (1)$$

Unfortunately, computing this integral is not analytically tractable. However, we can approximate it by Monte Carlo:

$$\hat{\mathcal{L}}(w \mid x) = -\frac{1}{|\Theta|} \sum_{\theta^s \in \Theta} \mathbb{E}_{p(y \mid x, \theta^s)}\big[\log S(y \mid x, w)\big] \qquad (2)$$

where $\Theta$ is a set of samples from $p(\theta \mid D_N)$.

To make this a function of $w$ alone, we need to integrate out $x$. For this, we need a dataset on which to train the student network, which we will denote by $D'$. Points in this dataset do not need ground-truth labels; instead, the labels (which will be probability distributions) are provided by the teacher. The choice of student data controls the domain over which the student will make accurate predictions. For low-dimensional problems (such as in Section 3.1), we can sample the input domain uniformly. For higher-dimensional problems, we can sample near the training data, for example by perturbing the inputs slightly. In any case, we compute a Monte Carlo approximation to the loss as follows:

$$\hat{\mathcal{L}}(w) = \int p(x)\, \mathcal{L}(w \mid x)\, dx \approx \frac{1}{|D'|} \sum_{x' \in D'} \mathcal{L}(w \mid x') \approx -\frac{1}{|\Theta|}\,\frac{1}{|D'|} \sum_{x' \in D'} \sum_{\theta^s \in \Theta} \mathbb{E}_{p(y \mid x', \theta^s)}\big[\log S(y \mid x', w)\big] \qquad (3)$$

It can take a lot of memory to pre-compute and store the set of parameter samples $\Theta$ and the set of data samples $D'$, so in practice we use the stochastic algorithm shown in Algorithm 1, which uses a single posterior sample $\theta^s$ and a minibatch of $x'$ at each step.

Algorithm 1: Distilled SGLD
Input: $D_N = \{(x_i, y_i)\}_{i=1}^{N}$, minibatch size $M$, number of iterations $T$, teacher learning schedule $\eta_t$, student learning schedule $\rho_t$, teacher prior $\lambda$, student prior $\gamma$
for $t = 1 : T$ do
  // Train teacher (SGLD step)
  Sample minibatch indices $S \subset [1, N]$ of size $M$
  Sample $z_t \sim \mathcal{N}(0, \eta_t I)$
  Update $\theta_{t+1} := \theta_t + \frac{\eta_t}{2}\Big(\nabla_\theta \log p(\theta \mid \lambda) + \frac{N}{M} \sum_{i \in S} \nabla_\theta \log p(y_i \mid x_i, \theta)\Big) + z_t$
  // Train student (SGD step)
  Sample $D'$ of size $M$ from the student data generator
  Update $w_{t+1} := w_t - \rho_t \Big(\frac{1}{M} \sum_{x' \in D'} \nabla_w \hat{\mathcal{L}}(w, \theta_{t+1} \mid x') + \gamma w_t\Big)$
end for

The hyper-parameters $\lambda$ and $\gamma$ in Algorithm 1 control the strength of the priors for the teacher and student networks. We use simple spherical Gaussian priors (equivalent to $L_2$ regularization) and set the precision (strength) of these priors by cross-validation. Typically $\lambda \gg \gamma$, since the student gets to see more data than the teacher. This is true for two reasons: first, the teacher is trained to predict a single label per input, whereas the student is trained to predict a full distribution, which contains more information (as argued in [HVD14]); second, the teacher makes multiple passes over the same training data, whereas the student sees freshly generated data $D'$ at each step.
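For concreteness, the following PyTorch-style sketch shows one way the loop in Algorithm 1 could be implemented for the classification case (using the cross-entropy distillation loss described in Section 2.1). It is an illustration under assumed interfaces, not the authors' Torch implementation: `teacher`, `student`, `sample_minibatch`, and `student_data_generator` are hypothetical objects supplied by the user, and constant step sizes stand in for the schedules $\eta_t$, $\rho_t$.

```python
import torch

def distilled_sgld(teacher, student, sample_minibatch, student_data_generator,
                   N, T, eta, rho, lam, gamma):
    """Sketch of Algorithm 1 (Distilled SGLD) for classification.

    teacher, student       : torch.nn.Module classifiers returning logits
    sample_minibatch       : () -> (x, y) tensors of size M drawn from D_N
    student_data_generator : () -> x' tensor of size M (unlabelled)
    N                      : size of the training set D_N
    eta, rho               : teacher / student step sizes (constant here for simplicity)
    lam, gamma             : teacher / student prior precisions
    """
    # weight_decay=gamma implements the + gamma * w_t term of the student update.
    opt_student = torch.optim.SGD(student.parameters(), lr=rho, weight_decay=gamma)
    for t in range(T):
        # ---- Teacher: one SGLD step on the log posterior ----
        x, y = sample_minibatch()
        M = x.shape[0]
        loglik = -torch.nn.functional.cross_entropy(teacher(x), y, reduction='sum')
        logprior = -0.5 * lam * sum((p ** 2).sum() for p in teacher.parameters())
        teacher.zero_grad()
        (logprior + (N / M) * loglik).backward()
        with torch.no_grad():
            for p in teacher.parameters():
                noise = torch.randn_like(p) * (eta ** 0.5)
                p += 0.5 * eta * p.grad + noise

        # ---- Student: one SGD step on the distillation loss (Eqn. 4) ----
        x_prime = student_data_generator()
        with torch.no_grad():
            teacher_probs = torch.softmax(teacher(x_prime), dim=1)
        log_student = torch.log_softmax(student(x_prime), dim=1)
        loss = -(teacher_probs * log_student).sum(dim=1).mean()
        opt_student.zero_grad()
        loss.backward()
        opt_student.step()
```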
2.1 Classification

For classification problems, each teacher network $\theta^s$ models the observations using a standard softmax model, $p(y = k \mid x, \theta^s)$. We want to approximate this using a student network that also has a softmax output, $S(y = k \mid x, w)$. Hence, from Eqn. 2, our loss function estimate is the standard cross-entropy loss:

$$\hat{\mathcal{L}}(w \mid \theta^s, x) = -\sum_{k=1}^{K} p(y = k \mid x, \theta^s)\, \log S(y = k \mid x, w) \qquad (4)$$

The student network outputs $\beta_k(x, w) = \log S(y = k \mid x, w)$. To estimate the gradient w.r.t. $w$, we just have to compute the gradients w.r.t. $\beta$ and back-propagate them through the network. These gradients are given by $\partial \hat{\mathcal{L}}(w, \theta^s \mid x) / \partial \beta_k(x, w) = -p(y = k \mid x, \theta^s)$.

2.2 Regression

In regression, the observations are modeled as $p(y_i \mid x_i, \theta) = \mathcal{N}(y_i \mid f(x_i \mid \theta), \lambda_n^{-1})$, where $f(x \mid \theta)$ is the prediction of the TNN and $\lambda_n$ is the noise precision. We want to approximate the predictive distribution as $p(y \mid x, D_N) \approx S(y \mid x, w) = \mathcal{N}(y \mid \mu(x, w), e^{\alpha(x, w)})$. We therefore train a student network to output the parameters of the approximating distribution, $\mu(x, w)$ and $\alpha(x, w)$; note that this is twice the number of outputs of the teacher network, since we want to capture the (data-dependent) variance (see Footnote 3). We use $e^{\alpha(x, w)}$ instead of directly predicting the variance $\sigma^2(x \mid w)$ to avoid dealing with positivity constraints during training. To train the SNN, we minimize the objective defined in Eqn. 2:

$$\hat{\mathcal{L}}(w \mid \theta^s, x) = -\mathbb{E}_{p(y \mid x, \theta^s)}\big[\log \mathcal{N}(y \mid \mu(x, w), e^{\alpha(x, w)})\big] = \frac{1}{2}\,\mathbb{E}_{p(y \mid x, \theta^s)}\Big[\alpha(x, w) + e^{-\alpha(x, w)} \big(y - \mu(x, w)\big)^2\Big] + \mathrm{const}$$
$$= \frac{1}{2}\Big[\alpha(x, w) + e^{-\alpha(x, w)}\Big(\big(f(x \mid \theta^s) - \mu(x, w)\big)^2 + \lambda_n^{-1}\Big)\Big] + \mathrm{const}$$

Now, to estimate $\nabla_w \hat{\mathcal{L}}(w, \theta^s \mid x)$, we just have to compute $\partial \hat{\mathcal{L}} / \partial \mu(x, w)$ and $\partial \hat{\mathcal{L}} / \partial \alpha(x, w)$ and back-propagate them through the network. These gradients are:

$$\frac{\partial \hat{\mathcal{L}}(w, \theta^s \mid x)}{\partial \mu(x, w)} = e^{-\alpha(x, w)}\big(\mu(x, w) - f(x \mid \theta^s)\big) \qquad (5)$$

$$\frac{\partial \hat{\mathcal{L}}(w, \theta^s \mid x)}{\partial \alpha(x, w)} = \frac{1}{2}\Big[1 - e^{-\alpha(x, w)}\Big(\big(f(x \mid \theta^s) - \mu(x, w)\big)^2 + \lambda_n^{-1}\Big)\Big] \qquad (6)$$

Footnote 3: This is not necessary in the classification case, since the softmax distribution already captures uncertainty.
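For the regression case, the expectation in Eqn. 2 is available in closed form, so the student loss can be coded directly and differentiated automatically, reproducing the gradients in Eqns. 5 and 6. The sketch below is illustrative only; `student` is assumed to return the pair $(\mu(x, w), \alpha(x, w))$, and `f_teacher` holds the teacher predictions $f(x \mid \theta^s)$.

```python
import torch

def regression_distillation_loss(student, x, f_teacher, noise_precision):
    # Expected negative student log-likelihood under the teacher's predictive
    # N(y | f_teacher, 1/noise_precision), up to an additive constant.
    mu, alpha = student(x)                      # student predicts mean and log-variance
    sq = (f_teacher - mu) ** 2 + 1.0 / noise_precision
    loss = 0.5 * (alpha + torch.exp(-alpha) * sq)
    return loss.mean()
```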
3 Experimental results

In this section, we compare SGLD and distilled SGLD with other approximate inference methods, including the plug-in approximation using SGD, the PBP approach of [HLA15], the BBB approach of [BCKW15], and Hamiltonian Monte Carlo (HMC) [Nea11], which is considered the gold standard for MCMC for neural networks. We implemented SGD and SGLD using the Torch library (torch.ch). For HMC, we used Stan (mc-stan.org). We perform this comparison on various classification and regression problems, as summarized in Table 1 (see also Footnote 4).

Dataset          N     D    Y             PBP  BBB  HMC
ToyClass         20    2    {0, 1}        N    N    Y
MNIST            60k   784  {0, ..., 9}   N    Y    N
ToyReg           10    1    R             Y    N    Y
Boston Housing   506   13   R             Y    N    N

Table 1: Summary of our experimental configurations.

Figure 1: Posterior predictive density for various methods on the toy 2d dataset. (a) SGD (plug-in) using the 2-10-2 network. (b) HMC using 20k samples. (c) SGLD using 1k samples. (d-f) Distilled SGLD using student networks of increasing size.

Footnote 4: Ideally, we would apply all methods to all datasets, to enable a proper comparison. Unfortunately, this was not possible, for various reasons. First, the open-source code for the EP approach only supports regression, so we could not evaluate it on classification problems. Second, we were not able to run the BBB code, so we just quote performance numbers from their paper [BCKW15]. Third, HMC is too slow to run on large problems, so we applied it only to the small toy problems. Nevertheless, our experiments show that our methods compare favorably to these other methods.

3.1 Toy 2d classification problem

We start with a toy 2d binary classification problem, in order to visually illustrate the performance of different methods. We generate a synthetic dataset in 2 dimensions with 2 classes and 10 points per class. We then fit a multi-layer perceptron (MLP) with one hidden layer of 10 ReLU units and 2 softmax outputs (denoted 2-10-2) using SGD. The resulting predictions are shown in Figure 1(a). We see the expected sigmoidal probability ramp orthogonal to the linear decision boundary. Unfortunately, this method predicts a label of 0 or 1 with very high confidence, even for points that are far from the training data (e.g., in the top left and bottom right corners).

In Figure 1(b), we show the result of HMC using 20k samples. This is the true posterior predictive density which we wish to approximate. In Figure 1(c), we show the result of SGLD using about 1000 samples. Specifically, we generate 100k samples, discard the first 2k for burn-in, and then keep every 100th sample. We see that this is a good approximation to the HMC distribution.

In Figures 1(d-f), we show the results of approximating the SGLD Monte Carlo predictive distribution with a single student MLP of various sizes. To train this student network, we sampled points at random from the domain of the input, $[-10, 10] \times [-10, 10]$; this encourages the student to predict accurately at all locations, including those far from the training data.
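A student-data generator for this toy problem only has to sample the input square uniformly. Something along the following lines would suffice; the helper below is an assumed illustration, not the authors' code.

```python
import numpy as np

def sample_student_inputs_2d(batch_size, low=-10.0, high=10.0, rng=None):
    # Unlabelled student inputs drawn uniformly from [-10, 10]^2, so the
    # student is supervised by the teacher even far from the training data.
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(low, high, size=(batch_size, 2))
```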
In Figure 1(d), the student has the same architecture as the teacher (2-10-2), but this is too simple a model to capture the complexity of the predictive distribution (which is an average over models). In Figure 1(e), the student has a larger hidden layer, which works better. However, we get the best results using a student with two hidden layers, as shown in Figure 1(f).

In Table 2, we show the KL divergence between the HMC distribution (which we consider ground truth) and the various approximations mentioned above. We computed this by comparing the probability distributions pointwise on a 2d grid. The numbers match the qualitative results shown in Figure 1.

Table 2: KL divergence from the HMC predictive distribution on the 2d classification dataset, together with the number of parameters of each model (rows: SGD, SGLD with 40k parameters in total, and the three distilled student architectures).

3.2 MNIST classification

Now we consider the MNIST digit classification problem, which has N = 60k examples, 10 classes, and D = 784 features. The only preprocessing we do is to divide the pixel values by 126 (as in [BCKW15]). We train on only 50k datapoints and use the remaining 10k for tuning hyper-parameters. This means our results are not strictly comparable to much of the published work, which uses the whole dataset for training; however, the difference is likely to be small.

Following [BCKW15], we use an MLP with 2 hidden layers of 400 hidden units each, ReLU activations, and softmax outputs; we denote this by 784-400-400-10. This model has about 500k parameters.

We first fit this model by SGD, using the following hyper-parameters: a fixed learning rate $\eta_t$, prior precision $\lambda = 1$, minibatch size $M = 100$, and $T = 1$M iterations. As shown in Table 3, our final error rate on the test set is 1.536%, which is a bit lower than the SGD number reported in [BCKW15], perhaps due to the slightly different training/validation configuration.

Next we fit this model by SGLD, using the following hyper-parameters: a fixed learning rate $\eta_t$, thinning interval $\tau = 100$, burn-in of $B = 1000$ iterations, prior precision $\lambda = 1$, and minibatch size $M = 100$. As shown in Table 3, our final error rate on the test set is about 1.271%, which is better than the SGD, Dropout, and BBB results from [BCKW15] (see Footnote 5).

Table 3: Test set misclassification rate on MNIST for different methods using a two-hidden-layer MLP (columns: SGD as reported in [BCKW15], Dropout, BBB, SGD (our impl.), SGLD, and distilled SGLD). The SGD (first column), Dropout, and BBB numbers are quoted from [BCKW15]. For our implementation of SGD (fourth column), SGLD, and distilled SGLD, we report the mean misclassification rate over 10 runs and its standard error.

Finally, we consider distillation, where the teacher is an SGLD Monte Carlo approximation of the posterior predictive. We use the same architecture for the student as for the teacher. We generate data for the student by adding Gaussian noise (with standard deviation 0.001) to randomly sampled training points (see Footnote 6). We use a constant learning rate of $\rho = 0.005$, a batch size of $M = 100$, and a prior precision $\gamma$ for the student, and we train for $T = 1$M iterations. We obtain a test error of 1.307%, which is very close to that obtained with SGLD (see Table 4).

Footnote 5: We only show the BBB results with the same Gaussian prior that we use. The performance of BBB can be improved using other priors, such as a scale mixture of Gaussians, as shown in [BCKW15]. Our approach could probably also benefit from such a prior, but we did not try this.

Footnote 6: In the future, we would like to consider more sophisticated data perturbations, such as elastic distortions.
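The student-data generator used here simply perturbs randomly chosen training images with Gaussian noise of standard deviation 0.001. A minimal sketch follows; the function name and tensor layout are our assumptions, not the authors' code.

```python
import torch

def mnist_student_batch(train_images, batch_size, noise_std=0.001):
    # Unlabelled student inputs: random training images plus small Gaussian
    # noise; the soft labels come from the SGLD teacher, not from the dataset.
    idx = torch.randint(0, train_images.shape[0], (batch_size,))
    x = train_images[idx]
    return x + noise_std * torch.randn_like(x)
```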
Table 4: Log likelihood per test example on MNIST (columns: SGD, SGLD, distilled SGLD). We report the mean over 10 trials ± one standard error.

We also report the average test log-likelihood of SGD, SGLD, and distilled SGLD in Table 4. The log-likelihood is equivalent to the logarithmic scoring rule [Bic07] used in assessing the calibration of probabilistic models. The logarithmic rule is a strictly proper scoring rule, meaning that the score is uniquely maximized by predicting the true probabilities. From Table 4, we see that both SGLD and distilled SGLD achieve higher scores than SGD, and therefore produce better-calibrated predictions.

Note that the SGLD results were obtained by averaging predictions from 10,000 models sampled from the posterior, whereas distillation produces a single neural network that approximates the average prediction of these models; i.e., distillation reduces both the storage and the test-time cost of SGLD by a factor of 10,000, without sacrificing much accuracy. In terms of training time, SGD took 1.3 ms, SGLD 1.6 ms, and distilled SGLD 3.2 ms per iteration. In terms of memory, distilled SGLD requires only twice as much as SGD or SGLD during training, and the same as SGD during testing.

3.3 Toy 1d regression

We start with a toy 1d regression problem, in order to visually illustrate the performance of different methods. We use the same data and model as [HLA15]. In particular, we use N = 20 points in D = 1 dimensions, sampled from the function $y = x^3 + \epsilon_n$, where $\epsilon_n \sim \mathcal{N}(0, 9)$. We fit this data with an MLP with 10 hidden units and ReLU activations. For SGLD, we use S = 2000 samples. For distillation, the teacher uses the same architecture as the student. The results are shown in Figure 2. We see that SGLD is a better approximation to the true (HMC) posterior predictive density than the plug-in SGD approximation (which has no predictive uncertainty) and the VI approximation of [Gra11]. Finally, we see that distilling SGLD incurs little loss in accuracy, but saves a lot of computation.

Figure 2: Predictive distribution for different methods on a toy 1d regression problem. (a) PBP of [HLA15]. (b) HMC. (c) VI method of [Gra11]. (d) SGD. (e) SGLD. (f) Distilled SGLD. Error bars denote 3 standard deviations. (Figures a-d kindly provided by the authors of [HLA15]. We replace their term BP (backprop) with SGD to avoid confusion.)
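For the regression experiments, the reported metric (as in Tables 4 and 5) is the average per-example predictive log-likelihood. Under the Gaussian observation model of Section 2.2, one way to compute it for the SGLD ensemble and for the distilled student is sketched below; the function names and tensor shapes are our own assumptions, not the authors' evaluation code.

```python
import math
import torch

def ensemble_loglik(y, preds, noise_precision):
    # Average test log-likelihood of the SGLD ensemble:
    # p(y | x) ~= (1/S) * sum_s N(y | f(x | theta_s), 1/noise_precision).
    # preds: (S, n_test) teacher predictions; y: (n_test,) targets.
    S = preds.shape[0]
    var = 1.0 / noise_precision
    logp = -0.5 * ((y - preds) ** 2 / var + math.log(2 * math.pi * var))
    return (torch.logsumexp(logp, dim=0) - math.log(S)).mean()

def student_loglik(y, mu, alpha):
    # Average test log-likelihood of the distilled student N(y | mu, exp(alpha)).
    return (-0.5 * (alpha + (y - mu) ** 2 * torch.exp(-alpha)
                    + math.log(2 * math.pi))).mean()
```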
3.4 Boston housing

Finally, we consider a larger regression problem, namely the Boston housing dataset, which was also used in [HLA15]. This has N = 506 data points (456 training, 50 testing) with D = 13 dimensions. Since this dataset is so small, we repeat all experiments 20 times, using different train/test splits. Following [HLA15], we use an MLP with 1 hidden layer of 50 hidden units and ReLU activations.

First we use SGD, with the following hyper-parameters (see Footnote 7): minibatch size $M = 1$, noise precision $\lambda_n = 1.25$, prior precision $\lambda = 1$, 20 trials, a constant learning rate $\eta_t = 10^{-6}$, and $T = 170$K iterations. The resulting average test log-likelihood is shown in Table 5.

Table 5: Log likelihood per test example on the Boston housing dataset (rows: PBP as reported in [HLA15], VI as reported in [HLA15], SGD, SGLD, and distilled SGLD). We report the mean over 20 trials ± one standard error.

Next we fit the model using SGLD. We use an initial learning rate of $\eta_0 = 10^{-5}$, which we reduce by a factor of 0.5 every 80K iterations; we use 500K iterations, a burn-in of 10K, and a thinning interval of 10. As shown in Table 5, we get an average log-likelihood of -2.306, which is better than SGD.

Finally, we distill our SGLD model. The student architecture is the same as the teacher's. We use the following teacher hyper-parameters: prior precision $\lambda = 2.5$ and an initial learning rate of $\eta_0 = 10^{-5}$, which we reduce by a factor of 0.5 every 80K iterations. For the student, we generate training data by adding Gaussian noise with standard deviation 0.05, and we use a prior precision of $\gamma = 0.001$ and an initial learning rate of $\rho_0 = 10^{-2}$, which we reduce by a factor of 0.8 every 5K iterations. As shown in Table 5, we get an average log-likelihood of -2.350, which is only slightly worse than SGLD and much better than SGD. Furthermore, both SGLD and distilled SGLD are better than the PBP method of [HLA15] and the VI method of [Gra11].

Footnote 7: We choose all hyper-parameters using cross-validation, whereas [HLA15] performs posterior inference on the noise and prior precisions and uses Bayesian optimization to choose the remaining hyper-parameters.

4 Conclusions and future work

We have presented a very simple method for being Bayesian about neural networks (and other kinds of models) that seems to work better than recently proposed alternatives based on EP [HLA15] and VB [Gra11, BCKW15]. There are various things we would like to do in the future: (1) show the utility of our model in an end-to-end task where predictive uncertainty is useful (such as contextual bandits or active learning); (2) consider ways to reduce the variance of the algorithm, perhaps by keeping a running minibatch of parameters uniformly sampled from the posterior, which can be done online using reservoir sampling; (3) explore more intelligent data-generation methods for training the student; (4) investigate whether our method reduces the prevalence of confident false predictions on adversarially generated examples, such as those discussed in [SZS+14].

Acknowledgements

We thank José Miguel Hernández-Lobato, Julien Cornebise, Jonathan Huang, George Papandreou, Sergio Guadarrama and Nick Johnston.
References

[AKW12] S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.
[ASW14] Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient MCMC. In ICML, 2014.
[BCKW15] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.
[BCNM06] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.
[Bic07] J. Eric Bickel. Some comparisons among quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4(2):49-65, 2007.
[CFG14] Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
[DFB+14] N. Ding, Y. Fang, R. Babbush, C. Chen, R. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
[GG15] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. 6 June 2015.
[Gra11] Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.
[HLA15] J. Hernández-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.
[HVD14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.
[KW14] Diederik P. Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. In ICLR, 2014.
[Nea11] Radford Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman and Hall, 2011.
[PT13] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, 2013.
[RBK+14] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv, 2014.
[RMW14] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[SG05] Edward Snelson and Zoubin Ghahramani. Compact approximations to Bayesian predictive distributions. In ICML, 2005.
[SZS+14] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.
[WT11] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.