Machine Learning Overview
Sargur N. Srihari
University at Buffalo, State University of New York, USA
Outline
1. What is Machine Learning (ML)?
   1. As a scientific discipline
   2. As an area of Computer Science/AI
2. Core learning methods
   1. Supervised (Regression/Classification/Deep)
   2. Unsupervised (PCA, Clustering, Topic Models)
   3. Reinforcement
3. Main drivers
   1. Mobile systems (big data)
   2. Personalization
Machine Learning as a Discipline
Focused on two inter-related fundamental scientific/engineering questions:
1. How can one construct computer systems that automatically improve through experience?
2. What are the statistical-computational-information-theoretic laws that govern all learning systems, including computers, humans, and organizations?
Machine learning is also important for highly practical computer software fielded across many applications.
Machine Learning as a Software Area
Programming computers to perform tasks that humans do well but that are difficult to specify algorithmically.
A principled way of building high-performance information processing systems:
- Probabilistic responses to queries, e.g., information retrieval (IR)
- Adaptive user interfaces, personalized assistants (information systems)
- Scientific/engineering applications
ML within AI
ML has emerged as the method of choice for practical software in:
- Computer vision
- Speech recognition
- Natural language processing
- Robot control
- Other applications
It is far easier to train a system by showing examples of desired input-output behavior than to manually anticipate the correct response for every input.
Example Problem: Handwritten Digit Recognition
- Wide variability in how the same numeral is written
- Handcrafted rules result in a large number of rules and exceptions
- Better to have a machine that learns from a large training set
- Practical handwriting recognition cannot be done without machine learning!
A Most Successful Application of ML
Learning to recognize spoken words:
- Speaker-specific strategies for recognizing primitive sounds (phonemes) and words from the speech signal
- Neural networks and methods for learning HMMs, used to customize to individual speakers, vocabularies, and microphone characteristics
- Google recently increased speech recognition accuracy for Android by 25%
ML Example: Self-Driving Vehicles
Learning to drive an autonomous vehicle:
- Train computer-controlled vehicles to steer correctly by associating steering commands with image sequences
- ALVINN: drove at 70 mph for 90 miles on public highways
- Google prototype, Tesla Autopilot
- Deployment: taxi and courier services
Drivers of ML Progress
- Mobile systems gather and transport vast amounts of data ("big data"); organizations turn to ML solutions to obtain insights, predictions, and decisions
- Fine-grained personalized data enables personalization: relevance of posts shown, advertising copywriting
- Historical medical records: determine treatment
- Historical traffic data: control congestion
Learning Problem Definition
Improving some measure of performance P when executing some task T, through some type of training experience E.
Example: learning to detect credit card fraud
- Task T: assign the label fraud or not fraud to a credit card transaction
- Performance measure P: accuracy of the fraud classifier, with a higher penalty when fraud is labeled as not fraud
- Training experience E: historical credit card transactions labeled as fraud or not
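To make the T/P/E framing concrete, here is a minimal sketch in Python, assuming scikit-learn; the data and the 10:1 class weighting are invented stand-ins, not part of the slides:

```python
# Minimal sketch of the fraud-detection learning problem (illustrative only).
# The dataset here is synthetic stand-in data for the training experience E.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # transaction features
y = (X[:, 0] + X[:, 1] > 2).astype(int)         # 1 = fraud (rare), 0 = not fraud

# Task T: label transactions. Penalizing missed fraud more heavily (P)
# is approximated here by up-weighting the fraud class during training.
clf = LogisticRegression(class_weight={0: 1, 1: 10})
clf.fit(X, y)                                   # training experience E

print(confusion_matrix(y, clf.predict(X)))      # off-diagonals show both error types
```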
The ML Approach
Training (generalization):
1. Data collection: samples
2. Model selection: a probability distribution to model the process
3. Parameter estimation: values/distributions
Inference or testing:
4. Inference: find responses to queries
5. Decision
ML History within AI
Machine learning/pattern recognition methods have been around for over 50 years.
Core Methods
1. Supervised learning: training data consists of (x, y) pairs; the goal is to predict y* for a new input x*
2. Unsupervised learning: analysis of unlabeled data
3. Reinforcement learning: training data is in between supervised and unsupervised; there is only an indication of whether an action is correct or not, and the reward signal may refer to an entire input sequence
Supervised Learning
The most widely used methods of ML, e.g.:
- Spam classification of email
- Face recognizers over images
- Medical diagnosis systems
Inputs x are vectors or more complex objects: documents, DNA sequences, or graphs.
Outputs y may be:
- Binary, multiclass (K classes), multi-label (more than one class), or a ranking
- Structured: y is a graph satisfying constraints, e.g., POS tagging
- Real-valued, or a mixture of discrete and real-valued
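A brief supervised-learning sketch, assuming scikit-learn and its bundled digits data; the model choice (an SVM) is illustrative:

```python
# Minimal supervised-learning sketch: learn from (x, y) pairs,
# then predict y* for unseen x*.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)             # labeled (x, y) pairs
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(gamma=0.001)                          # a multiclass classifier
clf.fit(X_tr, y_tr)                             # training
print("test accuracy:", clf.score(X_te, y_te))  # prediction quality on unseen x*
```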
Supervised Classification Example
Off-shore oil transfer pipelines: non-invasive measurement of the proportions of oil, water, and gas, known as three-phase oil/water/gas flow.
Input data: dual-energy gamma densitometry
- A beam of gamma rays is passed through the pipe; attenuation in intensity indicates the density of the material
- A single beam is insufficient: there are two degrees of freedom, the fraction of oil and the fraction of water
- Each beam uses gamma rays of two energies (frequencies); six beams give 12 attenuation measurements
Prediction Problems
1. Predict the volume fractions of oil/water/gas (regression)
2. Predict the flow configuration (one of three classes) from the twelve features
[Figure: scatter plot of two of the twelve variables, 100 points shown, three classes]
Naive cell-based voting fails due to exponential growth of the number of cells with dimensionality: discretizing each of the 12 dimensions into 6 bins gives $6^{12} \approx 2 \times 10^9$ cells, so there are hardly any points in each cell. Which class should a new point x belong to?
Probability Theory
Sum rule (marginalization): $p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j)$
Product rule (for combining): $p(X, Y) = p(Y \mid X)\, p(X)$
Bayes rule: $p(Y \mid X) = \dfrac{p(X \mid Y)\, p(Y)}{p(X)}$, where $p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$
Viewed as: posterior ∝ likelihood × prior
Fully Bayesian approach:
- Conjugate distributions
- Feasible with increased computational power
- Intractable posteriors handled using either variational Bayes or stochastic sampling, e.g., Markov chain Monte Carlo, Gibbs sampling
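A numeric illustration of Bayes rule; the prior and likelihood values below are hypothetical, chosen only to show posterior ∝ likelihood × prior:

```python
# Numeric illustration of Bayes rule (hypothetical numbers, not from the slides).
# Y = disease present, X = positive test.
p_y = 0.01                      # prior p(Y)
p_x_given_y = 0.95              # likelihood p(X|Y)
p_x_given_not_y = 0.05          # p(X|not Y)

# The sum and product rules give the evidence p(X) = sum_Y p(X|Y) p(Y).
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Bayes rule: posterior = likelihood * prior / evidence
p_y_given_x = p_x_given_y * p_y / p_x
print(f"p(Y|X) = {p_y_given_x:.3f}")   # ~0.161, despite the accurate test
```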
Probability Distributions
Discrete, binary:
- Bernoulli: single binary variable
- Binomial: N samples of a Bernoulli (Bernoulli is the N = 1 case)
- Conjugate prior: Beta, a continuous variable on [0, 1]
Discrete, multi-valued:
- Multinomial: one of K values, represented as a K-dimensional binary vector (binomial is the K = 2 case; for large N it approaches a Gaussian)
- Conjugate prior: Dirichlet, K random variables on [0, 1]
Continuous:
- Gaussian
- Student's t: generalization of the Gaussian, robust to outliers; an infinite mixture of Gaussians
- Gamma: conjugate prior of the univariate Gaussian precision; the exponential is a special case
- Gaussian-Gamma: conjugate prior of the univariate Gaussian with unknown mean and precision
- Wishart: conjugate prior of the multivariate Gaussian precision matrix
- Gaussian-Wishart: conjugate prior of the multivariate Gaussian with unknown mean and precision matrix
Angular:
- Von Mises
Uniform
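A minimal conjugacy sketch from this chart, assuming SciPy: a Beta prior combined with Bernoulli observations yields a Beta posterior in closed form (the prior parameters and data are made up):

```python
# Conjugacy sketch: Beta prior + Bernoulli likelihood -> Beta posterior.
from scipy.stats import beta

a0, b0 = 2.0, 2.0                   # Beta(a0, b0) prior on the Bernoulli parameter mu
data = [1, 0, 1, 1, 1, 0, 1]        # observed binary outcomes (made up)

ones = sum(data)
zeros = len(data) - ones
a_n, b_n = a0 + ones, b0 + zeros    # closed-form posterior: Beta(a0 + #1s, b0 + #0s)

print("posterior mean of mu:", beta(a_n, b_n).mean())   # a_n / (a_n + b_n)
```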
Statistical Models
Generative:
- Naive Bayes
- Mixtures of multinomials
- Mixtures of Gaussians
- Hidden Markov models (HMM)
- Bayesian networks
- Markov random fields
Discriminative:
- Logistic regression
- SVMs
- Traditional neural networks
- Nearest neighbor
- Conditional random fields (CRF)
HMMs for Speech Recognition
Three distinct layers:
1. Language model: generates sentences as sequences of words
2. Word model: a word is described as a sequence of phonemes, e.g., "push" = /p/ /u/ /sh/
3. Acoustic model: shows the progression of the acoustic signal through a phoneme
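Decoding in such layered HMMs rests on dynamic programming (the Viterbi algorithm); here is a self-contained toy sketch in which the states, probabilities, and observations are invented for illustration:

```python
# Toy Viterbi decoding for a two-state HMM (all numbers invented).
import numpy as np

start = np.array([0.6, 0.4])                 # initial state probabilities
trans = np.array([[0.7, 0.3],                # transition matrix A[i, j] = p(j | i)
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],            # emission matrix B[i, k] = p(obs k | state i)
                 [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2]                           # observed symbol sequence

# Work in log space; delta[i] = best log-prob of any path ending in state i.
delta = np.log(start) + np.log(emit[:, obs[0]])
back = []
for o in obs[1:]:
    scores = delta[:, None] + np.log(trans)  # scores[i, j]: end in j coming from i
    back.append(scores.argmax(axis=0))       # best predecessor for each state j
    delta = scores.max(axis=0) + np.log(emit[:, o])

# Backtrack the most likely state sequence.
path = [int(delta.argmax())]
for ptr in reversed(back):
    path.append(int(ptr[path[-1]]))
print("best state path:", path[::-1])
```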
DBN for Monitoring a Vehicle
A dynamic Bayesian network represents the system dynamics across time slices (t, t+1, ...), with variables Weather, Velocity, Location, sensor Failure, and Observation in each slice:
- Bad weather makes the sensor likely to fail
- Location depends on the previous position and velocity
- The observation depends on the car's location (the map is not modeled) and on the error status of the sensor (failure)
Regression
[Figure: a regression problem data set, and the corresponding inverse problem obtained by reversing x and t]
The red curve is the result of fitting a two-layer neural network by minimizing squared error. It gives a very poor fit to the inverse data, which is multimodal; Gaussian mixture models are used here instead.
Regression: Learning to Rank
Input x_i: d features of a query-URL pair (d > 200). The LETOR 4.0 dataset has 46 query-document features and a maximum of 124 URLs per query. Example features:
- Log frequency of query in anchor text
- Query word in color on page
- Number of images on page
- Number of (out) links on page
- PageRank of page
- URL length
- URL contains "~"
- Page length
(Traditional IR uses TF-IDF.)
Output y: relevance value (target variable)
- Point-wise: 0, 1, 2, 3
- Regression returns a continuous value, allowing fine-grained ranking of URLs
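A hedged sketch of the point-wise approach, treating ranking as regression on relevance values; the features and labels below are synthetic, not LETOR data:

```python
# Point-wise learning-to-rank sketch (synthetic data, not LETOR).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 46))                    # 46 query-document features
w_true = rng.normal(size=46)
y = np.clip(X @ w_true, 0, 3)                     # relevance targets in [0, 3]

model = Ridge(alpha=1.0).fit(X, y)                # regression on relevance values

# Rank the candidate URLs for one query by predicted relevance (descending).
candidates = rng.normal(size=(10, 46))
order = np.argsort(-model.predict(candidates))
print("ranked URL indices:", order)
```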
Deep Learning
A multilayer stack of simple modules subject to learning, computing non-linear maps (e.g., ReLU); typically 5 to 20 layers.
- Sensitive to minute details (distinguishing Samoyeds from white wolves)
- Invariant to background, pose, lighting, and other objects
Convolutional nets alternate convolutional layers and pooling layers, with stunning success.
ConvNet + recurrent net:
1. Representation of the image by a CNN
2. RNN trained to translate the representation into text
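A minimal convolutional-net sketch, assuming PyTorch; the architecture (channel counts, layer depth) is an illustrative choice, not a model from the slide:

```python
# Minimal ConvNet sketch in PyTorch: alternating convolution + ReLU and pooling.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer
    nn.ReLU(),                                   # non-linear map
    nn.MaxPool2d(2),                             # pooling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # class scores, e.g., 10 digits
)

x = torch.randn(8, 1, 28, 28)                    # batch of 28x28 grayscale images
print(model(x).shape)                            # torch.Size([8, 10])
```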
Unsupervised Learning
Analysis of unlabeled data under assumptions about its underlying structure, e.g.:
1. Clustering: find a partition of the data
2. Identify a low-dimensional manifold: PCA, manifold learning, factor analysis, random projections, auto-encoders
3. Topic modeling, recommendation systems
A criterion function is used, e.g., maximum likelihood. Computational complexity is key to exploiting large unlabeled data sets.
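A short PCA sketch on synthetic data, showing projection of unlabeled observations onto a low-dimensional subspace (scikit-learn assumed):

```python
# PCA sketch: recover a low-dimensional subspace from unlabeled data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))               # 2-D structure...
X = latent @ rng.normal(size=(2, 10))            # ...embedded in 10-D observations
X += 0.05 * rng.normal(size=X.shape)             # plus a little noise

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)                             # low-dimensional representation
print("explained variance ratio:", pca.explained_variance_ratio_)
```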
Clustering
Finding a partition for observed data, and a rule for predicting future data.
Example: Old Faithful geyser in Yellowstone; 272 observations of eruption duration (minutes, horizontal axis) vs. time to next eruption (vertical axis).
A single Gaussian is unable to capture the structure of such data sets; a linear superposition of two Gaussians is better. Gaussian mixture models give very complex densities:
$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
where the mixing coefficients $\pi_k$ sum to one.
[Figure: in one dimension, three Gaussians in blue and their sum in red]
The log-likelihood function is
$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$
There is no closed-form solution: use either iterative numerical optimization or Expectation Maximization (EM).
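A sketch of fitting a two-component Gaussian mixture by EM, assuming scikit-learn; the data are a synthetic stand-in for the Old Faithful measurements:

```python
# Fit a 2-component Gaussian mixture by EM (synthetic stand-in for Old Faithful).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
short = rng.normal([2.0, 55.0], [0.3, 6.0], size=(100, 2))   # short eruptions
long_ = rng.normal([4.3, 80.0], [0.4, 6.0], size=(172, 2))   # long eruptions
X = np.vstack([short, long_])                                # 272 observations

gmm = GaussianMixture(n_components=2).fit(X)   # EM maximizes the log-likelihood
print("mixing coefficients:", gmm.weights_)    # pi_k, summing to one
print("component means:\n", gmm.means_)
```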
Topic Models
Unsupervised methods to analyze documents:
- Topics are distributions over words; a document is a distribution across topics
- Methods: SVD, collaborative filtering
Example learned topics (word, probability):
- Topic 1: training 0.08, network 0.05, neural 0.03, ...
- Topic 2: noise 0.017, uncertain 0.011, reliability 0.010, positive 0.0084, ...
- Topic 3: data 0.1041, estimate 0.020, estimation 0.019, ...
[Example document analyzed: a paper abstract beginning "The ability to learn from data with uncertain and missing information is a fundamental requirement for learning systems...", whose inferred topic distribution is spread across Topics 1-3]
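A small topic-model sketch using latent Dirichlet allocation from scikit-learn (the slide's SVD is an alternative method); the corpus is a tiny invented example:

```python
# Toy topic-model sketch with LDA (invented corpus, purely illustrative).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural network training data",
    "noise and uncertain measurements reduce reliability",
    "estimate parameters from data estimation",
    "training a neural network on noisy data",
]

X = CountVectorizer().fit_transform(docs)          # document-word counts

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.components_.shape)     # per-topic word weights (normalize for distributions)
print(lda.transform(X))          # each document as a distribution over topics
```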
Recommendation Systems
- Data indicates links between users and items
- Suggest other items to a user based on data across all users
- Solutions: SVD, collaborative filtering
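A minimal SVD-based collaborative-filtering sketch on a made-up user-item rating matrix, using only NumPy:

```python
# SVD-based collaborative-filtering sketch (tiny made-up rating matrix).
import numpy as np

# Rows = users, columns = items; 0 marks an unrated item.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                        # keep the top-k latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # low-rank reconstruction

# Predicted scores for previously unrated entries suggest items to recommend.
print(np.round(R_hat, 2))
```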
Reinforcement Learning
- Analogy: a dog is given a reward/punishment for an action
- Policies: what actions to take in a particular situation
- Utility estimation: how good is a state (used by the policy)
- No supervised output, but a delayed reward
- Credit assignment: what was responsible for the outcome?
- Applications: game playing, robot in a maze, multiple agents, partial observability, ...
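A compact tabular Q-learning sketch for a one-dimensional "robot in a corridor" task; the environment, rewards, and hyperparameters are invented for illustration:

```python
# Tabular Q-learning on a 1-D corridor (illustrative environment).
import numpy as np

n_states, goal = 6, 5            # states 0..5; reward only at the goal (delayed)
actions = [-1, +1]               # move left or right
Q = np.zeros((n_states, 2))
rng = np.random.default_rng(4)
alpha, gamma, eps = 0.5, 0.9, 0.1

for _ in range(500):
    s = 0
    while s != goal:
        a = rng.integers(2) if rng.random() < eps else int(Q[s].argmax())
        s2 = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s2 == goal else 0.0
        # Q-learning update: credit propagates back from the delayed reward.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print("greedy policy (0=left, 1=right):", Q.argmax(axis=1))
```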
Causal Learning
A causal Bayesian network over variables such as Age, Gender, Exposure to Toxics, Smoking, and Cancer.
Example of inference: Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.
Such structure makes inference computationally feasible.
Summary
Machine learning as a discipline:
- Studies how systems improve with experience
- Studies the statistical-computational-information-theoretic laws governing learning systems
- Used with success in all AI applications
Core methods:
- Supervised: classification, regression, ranking
- Fully Bayesian approaches, together with variational methods and Monte Carlo sampling
- Deep learning
- Unsupervised: PCA, topic models, clustering
- Reinforcement learning
Drivers: mobile systems (big data), personalization