Foundations of Data Science


John Hopcroft and Ravindran Kannan
Version 2/8/204

These notes are a first draft of a book being written by Hopcroft and Kannan and in many places are incomplete. However, the notes are in good enough shape to prepare lectures for a modern theoretical course in computer science. Please do not put solutions to exercises online, as it is important for students to work out solutions for themselves rather than copy them from the internet.

Thanks,
JEH

Copyright 20. All rights reserved
Contents

1 Introduction

2 High-Dimensional Space
    Properties of High-Dimensional Space
    The Law of Large Numbers
    The High-Dimensional Sphere
    The Sphere and the Cube in High Dimensions
    Volume and Surface Area of the Unit Sphere
    The Volume is Near the Equator
    The Volume is in a Narrow Annulus
    The Surface Area is Near the Equator
    Volumes of Other Solids
    Generating Points Uniformly at Random on the Surface of a Sphere
    Gaussians in High Dimension
    Bounds on Tail Probability
    Applications of the Tail Bound
    Random Projection and Johnson-Lindenstrauss Theorem
    Bibliographic Notes
    Exercises

3 Best-Fit Subspaces and Singular Value Decomposition (SVD)
    Singular Vectors
    Singular Value Decomposition (SVD)
    Best Rank k Approximations
    Left Singular Vectors
    Power Method for Computing the Singular Value Decomposition
    Applications of Singular Value Decomposition
    Principal Component Analysis
    Clustering a Mixture of Spherical Gaussians
    Spectral Decomposition
    Singular Vectors and Ranking Documents
    An Application of SVD to a Discrete Optimization Problem
    Singular Vectors and Eigenvectors
    Bibliographic Notes
    Exercises

4 Random Graphs
    The G(n, p) Model
    Degree Distribution
    Existence of Triangles in G(n, d/n)
    Phase Transitions
    The Giant Component
    Branching Processes
    Cycles and Full Connectivity
    Emergence of Cycles
    Full Connectivity
    Threshold for O(ln n) Diameter
    Phase Transitions for Increasing Properties
    Phase Transitions for CNF-sat
    Nonuniform and Growth Models of Random Graphs
    Nonuniform Models
    Giant Component in Random Graphs with Given Degree Distribution
    Growth Models
    Growth Model Without Preferential Attachment
    Growth Model With Preferential Attachment
    Small World Graphs
    Bibliographic Notes
    Exercises

5 Random Walks and Markov Chains
    Stationary Distribution
    Electrical Networks and Random Walks
    Random Walks on Undirected Graphs with Unit Edge Weights
    Random Walks in Euclidean Space
    The Web as a Markov Chain
    Markov Chain Monte Carlo
    Metropolis-Hasting Algorithm
    Gibbs Sampling
    Areas and Volumes
    Convergence of Random Walks on Undirected Graphs
    Using Normalized Conductance to Prove Convergence
    Bibliographic Notes
    Exercises

6 Learning and VC-Dimension
    Learning
    Linear Separators, the Perceptron Algorithm, and Margins
    Nonlinear Separators, Support Vector Machines, and Kernels
    Strong and Weak Learning - Boosting
    Number of Examples Needed for Prediction: VC-Dimension
    Vapnik-Chervonenkis or VC-Dimension
    Examples of Set Systems and Their VC-Dimension
    The Shatter Function
    Shatter Function for Set Systems of Bounded VC-Dimension
    Intersection Systems
    The VC Theorem
    Simple Learning
    Bibliographic Notes
    Exercises

7 Algorithms for Massive Data Problems
    Frequency Moments of Data Streams
    Number of Distinct Elements in a Data Stream
    Counting the Number of Occurrences of a Given Element
    Counting Frequent Elements
    The Second Moment
    Matrix Algorithms Using Sampling
    Matrix Multiplication Using Sampling
    Sketch of a Large Matrix
    Sketches of Documents
    Exercises

8 Clustering
    Some Clustering Examples
    A k-means Clustering Algorithm
    A Greedy Algorithm for k-center Criterion Clustering
    Spectral Clustering
    Recursive Clustering Based on Sparse Cuts
    Kernel Methods
    Agglomerative Clustering
    Dense Submatrices and Communities
    Flow Methods
    Finding a Local Cluster Without Examining the Whole Graph
    Axioms for Clustering
    An Impossibility Result
    A Satisfiable Set of Axioms
    Exercises

9 Topic Models, Hidden Markov Process, Graphical Models, and Belief Propagation
    Topic Models
    Hidden Markov Model
    Graphical Models, and Belief Propagation
    Bayesian or Belief Networks
    Markov Random Fields
    Factor Graphs
    Tree Algorithms
    Message Passing in General Graphs
    Graphs with a Single Cycle
    Belief Update in Networks with a Single Loop
    Maximum Weight Matching
    Warning Propagation
    Correlation Between Variables
    Exercises

10 Other Topics
    Rankings
    Hare System for Voting
    Compressed Sensing and Sparse Vectors
    Unique Reconstruction of a Sparse Vector
    The Exact Reconstruction Property
    Restricted Isometry Property
    Applications
    Sparse Vector in Some Coordinate Basis
    A Representation Cannot be Sparse in Both Time and Frequency Domains
    Biological
    Finding Overlapping Cliques or Communities
    Low Rank Matrices
    Gradient
    Linear Programming
    The Ellipsoid Algorithm
    Integer Optimization
    Semi-Definite Programming
    Exercises

Appendix
    Asymptotic Notation
    Useful Relations
    Useful Inequalities
    Probability
    Sample Space, Events, Independence
    Linearity of Expectation
    Union Bound
    Indicator Variables
    Variance
    Variance of the Sum of Independent Random Variables
    Median
    The Central Limit Theorem
    Probability Distributions
    Bayes Rule and Estimators
    Tail Bounds and Chernoff Inequalities
    Eigenvalues and Eigenvectors
    Eigenvalues and Eigenvectors
    Symmetric Matrices
    Relationship between SVD and Eigen Decomposition
    Extremal Properties of Eigenvalues
    Eigenvalues of the Sum of Two Symmetric Matrices
    Norms
    Important Norms and Their Properties
    Linear Algebra
    Distance between Subspaces
    Generating Functions
    Generating Functions for Sequences Defined by Recurrence Relationships
    The Exponential Generating Function and the Moment Generating Function
    Miscellaneous
    Lagrange Multipliers
    Finite Fields
    Hash Functions
    Application of Mean Value Theorem
    Sperner's Lemma
    Prüfer
    Exercises

Index
1 Introduction

Computer science as an academic discipline began in the 1960s. Emphasis was on programming languages, compilers, operating systems, and the mathematical theory that supported these areas. Courses in theoretical computer science covered finite automata, regular expressions, context-free languages, and computability. In the 1970s, the study of algorithms was added as an important component of theory. The emphasis was on making computers useful. Today, a fundamental change is taking place and the focus is more on applications. There are many reasons for this change. The merging of computing and communications has played an important role. The enhanced ability to observe, collect, and store data in the natural sciences, in commerce, and in other fields calls for a change in our understanding of data and how to handle it in the modern setting. The emergence of the web and social networks, which are by far the largest such structures, presents both opportunities and challenges for theory.

While traditional areas of computer science are still important and highly skilled individuals are needed in these areas, the majority of researchers will be involved with using computers to understand and make usable massive data arising in applications, not just with how to make computers useful on specific well-defined problems. With this in mind, we have written this book to cover the theory likely to be useful in the next 40 years, just as automata theory, algorithms, and related topics gave students an advantage in the last 40 years. One of the major changes is the switch from discrete mathematics to more of an emphasis on probability, statistics, and numerical methods.

Early drafts of the book have been used for both undergraduate and graduate courses. Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems.
This book starts with the treatment of high-dimensional geometry. Modern data in diverse fields such as Information Processing, Search, Machine Learning, etc., is often
represented advantageously as vectors with a large number of components. This is so even in cases when the vector representation is not the natural first choice. Our intuition from two- or three-dimensional space can be surprisingly off the mark when it comes to high-dimensional space. Chapter 2 works out the fundamentals needed to understand the differences. The emphasis of the chapter, as well as the book in general, is to get across the mathematical foundations rather than dwell on particular applications, which are only briefly described.

The mathematical areas most relevant to dealing with high-dimensional data are matrix algebra and algorithms. We focus on singular value decomposition, a central tool in this area. Chapter 3 gives a from-first-principles description of this. Applications of singular value decomposition include principal component analysis, a widely used technique which we touch upon, as well as modern applications to statistical mixtures of probability densities, discrete optimization, etc., which are described in more detail.

Central to our understanding of large structures, like the web and social networks, is building models to capture essential properties of these structures. The simplest model is that of a random graph formulated by Erdős and Rényi, which we study in detail, proving that certain global phenomena, like a giant connected component, arise in such structures with only local choices. We also describe other models of random graphs.

One of the surprises of computer science over the last two decades is that some domain-independent methods have been immensely successful in tackling problems from diverse areas. Machine learning is a striking example. We describe the foundations of machine learning, both learning from given training examples and the theory of Vapnik-Chervonenkis dimension, which tells us how many training examples suffice for learning. Another important domain-independent technique is based on Markov chains.
The underlying mathematical theory, as well as the connections to electrical networks, forms the core of our chapter on Markov chains.

The field of algorithms has traditionally assumed that the input data to a problem is presented in random access memory, which the algorithm can repeatedly access. This is not feasible for modern problems. The streaming model and other models have been formulated to better reflect this. In this setting, sampling plays a crucial role and, indeed, we have to sample on the fly. In Chapter ?? we study how to draw good samples efficiently and how to estimate statistical, as well as linear algebra, quantities with such samples.

One of the most important tools in the modern toolkit is clustering, dividing data into groups of similar objects. After describing some of the basic methods for clustering, such as the k-means algorithm, we focus on modern developments in understanding these, as well as newer algorithms. The chapter ends with a study of clustering criteria.

This book also covers graphical models and belief propagation, ranking and voting,
sparse vectors, and compressed sensing. The appendix includes a wealth of background material.

A word about notation in the book. To help the student, we have adopted certain notations and, with a few exceptions, adhered to them. We use lowercase letters for scalar variables and functions, boldface lowercase for vectors, and uppercase letters for matrices. Lowercase letters near the beginning of the alphabet tend to be constants; letters in the middle of the alphabet, such as i, j, and k, are indices in summations; n and m denote integer sizes; and x, y, and z denote variables. Where the literature traditionally uses a symbol for a quantity, we also use that symbol, even if it means abandoning our convention. If we have a set of points in some vector space and work with a subspace, we use n for the number of points, d for the dimension of the space, and k for the dimension of the subspace.

The term almost surely means with probability one. We use $\ln n$ for the natural logarithm and $\log n$ for the base-two logarithm. If we want base ten, we will use $\log_{10}$. To simplify notation and to make it easier to read, we use $E^2(x)$ for $\left(E(x)\right)^2$ and $E(x)^2$ for $E\left((x)^2\right)$.
2 High-Dimensional Space

In many applications data is in the form of vectors. In other applications, data is not in the form of vectors, but could be usefully represented by vectors. The Vector Space Model [SWY75] is a good example. In the vector space model, a document is represented by a vector, each component of which corresponds to the number of occurrences of a particular term in the document. The English language has on the order of 25,000 words or terms, so each document is represented by a 25,000-dimensional vector. A collection of n documents is represented by a collection of n vectors, one vector per document. The vectors may be arranged as columns of a $25{,}000 \times n$ matrix. See Figure 2.1.

A query is also represented by a vector in the same space. The component of the vector corresponding to a term in the query specifies the importance of the term to the query. To find documents about cars that are not race cars, a query vector will have a large positive component for the word "car" and also for the words "engine" and perhaps "door", and a negative component for the words "race", "betting", etc.

One needs a measure of relevance or similarity of a query to a document. The dot product or cosine of the angle between the two vectors is an often-used measure of similarity. To respond to a query, one computes the dot product or the cosine of the angle between the query vector and each document vector and returns the documents with the highest values of these quantities. While it is by no means clear that this approach will do well for the information retrieval problem, many empirical studies have established the effectiveness of this general approach.

The vector space model is useful in ranking or ordering a large collection of documents in decreasing order of importance. For large collections, an approach based on human understanding of each document is not feasible.
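The query-answering procedure described above (score every document by the cosine of its angle with the query vector, then return the highest-scoring documents) can be sketched in a few lines of Python. This is only an illustration: the five-term vocabulary, the three documents, their counts, and the query weights are all made up for the example and do not come from the text.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary; a real collection would have tens of thousands of terms.
TERMS = ["car", "engine", "door", "race", "betting"]

# Hypothetical term-count vectors, one per document, indexed like TERMS.
DOCS = {
    "family_sedan_review": [4, 2, 3, 0, 0],
    "grand_prix_report": [3, 1, 0, 5, 2],
    "garage_manual": [1, 4, 1, 0, 0],
}

# Query for cars that are not race cars: positive weights on "car",
# "engine", "door"; negative weights on "race" and "betting".
QUERY = [2.0, 1.0, 0.5, -2.0, -2.0]

def rank(query, docs):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine(query, vec), name) for name, vec in docs.items()]
    return sorted(scored, reverse=True)

ranking = rank(QUERY, DOCS)
for score, name in ranking:
    print(f"{score:+.3f}  {name}")
```

With these made-up numbers the racing report receives a negative score because of the negative query weights, so it is ranked last.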
Instead, an automated procedure is needed that is able to rank documents, with those central to the collection ranked highest. Each document is represented as a vector, with the vectors forming the columns of a matrix $A$. The similarity of pairs of documents is defined by the dot product of the vectors. All pairwise similarities are contained in the matrix product $A^T A$. If one assumes that the documents central to the collection are those with high similarity to other documents, then computing $A^T A$ enables one to create a ranking. Define the total similarity of document $i$ to be the sum of the entries in the $i$th row of $A^T A$ and rank documents by their total similarity. It turns out that with the vector representation on hand, a better way of ranking is to first find the best-fit direction, that is, the unit vector $u$ for which the sum of squared perpendicular distances of all the vectors to $u$ is minimized. See Figure 2.2. Then, one ranks the vectors according to their dot product with $u$. The best-fit direction is a well-studied notion in linear algebra. There is elegant theory, and there are efficient algorithms, presented in Chapter 3, that facilitate the ranking as well as applications in many other domains.

In the vector space representation of data, properties of vectors such as dot products,
distance between vectors, and orthogonality often have natural interpretations, and this is what makes the vector representation more important than just a bookkeeping device. For example, the squared distance between two 0-1 vectors representing links on web pages is the number of web pages linked to by only one of the pages. In Figure 2.3, pages 4 and 5 both have links to pages 1, 3, and 6, but only page 5 has a link to page 2. Thus, the squared distance between the two vectors is one. We have seen that dot products measure similarity. Orthogonality of two nonnegative vectors says that they are disjoint. Thus, if a document collection, e.g., all news articles of a particular year, contained documents on two or more disparate topics, vectors corresponding to documents from different topics would be nearly orthogonal.

Figure 2.1: A document and its term-document vector along with a collection of documents represented by their term-document vectors.

The dot product, cosine of the angle, distance, etc., are all measures of similarity or dissimilarity, but there are important mathematical and algorithmic differences between them. The random projection theorem presented in this chapter states that a collection of vectors can be projected to a lower-dimensional space while approximately preserving all pairwise distances between vectors. Thus, the nearest neighbors of each vector in the collection can be computed in the projected lower-dimensional space. Such a savings in time is not possible for computing pairwise dot products using a simple projection.

Our aim in this book is to present the reader with the mathematical foundations to deal with high-dimensional data. There are two important parts of this foundation. The first is high-dimensional geometry, along with vectors, matrices, and linear algebra. The second, more modern, aspect is the combination with probability.
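Returning to the web-page example above, the claim that the squared distance between two 0-1 link vectors counts the pages linked to by exactly one of the two pages is easy to check directly. The sketch below uses the two link vectors for pages 4 and 5 from Figure 2.3; the helper names and the second pair of vectors (used to illustrate orthogonality of nonnegative vectors with disjoint support) are illustrative choices, not part of the text.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Link vectors from Figure 2.3: component j is 1 if the page links to page j + 1.
page4 = [1, 0, 1, 0, 0, 1]  # links to pages 1, 3, and 6
page5 = [1, 1, 1, 0, 0, 1]  # links to pages 1, 2, 3, and 6

# Pages linked to by exactly one of the two pages (here only page 2).
linked_by_exactly_one = sum(1 for a, b in zip(page4, page5) if a != b)

print(squared_distance(page4, page5))  # squared distance between the link vectors
print(linked_by_exactly_one)           # same count, computed combinatorially
print(dot(page4, page5))               # number of pages both link to

# Hypothetical nonnegative vectors with disjoint support: their dot product is zero.
topic_a = [1, 0, 2, 0]
topic_b = [0, 3, 0, 1]
print(dot(topic_a, topic_b))
```

For 0-1 vectors each coordinate contributes either 0 or 1 to the squared distance, which is why the two counts agree.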
High dimensionality is a common characteristic in many models, and for this reason much of this chapter is devoted to the geometry of high-dimensional space, which is quite different from our intuitive understanding of two and three dimensions. We focus first on volumes and surface areas of high-dimensional objects like hyperspheres. We will not present details of any one application, but rather present the fundamental theory useful to many applications.

One reason probability comes in is that many computational problems are hard if our algorithms are required to be efficient on all possible data. In practical situations, domain knowledge often enables the expert to formulate stochastic models of data. In
customer-product data, a common assumption is that the goods each customer buys are independent of what goods the others buy. One may also assume that the goods a customer buys satisfy a known probability law, like the Gaussian distribution. In keeping with the spirit of the book, we do not discuss specific stochastic models, but present the fundamentals. An important fundamental is the law of large numbers, which states that under the assumption of independence of customers, the total consumption of each good is remarkably close to its mean value. The central limit theorem is of a similar flavor. Indeed, it turns out that picking random points from geometric objects like hyperspheres exhibits almost identical properties in high dimensions. One calls this phenomenon the law of large dimensions. We will establish these geometric properties first before discussing Chernoff bounds and related theorems on aggregates of independent random variables.

Figure 2.2: The best fit line is the line that minimizes the sum of the squared perpendicular distances.

Figure 2.3: Two web pages as vectors: web page 4 is (1,0,1,0,0,1) and web page 5 is (1,1,1,0,0,1). The squared distance between the two vectors is the number of web pages linked to by just one of the two web pages.

2.1 Properties of High-Dimensional Space

Our intuition about space was formed in two and three dimensions and is often misleading in high dimensions. Consider placing 100 points uniformly at random in a unit square. Each coordinate is generated independently and uniformly at random from the interval [0, 1]. Select a point and measure the distance to all other points and observe
the distribution of distances. Then increase the dimension and generate the points uniformly at random in a 100-dimensional unit cube. The distribution of distances becomes concentrated about an average distance. The reason is easy to see. Let $x$ and $y$ be two such points in $d$ dimensions. The distance between $x$ and $y$ is

$$|x - y| = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}.$$

Since $\sum_{i=1}^{d} (x_i - y_i)^2$ is the summation of a number of independent random variables of bounded variance, by the law of large numbers the distribution of $|x - y|^2$ is concentrated about its expected value. Contrast this with the situation where the dimension is two or three and the distribution of distances is spread out.

For another example, consider the difference between picking a point uniformly at random from a unit-radius circle and from a unit-radius sphere in $d$ dimensions. In $d$ dimensions the distance from the point to the center of the sphere is very likely to be between $1 - \frac{c}{d}$ and $1$, where $c$ is a constant independent of $d$. This implies that most of the mass is near the surface of the sphere. Furthermore, the first coordinate, $x_1$, of such a point is likely to be between $-\frac{c}{\sqrt{d}}$ and $+\frac{c}{\sqrt{d}}$, which we express by saying that most of the mass is near the equator. The equator perpendicular to the $x_1$ axis is the set $\{x \mid x_1 = 0\}$. We will prove these results in this chapter, but first a review of some probability.

2.2 The Law of Large Numbers

In the previous section, we claimed that points generated at random in high dimensions were all essentially the same distance apart. The reason is that if one averages $n$ independent samples $x_1, x_2, \ldots, x_n$ of a random variable $x$, the result will be close to the expected value of $x$. Specifically, the probability that the average will differ from the expected value by more than $\epsilon$ is at most $\frac{\sigma^2}{n\epsilon^2}$:

$$\mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| > \epsilon\right) \le \frac{\sigma^2}{n\epsilon^2}. \qquad (2.1)$$

Here the $\sigma^2$ in the numerator is the variance of $x$. The larger the variance of the random variable, the greater the probability that the error will exceed $\epsilon$.
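The bound (2.1) and the distance claim from the previous section can be tied together in a small simulation. The sketch below is an illustration under stated assumptions, not part of the original notes: it treats $|x - y|^2 / d$ as the average of the $d$ independent terms $(x_i - y_i)^2$ for points uniform in the unit cube, and it uses the values $E\left((x_i - y_i)^2\right) = 1/6$ and $\sigma^2\left((x_i - y_i)^2\right) = 7/180$, which were worked out for this example from the triangular density of $x_i - y_i$. The tolerance, trial count, and seed are arbitrary.

```python
import random

MEAN_TERM = 1 / 6    # E((x_i - y_i)^2) for independent uniform [0, 1] coordinates
VAR_TERM = 7 / 180   # variance of (x_i - y_i)^2, i.e. 1/15 - (1/6)^2

def mean_squared_gap(d, rng):
    """Return |x - y|^2 / d for two independent uniform points in [0, 1]^d."""
    return sum((rng.random() - rng.random()) ** 2 for _ in range(d)) / d

def exceed_fraction(d, eps, trials, rng):
    """Empirical frequency with which |x - y|^2 / d misses its mean by more than eps."""
    hits = sum(
        1 for _ in range(trials)
        if abs(mean_squared_gap(d, rng) - MEAN_TERM) > eps
    )
    return hits / trials

def chebyshev_bound(d, eps):
    """Right-hand side of (2.1) with n = d averaged terms."""
    return VAR_TERM / (d * eps ** 2)

rng = random.Random(0)  # fixed seed so the run is repeatable
EPS = 0.05
results = {
    d: (exceed_fraction(d, EPS, 2000, rng), chebyshev_bound(d, EPS))
    for d in (2, 100)
}
for d, (observed, bound) in results.items():
    print(f"d={d:3d}  observed={observed:.3f}  bound={bound:.3f}")
```

In two dimensions the bound exceeds one and says nothing, matching the remark that distances are spread out in low dimensions; in 100 dimensions the observed frequency should fall below the bound, which is itself far from tight.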
The number of points $n$ is in the denominator since the more values that are averaged, the smaller the probability that the difference will exceed $\epsilon$. Similarly, the larger $\epsilon$ is, the smaller the probability that the difference will exceed $\epsilon$, and hence $\epsilon$ is in the denominator. Notice that squaring $\epsilon$ makes the fraction a dimensionless quantity.

To prove the law of large numbers we use two inequalities. The first is Markov's inequality. One can bound the probability that a nonnegative random variable exceeds $a$ by the expected value of the variable divided by $a$.
Theorem 2.1 (Markov's inequality) Let $x$ be a nonnegative random variable. Then for $a > 0$,

$$\mathrm{Prob}(x \ge a) \le \frac{E(x)}{a}.$$

Proof: We prove the theorem for continuous random variables, so we use integrals. The same proof works for discrete random variables with sums instead of integrals.

$$E(x) = \int_0^\infty x\,p(x)\,dx = \int_0^a x\,p(x)\,dx + \int_a^\infty x\,p(x)\,dx \ge \int_a^\infty x\,p(x)\,dx \ge \int_a^\infty a\,p(x)\,dx = a\int_a^\infty p(x)\,dx = a\,\mathrm{Prob}(x \ge a).$$

Thus, $\mathrm{Prob}(x \ge a) \le \frac{E(x)}{a}$.

Corollary 2.2 $\mathrm{Prob}\left(x \ge cE(x)\right) \le \frac{1}{c}$.

Proof: Substitute $cE(x)$ for $a$.

Markov's inequality bounds the tail of a distribution using only information about the mean. A tighter bound can be obtained by also using the variance.

Theorem 2.3 (Chebyshev's inequality) Let $x$ be a random variable with mean $m$ and variance $\sigma^2$. Then

$$\mathrm{Prob}(|x - m| \ge a\sigma) \le \frac{1}{a^2}.$$

Proof: $\mathrm{Prob}(|x - m| \ge a\sigma) = \mathrm{Prob}\left((x - m)^2 \ge a^2\sigma^2\right)$. Note that $(x - m)^2$ is a nonnegative random variable, so Markov's inequality can be applied, giving

$$\mathrm{Prob}\left((x - m)^2 \ge a^2\sigma^2\right) \le \frac{E\left((x - m)^2\right)}{a^2\sigma^2} = \frac{\sigma^2}{a^2\sigma^2} = \frac{1}{a^2}.$$

Thus, $\mathrm{Prob}(|x - m| \ge a\sigma) \le \frac{1}{a^2}$.

The law of large numbers follows from Chebyshev's inequality. Recall that $E(x + y) = E(x) + E(y)$, $\sigma^2(cx) = c^2\sigma^2(x)$, $\sigma^2(x - m) = \sigma^2(x)$, and if $x$ and $y$ are independent, then $E(xy) = E(x)E(y)$ and $\sigma^2(x + y) = \sigma^2(x) + \sigma^2(y)$. To prove $\sigma^2(x + y) = \sigma^2(x) + \sigma^2(y)$ when $x$ and $y$ are independent, since $\sigma^2(x - m) = \sigma^2(x)$, one can assume $E(x) = 0$ and $E(y) = 0$. Thus,

$$\sigma^2(x + y) = E\left((x + y)^2\right) = E(x^2) + E(y^2) + 2E(xy) = E(x^2) + E(y^2) + 2E(x)E(y) = \sigma^2(x) + \sigma^2(y).$$

Replacing $E(xy)$ by $E(x)E(y)$ required independence.
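The two inequalities and the additivity of variance can be checked exactly on a small discrete distribution. The sketch below uses a fair six-sided die, an example chosen here for illustration and not taken from the text, and exact rational arithmetic so that no rounding is involved. Chebyshev's inequality is checked in its squared form, $\mathrm{Prob}\left((x - m)^2 \ge t\right) \le \sigma^2 / t$ with $t = a^2\sigma^2$, which avoids square roots.

```python
from fractions import Fraction

# Fair six-sided die: a nonnegative random variable with a known distribution.
DIE = {k: Fraction(1, 6) for k in range(1, 7)}

def expectation(dist, f=lambda v: v):
    """E(f(x)) for a finite distribution given as {value: probability}."""
    return sum(p * f(v) for v, p in dist.items())

def prob(dist, event):
    """Probability that the predicate `event` holds."""
    return sum(p for v, p in dist.items() if event(v))

mean = expectation(DIE)                                # 7/2
variance = expectation(DIE, lambda v: (v - mean) ** 2)  # 35/12

# Markov: Prob(x >= a) <= E(x) / a, stored as (tail probability, bound).
markov = {a: (prob(DIE, lambda v: v >= a), mean / a) for a in range(1, 7)}

# Chebyshev, squared form: Prob((x - m)^2 >= t) <= variance / t.
thresholds = (Fraction(9, 4), Fraction(25, 4))
chebyshev = {
    t: (prob(DIE, lambda v: (v - mean) ** 2 >= t), variance / t)
    for t in thresholds
}

# Variance of the sum of two independent dice.
two_dice = {}
for a, pa in DIE.items():
    for b, pb in DIE.items():
        two_dice[a + b] = two_dice.get(a + b, Fraction(0)) + pa * pb
variance_sum = expectation(two_dice, lambda v: (v - 2 * mean) ** 2)

for a, (tail, bound) in markov.items():
    print(f"Markov     a={a}: {tail} <= {bound}")
for t, (tail, bound) in chebyshev.items():
    print(f"Chebyshev  t={t}: {tail} <= {bound}")
print(f"variance of sum = {variance_sum}, twice the single variance = {2 * variance}")
```

For the die both bounds are loose, which is expected: they use only the mean, or the mean and variance, and nothing else about the distribution.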
Theorem 2.4 (Law of large numbers) Let $x_1, x_2, \ldots, x_n$ be $n$ independent samples of a random variable $x$. Then

$$\mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| > \epsilon\right) \le \frac{\sigma^2}{n\epsilon^2}.$$

Proof: By Chebyshev's inequality,

$$\mathrm{Prob}\left(\left|\frac{x_1 + x_2 + \cdots + x_n}{n} - E(x)\right| > \epsilon\right) \le \frac{\sigma^2\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right)}{\epsilon^2} = \frac{1}{n^2\epsilon^2}\,\sigma^2(x_1 + x_2 + \cdots + x_n) = \frac{1}{n^2\epsilon^2}\left(\sigma^2(x_1) + \sigma^2(x_2) + \cdots + \sigma^2(x_n)\right) = \frac{\sigma^2(x)}{n\epsilon^2}.$$

The law of large numbers bounds the difference of the sample average and the expected value. Note that the size of the sample for a given error bound is independent of the size of the population class. In the limit, when the sample size goes to infinity, the central limit theorem says that the distribution of the suitably centered and scaled sample average is Gaussian, provided the random variable has finite variance. Later, we will consider random variables that are the sum of random variables, that is, $x = x_1 + x_2 + \cdots + x_n$. Chernoff bounds will tell us about the probability of $x$ differing from its expected value. We will delay this until a later section.

Figure 2.4: Illustration of the relationship between the sphere and the cube in 2, 4, and d dimensions.

2.3 The High-Dimensional Sphere

One of the interesting facts about a unit-radius sphere in high dimensions is that as the dimension increases, the volume of the sphere goes to zero. This has important