# APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder

2 Course objectives: The purpose of this course is to teach efficient algorithms for processing very large datasets. Specifically, we are interested in algorithms that are based on low rank approximation. Given an m n matrix A, it is sometimes possible to build an approximate factorization A E F. m n m k k n In the applications we have in mind, m and n can be very large (think m, n 10 6 for dense matrices, and much larger for sparse matrices), and k min(m, n). Factorizing matrices can be very helpful: Storing A requires mn words of storage. Storing E and F requires km + kn words of storage. Given a vector x, computing Ax requires mn flops. Given a vector x, computing Ax = E(Fx) requires km + kn flops. The factors E and F are often useful for data interpretation.

3 Examples of data interpretation via matrix factorization: Principal Component Analysis: Form an empirical covariance matrix from some collection of statistical data. By computing the singular value decomposition of the matrix, you find the directions of maximal variance. Finding spanning columns or rows: Collect statistical data in a large matrix. By finding a set of spanning columns, you can identify some variables that explain the data. (Say a small collection of genes among a set of recorded genomes, or a small number of stocks in a portfolio.) Relaxed solutions to k-means clustering: Partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Relaxed solutions can be found via the singular value decomposition. PageRank: Form a matrix representing the link structure of the internet. Determine its leading eigenvectors. Related algorithms for graph analysis. Eigenfaces: Form a matrix whose columns represent normalized grayscale images of faces. The eigenfaces are the left singular vectors.

4 Main components of the course: Matrix factorization algorithms designed for a modern computing environment We will discuss algorithms designed to scale correctly for very large matrices when executed on multicore processors, GPUs, parallel computers, distributed ( cloud ) computers, etc. The key is to minimize communication and energy consumption (as opposed to flops). We will extensively explore new randomized algorithms. Applications of matrix factorization We will discuss a range of applications from computational statistics, machine learning, image processing, etc. Techniques include: Principal Componen Analysis, Latent Semantic Indexing, Linear Regression, PageRank, etc. The idea of randomized projections for dimension reduction So called Johnson-Lindenstrauss techniques are useful in linear algebra, algorithm design, and data analysis. We will explore mathematical properties of randomized projections, and their use in different contexts. Other useful scalable algorihms Fast Fourier Transform an extremely versatile and powerful tool that parellelizes well. Applications to signal processing, solving PDEs, image compression, etc. Fast Multipole Method another powerful and scalable algorithm with applications to solving PDEs, simulating gravitation and electrostatics, and much more.

5 Week: Material covered: 1: Low-rank approximation the problem formulation and a brief survey of applications. The Singular Value Decomposition (SVD) and the Eckart-Young theorem on optimality. 2: Power iterations, and Krylov methods. Gram-Schmidt and the QR factorization. How to cheaply get an approximate SVD from a QR factorization. 3: Interpretation of data. The interpolative decomposition (ID) and the CUR decomposition. 4,5: Randomized methods for computing low-rank approximations. Connection to QuickSort and Monte Carlo. 6,7: Applications of low-rank approximation: Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), eigenfaces, pagerank, potential evaluation. 8,9: Linear regression problems. L 2 and L 1 minimization. Brief introduction to linear programming. 10: Johnson-Lindenstrauss methods; random projections as a tool for dimensionality reduction. 11: Nearest neighbor search. 12: Clustering and the k-means problem. 13: The Fast Fourier Transform. 14: The Fast Multipole Method. 15: Project presentations. Note: This is a new course! The actual timeline may deviate from the target listed above some topics may need more or less time.

6 Top Ten Algorithms of the 20th Century 1. Monte Carlo Method (1946): 2. Simplex Method for linear programming (1947): 3. Krylov methods (1950): 4. Decompositional approach to matrix computations (1951): 5. Fortran optimizing compiler (1957): 6. QR algorithm for eigenvalues (1959): 7. Quicksort (1962): 8. Fast Fourier Transform (1965): 9. Integer relation detection algorithm (1977):??? 10. The Fast Multipole Method (1987): Algorithms set in red will be discussed directly in class. Algorithms set in blue will not be covered in any detail, but will be touched upon briefly. From Jan./Feb issue of Computing in Science & Engineering, as compiled by Jack Dongarra and Francis Sullivan posted on course webpage.

7 Many of the algorithms we will describe are based on 20th century algorithms, but have been re-engineered to better solve 21st century problems. Specifically, the course will extensively discuss randomized methods for matrix factorizations. These were developed in response to the need to process matrices for which classical algorithms were not designed. Difficulties include: The matrices involved can be very large (size , say). Constraints on communication few passes over the data. Data stored in slow memory. Parallel processors. Streaming data. Lots of noise hard to discern the signal from the noise. High-quality denoising algorithms tend to require global operations. Can you trade accuracy for speed?

8 Question: Computers get more powerful all the time. Can t we just use the algorithms we have, and wait for computers to get powerful enough?

9 Question: Computers get more powerful all the time. Can t we just use the algorithms we have, and wait for computers to get powerful enough? Observation 1: The size of the problems increase too! Tools for acquiring and storing data are improving at an even faster pace than processors.

10 Question: Computers get more powerful all the time. Can t we just use the algorithms we have, and wait for computers to get powerful enough? Observation 1: The size of the problems increase too! Tools for acquiring and storing data are improving at an even faster pace than processors. The famous deluge of data: documents, web searching, customer databases, hyper-spectral imagery, social networks, gene arrays, proteomics data, sensor networks, financial transactions, traffic statistics (cars, computer networks),...

11 Question: Computers get more powerful all the time. Can t we just use the algorithms we have, and wait for computers to get powerful enough? Observation 1: The size of the problems increase too! Tools for acquiring and storing data are improving at an even faster pace than processors. The famous deluge of data: documents, web searching, customer databases, hyper-spectral imagery, social networks, gene arrays, proteomics data, sensor networks, financial transactions, traffic statistics (cars, computer networks),... Observation 2: Good algorithms are necessary. The flop count must scale close to linearly with problem size.

12 Question: Computers get more powerful all the time. Can t we just use the algorithms we have, and wait for computers to get powerful enough? Observation 1: The size of the problems increase too! Tools for acquiring and storing data are improving at an even faster pace than processors. The famous deluge of data: documents, web searching, customer databases, hyper-spectral imagery, social networks, gene arrays, proteomics data, sensor networks, financial transactions, traffic statistics (cars, computer networks),... Observation 2: Good algorithms are necessary. The flop count must scale close to linearly with problem size. Observation 3: Communication is becoming the real bottleneck. Robust algorithms with good flop counts exist, but many were designed for an environment where you have random access to the data.

13 Question: Computers get more powerful all the time. Can t we just use the algorithms we have, and wait for computers to get powerful enough? Observation 1: The size of the problems increase too! Tools for acquiring and storing data are improving at an even faster pace than processors. The famous deluge of data: documents, web searching, customer databases, hyper-spectral imagery, social networks, gene arrays, proteomics data, sensor networks, financial transactions, traffic statistics (cars, computer networks),... Observation 2: Good algorithms are necessary. The flop count must scale close to linearly with problem size. Observation 3: Communication is becoming the real bottleneck. Robust algorithms with good flop counts exist, but many were designed for an environment where you have random access to the data. Communication speeds improve far more slowly than CPU speed and storage capacity. The capacity of fast memory close to the processor is growing very slowly. Much of the gain in processor and storage capability is attained via parallelization. This poses particular challenges for algorithmic design.

14 Logistics

15 Prerequisites: To take this course, it is essential that you know linear algebra well. A course such as APPM3310 (Matrix Methods) or the equivalent is absolutely necessary, as we will frequently use matrix factorizations such as the SVD and the QR. Familiarity with basic probability is also important (Gaussian and Bernoulli random numbers; the concepts of expectations, standard deviations, etc). Basic numerical analysis (accuracy, stability, floating point arithmetic, etc) is assumed. Knowledge of basic programming in Matlab is required, as is basic knowledge of analysis of algorithms such as estimating asymptotic complexity, etc. Finally, some familiarity with Fourier methods is very helpful. For one part of the course, knowledge of basic electrostatics and the properties of the Laplace and Helmholtz equations will be assumed, but this material covers only one or two weeks, so this is not an essential pre-requisite.

17 Text: The course is defined by the material covered in class, and there is no official text book. Course notes and links to papers discussed in class will be posted on the course website. Lecture notes / scribing: As a participant in the course, you will be required to sign up for two lectures for which you will serve as a scribe. During the lecture when you are a scribe, you will take careful notes, type them up after the class (using latex if at all possible), and them to the instructor within 48h. They will then be posted to the course webpage. A template for the scribe notes can be downloaded from the course webpage.

18 Homeworks: There will be 6 homeworks, due at the end of weeks 3, 5, 7, 9, 11, and 13. Working in groups is allowed and encouraged, with the maximal group size being 3 for 4720 and 2 for Each individual in the course (not each group!) will be required to sign up for one homework problem and be responsible for producing a reference solution. This should be a typed solution, and should include Matlab codes where appropriate. The instructor will review the submitted reference homework, and suggest edits/corrections where appropriate. Once the reference homework is complete, it will be posted to the course webpage as a solution. You are allowed to work in your homework group to prepare the reference homework, but only one student will get credit for the problem. Each regular homework set will be worth 5% of the grade. In addition, your reference homework problem will be worth 10%.

19 Project: Your grade in this course will to 40% be based on a final project. You are allowed (and encouraged!) to work in pairs on the project. Groups of three students could be allowed if the project chosen is particularly labor intensive, but this requires instructor permission. In the last week of the course, each group is expected to deliver a brief (10 15 minutes) presentation of the project, and to hand in a final project report. A number of suggested projects will be listed on the course webpage. You are also very welcome to think of projects on your own; if you want to go with this option, you need to discuss the chosen project with the instructor to get it approved. Please initiate this discussion no later than March 15, if possible. The expectation is that the project is based on material covered in the first two thirds of the course, and will be completed during the last third. You must pick a project and notify the instructor of what your project is by March 19.

### Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015

Syllabus for MATH 191 MATH 191 Topics in Data Science: Algorithms and Mathematical Foundations Department of Mathematics, UCLA Fall Quarter 2015 Lecture: MWF: 1:00-1:50pm, GEOLOGY 4645 Instructor: Mihai

### CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye

CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize

### Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round \$200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

### Bindel, Fall 2012 Matrix Computations (CS 6210) Week 1: Wednesday, Aug 22

Logistics Week 1: Wednesday, Aug 22 We will go over the syllabus in the first class, but in general you should look at the class web page at http://www.cs.cornell.edu/~bindel/class/cs6210-f12/ The web

### SALEM COMMUNITY COLLEGE Carneys Point, New Jersey 08069 COURSE SYLLABUS COVER SHEET. Action Taken (Please Check One) New Course Initiated

SALEM COMMUNITY COLLEGE Carneys Point, New Jersey 08069 COURSE SYLLABUS COVER SHEET Course Title Course Number Department Linear Algebra Mathematics MAT-240 Action Taken (Please Check One) New Course Initiated

### Proposal for Undergraduate Certificate in Large Data Analysis

Proposal for Undergraduate Certificate in Large Data Analysis To: Helena Dettmer, Associate Dean for Undergraduate Programs and Curriculum From: Suely Oliveira (Computer Science), Kate Cowles (Statistics),

### BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

### PCA to Eigenfaces. CS 510 Lecture #16 March 23 th A 9 dimensional PCA example

PCA to Eigenfaces CS 510 Lecture #16 March 23 th 2015 A 9 dimensional PCA example is dark around the edges and bright in the middle. is light with dark vertical bars. is light with dark horizontal bars.

### FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

### Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems

Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001,

### Parallel Computing for Data Science

Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint

### CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

### The Second Undergraduate Level Course in Linear Algebra

The Second Undergraduate Level Course in Linear Algebra University of Massachusetts Dartmouth Joint Mathematics Meetings New Orleans, January 7, 11 Outline of Talk Linear Algebra Curriculum Study Group

### Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

Nimble Algorithms for Cloud Computing Ravi Kannan, Santosh Vempala and David Woodruff Cloud computing Data is distributed arbitrarily on many servers Parallel algorithms: time Streaming algorithms: sublinear

### Linear Algebra. Introduction

Linear Algebra Caren Diefenderfer, Hollins University, chair David Hill, Temple University, lead writer Sheldon Axler, San Francisco State University Nancy Neudauer, Pacific University David Strong, Pepperdine

### ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB. Sohail A. Dianat. Rochester Institute of Technology, New York, U.S.A. Eli S.

ADVANCED LINEAR ALGEBRA FOR ENGINEERS WITH MATLAB Sohail A. Dianat Rochester Institute of Technology, New York, U.S.A. Eli S. Saber Rochester Institute of Technology, New York, U.S.A. (g) CRC Press Taylor

### Numerical Analysis. Professor Donna Calhoun. Fall 2013 Math 465/565. Office : MG241A Office Hours : Wednesday 10:00-12:00 and 1:00-3:00

Numerical Analysis Professor Donna Calhoun Office : MG241A Office Hours : Wednesday 10:00-12:00 and 1:00-3:00 Fall 2013 Math 465/565 http://math.boisestate.edu/~calhoun/teaching/math565_fall2013 What is

### An Application of Linear Algebra to Image Compression

An Application of Linear Algebra to Image Compression Paul Dostert July 2, 2009 1 / 16 Image Compression There are hundreds of ways to compress images. Some basic ways use singular value decomposition

### CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #18: Dimensionality Reduc7on

CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #18: Dimensionality Reduc7on Dimensionality Reduc=on Assump=on: Data lies on or near a low d- dimensional subspace Axes of this subspace

### BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

### Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

### Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics

INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree

### DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

### Big Data Systems CS 5965/6965 FALL 2015

Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html

### Operation Count; Numerical Linear Algebra

10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point

### MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### MSCA 31000 Introduction to Statistical Concepts

MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Master of Mathematical Finance: Course Descriptions

Master of Mathematical Finance: Course Descriptions CS 522 Data Mining Computer Science This course provides continued exploration of data mining algorithms. More sophisticated algorithms such as support

### ANTALYA INTERNATIONAL UNIVERSITY INDUSTRIAL ENGINEERING COURSE DESCRIPTIONS

ANTALYA INTERNATIONAL UNIVERSITY INDUSTRIAL ENGINEERING COURSE DESCRIPTIONS CORE COURSES MATH 101 - Calculus I Trigonometric functions and their basic properties. Inverse trigonometric functions. Logarithmic

### Fast Multipole Method for particle interactions: an open source parallel library component

Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,

### Lecture 19 Camera Matrices and Calibration

Lecture 19 Camera Matrices and Calibration Project Suggestions Texture Synthesis for In-Painting Section 10.5.1 in Szeliski Text Project Suggestions Image Stitching (Chapter 9) Face Recognition Chapter

### Big Data Means at Least Three Different Things. Michael Stonebraker

Big Data Means at Least Three Different Things. Michael Stonebraker The Meaning of Big Data - 3 V s Big Volume With simple (SQL) analytics With complex (non-sql) analytics Big Velocity Drink from a fire

### Face Recognition using Principle Component Analysis

Face Recognition using Principle Component Analysis Kyungnam Kim Department of Computer Science University of Maryland, College Park MD 20742, USA Summary This is the summary of the basic idea about PCA

### A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

### CS 207 - Data Science and Visualization Spring 2016

CS 207 - Data Science and Visualization Spring 2016 Professor: Sorelle Friedler sorelle@cs.haverford.edu An introduction to techniques for the automated and human-assisted analysis of data sets. These

### Modélisation et résolutions numérique et symbolique

Modélisation et résolutions numérique et symbolique via les logiciels Maple et Matlab Jeremy Berthomieu Mohab Safey El Din Stef Graillat Mohab.Safey@lip6.fr Outline Previous course: partial review of what

### Introduction to Machine Learning

Introduction to Machine Learning CS 590 and STAT 598A, Spring 2010 Instructor: S.V. N. Vishwanathan (email: vishy) http://www.stat.purdue.edu/~vishy/introml/introml.html January 12, 2010 S.V. N. Vishwanathan

### ECE 468 / CS 519 Digital Image Processing. Introduction

ECE 468 / CS 519 Digital Image Processing Introduction Prof. Sinisa Todorovic sinisa@eecs.oregonstate.edu ECE 468: Digital Image Processing Instructor: Sinisa Todorovic sinisa@eecs.oregonstate.edu Office:

### Government of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence

Government of Russian Federation Federal State Autonomous Educational Institution of High Professional Education National Research University «Higher School of Economics» Faculty of Computer Science School

### Using the Singular Value Decomposition

Using the Singular Value Decomposition Emmett J. Ientilucci Chester F. Carlson Center for Imaging Science Rochester Institute of Technology emmett@cis.rit.edu May 9, 003 Abstract This report introduces

1 1 Introduction Lecturer: Prof. Aude Billard (aude.billard@epfl.ch) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures

### Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

### Advanced Computational Techniques in Engineering

Course Details Course Code: 5MWCI432 Level: M Tech CSE Semester: Semester 4 Instructor: Prof Dr S P Ghrera Advanced Computational Techniques in Engineering Outline Course Description: Data Distributions,

### Nonlinear Iterative Partial Least Squares Method

Numerical Methods for Determining Principal Component Analysis Abstract Factors Béchu, S., Richard-Plouet, M., Fernandez, V., Walton, J., and Fairley, N. (2016) Developments in numerical treatments for

### Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of

### MATH 304 Linear Algebra Lecture 18: Rank and nullity of a matrix.

MATH 304 Linear Algebra Lecture 18: Rank and nullity of a matrix. Nullspace Let A = (a ij ) be an m n matrix. Definition. The nullspace of the matrix A, denoted N(A), is the set of all n-dimensional column

### Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications

Data Mining In Modern Astronomy Sky Surveys: Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications Ching-Wa Yip cwyip@pha.jhu.edu; Bloomberg 518 Human are Great Pattern Recognizers

### Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

### November 28 th, Carlos Guestrin. Lower dimensional projections

PCA Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University November 28 th, 2007 Lower dimensional projections Rather than picking a subset of the features, we can new features that are combinations

### Introduction to Machine Learning

Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Instructor: Erik Sudderth Graduate TAs: Dae Il Kim & Ben Swanson Head Undergraduate TA: William Allen Undergraduate TAs: Soravit

### Numerical Algorithms Group. Embedded Analytics. A cure for the common code. www.nag.com. Results Matter. Trust NAG.

Embedded Analytics A cure for the common code www.nag.com Results Matter. Trust NAG. Executive Summary How much information is there in your data? How much is hidden from you, because you don t have access

### The Data Mining Process

Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

### 67 204 Mathematics for Business Analysis I Fall 2007

67 204 Mathematics for Business Analysis I Fall 2007 Instructor Asõkā Rāmanāyake Office: Swart 223 Office Hours: Monday 12:40 1:40 Wednesday 8:00 9:00 Thursday 9:10 11:20 If you cannot make my office hours,

### Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

CS535 Big Data W1.A.1 CS535 BIG DATA W1.A.2 Let the data speak to you Medication Adherence Score How likely people are to take their medication, based on: How long people have lived at the same address

### Principal components analysis

CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

Collaborative Filtering Radek Pelánek 2015 Collaborative Filtering assumption: users with similar taste in past will have similar taste in future requires only matrix of ratings applicable in many domains

### Unsupervised and supervised dimension reduction: Algorithms and connections

Unsupervised and supervised dimension reduction: Algorithms and connections Jieping Ye Department of Computer Science and Engineering Evolutionary Functional Genomics Center The Biodesign Institute Arizona

### Principal Component Analysis Application to images

Principal Component Analysis Application to images Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception http://cmp.felk.cvut.cz/

### A Introduction to Matrix Algebra and Principal Components Analysis

A Introduction to Matrix Algebra and Principal Components Analysis Multivariate Methods in Education ERSH 8350 Lecture #2 August 24, 2011 ERSH 8350: Lecture 2 Today s Class An introduction to matrix algebra

### Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

### Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Linear Algebra Review Part 2: Ax=b

Linear Algebra Review Part 2: Ax=b Edwin Olson University of Michigan The Three-Day Plan Geometry of Linear Algebra Vectors, matrices, basic operations, lines, planes, homogeneous coordinates, transformations

### Numerical Analysis Introduction. Student Audience. Prerequisites. Technology.

Numerical Analysis Douglas Faires, Youngstown State University, (Chair, 2012-2013) Elizabeth Yanik, Emporia State University, (Chair, 2013-2015) Graeme Fairweather, Executive Editor, Mathematical Reviews,

### Factor analysis. Angela Montanari

Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

### Text Analytics (Text Mining)

CSE 6242 / CX 4242 Apr 3, 2014 Text Analytics (Text Mining) LSI (uses SVD), Visualization Duen Horng (Polo) Chau Georgia Tech Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey

### Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

### Principal Component Analysis

Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded

### Exploratory Data Analysis with MATLAB

Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton

### Linear Algebra Review. Vectors

Linear Algebra Review By Tim K. Marks UCSD Borrows heavily from: Jana Kosecka kosecka@cs.gmu.edu http://cs.gmu.edu/~kosecka/cs682.html Virginia de Sa Cogsci 8F Linear Algebra review UCSD Vectors The length

### LINEAR SYSTEMS. Consider the following example of a linear system:

LINEAR SYSTEMS Consider the following example of a linear system: Its unique solution is x +2x 2 +3x 3 = 5 x + x 3 = 3 3x + x 2 +3x 3 = 3 x =, x 2 =0, x 3 = 2 In general we want to solve n equations in

### Numerical Solution of Linear Systems

Numerical Solution of Linear Systems Chen Greif Department of Computer Science The University of British Columbia Vancouver B.C. Tel Aviv University December 17, 2008 Outline 1 Direct Solution Methods

### Going Big in Data Dimensionality:

LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

### Parallelism and Cloud Computing

Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

### The Scientific Data Mining Process

Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

### Data, Measurements, Features

Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

### by the matrix A results in a vector which is a reflection of the given

Eigenvalues & Eigenvectors Example Suppose Then So, geometrically, multiplying a vector in by the matrix A results in a vector which is a reflection of the given vector about the y-axis We observe that

### Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

### Report Paper: MatLab/Database Connectivity

Report Paper: MatLab/Database Connectivity Samuel Moyle March 2003 Experiment Introduction This experiment was run following a visit to the University of Queensland, where a simulation engine has been

### UF EDGE brings the classroom to you with online, worldwide course delivery!

What is the University of Florida EDGE Program? EDGE enables engineering professional, military members, and students worldwide to participate in courses, certificates, and degree programs from the UF

### 10.3 POWER METHOD FOR APPROXIMATING EIGENVALUES

55 CHAPTER NUMERICAL METHODS. POWER METHOD FOR APPROXIMATING EIGENVALUES In Chapter 7 we saw that the eigenvalues of an n n matrix A are obtained by solving its characteristic equation n c n n c n n...

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### arxiv:1007.5510v2 [stat.co] 19 Mar 2011

AN ALGORITHM FOR THE PRINCIPAL COMPONENT ANALYSIS OF LARGE DATA SETS NATHAN HALKO, PER-GUNNAR MARTINSSON, YOEL SHKOLNISKY, AND MARK TYGERT arxiv:1007.5510v2 [stat.co] 19 Mar 2011 Abstract. Recently popularized

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I

### 203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

### PTE505: Inverse Modeling for Subsurface Flow Data Integration (3 Units)

PTE505: Inverse Modeling for Subsurface Flow Data Integration (3 Units) Instructor: Behnam Jafarpour, Mork Family Department of Chemical Engineering and Material Science Petroleum Engineering, HED 313,

### SYMMETRIC EIGENFACES MILI I. SHAH

SYMMETRIC EIGENFACES MILI I. SHAH Abstract. Over the years, mathematicians and computer scientists have produced an extensive body of work in the area of facial analysis. Several facial analysis algorithms

### Lecture 5: Singular Value Decomposition SVD (1)

EEM3L1: Numerical and Analytical Techniques Lecture 5: Singular Value Decomposition SVD (1) EE3L1, slide 1, Version 4: 25-Sep-02 Motivation for SVD (1) SVD = Singular Value Decomposition Consider the system

### Scenario: Optimization of Conference Schedule.

MINI PROJECT 1 Scenario: Optimization of Conference Schedule. A conference has n papers accepted. Our job is to organize them in a best possible schedule. The schedule has p parallel sessions at a given

### Similar matrices and Jordan form

Similar matrices and Jordan form We ve nearly covered the entire heart of linear algebra once we ve finished singular value decompositions we ll have seen all the most central topics. A T A is positive

### Soft Clustering with Projections: PCA, ICA, and Laplacian

1 Soft Clustering with Projections: PCA, ICA, and Laplacian David Gleich and Leonid Zhukov Abstract In this paper we present a comparison of three projection methods that use the eigenvectors of a matrix

### High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

### MODULE 15 Clustering Large Datasets LESSON 34

MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data

### 8/29/2015. Data Mining and Machine Learning. Erich Seamon University of Idaho

Data Mining and Machine Learning Erich Seamon University of Idaho www.webpages.uidaho.edu/erichs erichs@uidaho.edu 1 I am NOMAD 2 1 Data Mining and Machine Learning Outline Outlining the data mining and