Nonnegative Matrix Factorization: Algorithms, Complexity and Applications

Ankur Moitra, Massachusetts Institute of Technology. July 6th, 2015.

[Figure: an m × n nonnegative matrix M factored as M = AW; A has r columns and W has r rows (the inner dimension), and both factors are required to be entrywise nonnegative.]

Equivalent Definitions

The nonnegative rank rank_+(M) can be defined in many ways:

- The smallest r such that there is a factorization M = AW where A and W are nonnegative and have inner dimension r
- The smallest r such that there are r nonnegative vectors A_1, A_2, ..., A_r and every column of M can be represented as a nonnegative combination of them
- The smallest r such that M = M^(1) + M^(2) + ... + M^(r), where each M^(i) is rank one and nonnegative

Always rank(M) <= rank_+(M), but the nonnegative rank can be much larger too!
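For concreteness, here is a small NumPy check, an illustration added in editing rather than a slide, using a matrix that is standard in the literature: its rank is 3, while a rectangle-covering argument on its support shows its nonnegative rank is 4.

```python
import numpy as np

# Each rank-one nonnegative term has a combinatorial rectangle as its
# support, and the support of this M cannot be covered by fewer than 4
# such rectangles, so rank_+(M) >= 4.
M = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]])

print(np.linalg.matrix_rank(M))          # prints 3

# A trivial nonnegative factorization with inner dimension 4 (A = I, W = M)
# certifies rank_+(M) <= 4, so here rank_+(M) = 4 > rank(M) = 3.
A, W = np.eye(4), M.astype(float)
assert np.allclose(A @ W, M)
```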

Information Retrieval

[Figure: M is an m × n word-by-document matrix; in the factorization M = AW, the columns of A are "topics" and the columns of W give each document's "representation" in terms of those topics.]

Applications

Statistics and Machine Learning: extract latent relationships in data (image segmentation, text classification, information retrieval, collaborative filtering, ...) [Lee, Seung], [Xu et al], [Hofmann], [Kumar et al], [Kleinberg, Sandler]

Combinatorics: extended formulations and the log-rank conjecture [Yannakakis], [Lovász, Saks]

Physical Modeling: the interaction of components is additive (visual recognition, environmetrics)

Local Search: given A, compute W; given W, compute A; ...

- Known to fail on worst-case inputs (it gets stuck in local minima)
- Highly sensitive to the cost function, the regularization, and the update procedure

Question: Are there efficient, provable algorithms to compute the nonnegative rank? Can it even be computed in finite time?

Theorem (Cohen, Rothblum): There is an exact algorithm (based on solving systems of polynomial inequalities) that runs in time exponential in n, m and r.
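For concreteness, a minimal sketch of this local-search template, assuming the Frobenius-norm objective and multiplicative updates in the style of [Lee, Seung]:

```python
import numpy as np

def nmf_local_search(M, r, iters=500, eps=1e-9, seed=0):
    """Alternating multiplicative updates for min ||M - AW||_F^2.
    The objective decreases monotonically, but the iteration can stall
    in a local minimum, exactly the failure mode noted above."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    A = rng.random((m, r))
    W = rng.random((r, n))
    for _ in range(iters):
        W *= (A.T @ M) / (A.T @ A @ W + eps)   # given A, improve W
        A *= (M @ W.T) / (A @ W @ W.T + eps)   # given W, improve A
    return A, W
```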

Hardness of NMF

Theorem (Vavasis): NMF is NP-hard.

Hence it is unlikely that there is an exact algorithm that runs in time polynomial in n, m and r.

Question: Should we expect r to be large? What if you gave me a collection of 100 documents and I told you there are 75 topics? How quickly can we solve NMF when r is small?

The Worst-Case Complexity of NMF

Theorem (Arora, Ge, Kannan, Moitra): There is an (nm)^{O(2^r r^2)}-time exact algorithm for NMF.

Previously, the fastest provable algorithm even for r = 4 ran in time exponential in n and m.

Can we improve the exponential dependence on r?

Theorem (Arora, Ge, Kannan, Moitra): An exact algorithm for NMF that runs in time (nm)^{o(r)} would yield a sub-exponential time algorithm for 3-SAT.

An Almost Optimal Algorithm

Theorem (Moitra): There is a (2^r nm)^{O(r^2)}-time exact algorithm for NMF.

These algorithms are based on methods for variable reduction.

Question: How many variables are needed to express a decision problem as a feasibility problem for a system of polynomial inequalities?

Open Question: Is the true complexity of NMF exponential in r^2 or in r?

Outline

- Introduction
- Algebraic Algorithms for NMF: Systems of Polynomial Inequalities; Methods for Variable Reduction
- Hardness Results: Intermediate Simplex Problem; Geometric Gadgets
- Separable NMF: The Anchor Words Algorithm; Applications to Topic Modeling

Algebraic Algorithms for NMF: Systems of Polynomial Inequalities

Is NMF Computable? [Cohen, Rothblum]: Yes. (DETOUR)

Semi-algebraic sets: s polynomials f_1, ..., f_s in k variables, and a Boolean function B:

S = { (x_1, x_2, ..., x_k) : B(sgn(f_1), sgn(f_2), ..., sgn(f_s)) = true }

Question: How many sign patterns arise as (x_1, x_2, ..., x_k) ranges over R^k?

Naive bound: 3^s (all of {-1, 0, +1}^s). [Milnor, Warren]: at most (ds)^k, where d is the maximum degree. In fact, the best known algorithms (e.g. [Renegar]) for finding a point in S run in time (ds)^{O(k)}.
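A quick numerical illustration (my own experiment, not from the talk): sample inputs to s random quadratics and count the sign patterns that actually occur; the count stays under the (ds)^k bound, far below 3^s.

```python
import numpy as np

rng = np.random.default_rng(0)
s, k, d = 8, 2, 2   # 8 random quadratics in 2 variables

# f(x) = x^T Q x + b.x + c for random Q, b, c
polys = [(rng.standard_normal((k, k)),
          rng.standard_normal(k),
          rng.standard_normal()) for _ in range(s)]

patterns = set()
for _ in range(100_000):
    x = rng.standard_normal(k) * 5
    patterns.add(tuple(int(np.sign(x @ Q @ x + b @ x + c))
                       for Q, b, c in polys))

# Observed sign patterns vs the Milnor-Warren bound (ds)^k and the naive 3^s
print(len(patterns), "observed;", (d * s) ** k, "Milnor-Warren;", 3 ** s, "naive")
```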

Is NMF Computable? [Cohen, Rothblum]: Yes. (end of DETOUR)

Variables: the entries of A and W (nr + mr in total)
Constraints: A, W >= 0 and AW = M (degree two)

The running time of such a solver is exponential in the number of variables.

Question: What is the smallest formulation, measured by the number of variables? Can we use only f(r) variables? O(r^2) variables?
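As a sketch of what the [Cohen, Rothblum] formulation looks like symbolically (illustration only; sympy is an arbitrary choice here and no solver is invoked):

```python
import sympy as sp

def cohen_rothblum_system(M, r):
    """The [Cohen, Rothblum] formulation as a symbolic system: the entries
    of A (m x r) and W (r x n) are the variables (mr + nr of them), with
    degree-two equalities AW = M and linear inequalities A, W >= 0."""
    m, n = M.shape
    A = sp.MatrixSymbol('A', m, r)
    W = sp.MatrixSymbol('W', r, n)
    equalities = [sp.Eq((A * W)[i, j], M[i, j])
                  for i in range(m) for j in range(n)]
    inequalities = [A[i, k] >= 0 for i in range(m) for k in range(r)] + \
                   [W[k, j] >= 0 for k in range(r) for j in range(n)]
    return equalities, inequalities

# Example: a 2x2 matrix and target r = 1.
eqs, ineqs = cohen_rothblum_system(sp.Matrix([[1, 2], [2, 4]]), 1)
```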

Algebraic Algorithms for NMF: Methods for Variable Reduction

Easy Case: A has Full Column Rank (AGKM)

[Figure: if the columns of A are linearly independent, the pseudo-inverse A^+ satisfies A^+ A = I_r, so A^+ M = A^+ AW = W. In other words, W can be recovered from M by a change of basis T.]

Putting it Together: Easy Case

[Figure: treat the entries of the change-of-basis transformations T and S as the variables; applying T to M should give a nonnegative W, applying S to M should give a nonnegative A, and the product of the two candidate factors must equal M.]

General Structure Theorem

Question: What if A does not have full column rank?

Approach: guess many linear transformations, one for each pseudo-inverse of a set of linearly independent columns.

[Figure: the convex hull of the columns of M, with columns A_1, A_2, A_3, A_4 of A as extreme points; different subsets of linearly independent columns give different pseudo-inverse transformations, e.g. T_1 = (A_1 A_2 A_3)^+ and T_2 = (A_2 A_3 A_4)^+. Exponentially many choices?]

Problem: Which linear transformation should we use (for a given column of M)?

[Figure: the same picture; for any fixed column of M, which simplex of columns of A it falls into, and hence which transformation T_1 or T_2 applies, comes down to polynomially many choices.]

Observation: Which linear transformation to use depends only on a partition of space defined by at most 2^r r^2 half-spaces in r-dimensional space.

Putting it Together: The General Case

We can brute-force search over all such decision rules and solve a system of polynomial inequalities for each.

Claim (Soundness): If any resulting system has a solution, it yields a nonnegative matrix factorization.

Claim (Completeness): If there is a nonnegative matrix factorization, at least one of the systems has a solution.

Algebraic Interpretation

Theorem (Arora, Ge, Kannan, Moitra): Given a nonnegative matrix M and a target r, there is a one-to-many reduction to a set of (nm)^{2^r r^2} systems of polynomial inequalities, each in 2^r r^2 variables.

This reduction is (essentially) based on covering the convex hull of the columns of A by 2^r simplices.

Is there a more efficient reduction that uses fewer variables?

Problem: If A does not have full column rank, we may need as many as 2^r/(r+1) simplices to cover it (and only it).

The Cross Polytope

Let K_r = { x in R^r : |x_1| + |x_2| + ... + |x_r| <= 1 } = conv{ ±e_i }.

Fact: K_r has 2r vertices and 2^r facets (it is dual to the hypercube).
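A tiny sanity check of this fact, enumerating the vertices and facet inequalities of K_r for a small r (illustration only):

```python
import itertools
import numpy as np

r = 4
# Vertices: +/- e_i (2r of them). Facets: sum_i c_i x_i <= 1, c in {-1,+1}^r (2^r).
vertices = [s * e for e in np.eye(r) for s in (1.0, -1.0)]
facets = list(itertools.product([-1.0, 1.0], repeat=r))

assert len(vertices) == 2 * r and len(facets) == 2 ** r
# Every vertex satisfies every facet inequality (with equality on half of them).
assert all(np.dot(c, v) <= 1 + 1e-12 for c in facets for v in vertices)
```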

Question: If we want to cover K_r with simplices (but cover nothing else), how many simplices do we need?

Each simplex covers at most r + 1 facets, so we need at least 2^r/(r+1) simplices.

In our setting, this means we really do need exponentially many linear transformations to recover the factorization from our input. However, these linear transformations are algebraically dependent!

Fact (Cramer's Rule): If R is invertible, the entries of R^{-1} are ratios of polynomials in the entries of R. More precisely, (R^{-1})_{i,j} = (-1)^{i+j} det(R_{j,i}) / det(R), where R_{j,i} denotes R with row j and column i removed.

Claim: We can express the entries of the 2^r unknown linear transformations as ratios of polynomials in the unknown entries of any invertible submatrix of A and W. Thus we need only 2r^2 variables to express all the unknown linear transformations.
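A direct numerical check of the entrywise formula (illustration only):

```python
import numpy as np

def inverse_via_cramer(R):
    """(R^{-1})[i, j] = (-1)^{i+j} det(R with row j and column i removed) / det(R)."""
    n = R.shape[0]
    det_R = np.linalg.det(R)
    inv = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(R, j, axis=0), i, axis=1)
            inv[i, j] = (-1) ** (i + j) * np.linalg.det(minor) / det_R
    return inv

R = np.random.default_rng(0).random((4, 4))
assert np.allclose(inverse_via_cramer(R), np.linalg.inv(R))
```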

Summary of Variable Reduction

How many variables do we need to express the decision problem rank_+(M) <= r?

[Cohen, Rothblum]: mr + nr
[Arora, Ge, Kannan, Moitra]: 2r^2 2^r
[Moitra]: 2r^2

Corollary: There is a (2^r nm)^{O(r^2)}-time algorithm for NMF.

Are there other applications of variable reduction in geometric problems?

Hardness Results: Intermediate Simplex Problem

Other Algorithms?

Question: Are there other algorithms (perhaps non-algebraic) that can solve NMF much faster?

Probably not, under standard complexity assumptions.

Theorem (Vavasis): NMF is NP-hard.

Open Question: Is NMF NP-hard when restricted to Boolean matrices?

Intermediate Simplex Problem:

Input: polytopes P ⊆ Q, where P is encoded by its vertices and Q is encoded by its facets. Is there a simplex K such that P ⊆ K ⊆ Q?

Theorem (Vavasis): The intermediate simplex problem and a special case of nonnegative matrix factorization are polynomial-time inter-reducible.

How can we construct geometric gadgets that encode SAT?
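A sketch of a verifier for the stated encodings (the function and variable names here are my own): P ⊆ K reduces to LP feasibility, and K ⊆ Q to plugging the vertices of K into the facet inequalities.

```python
import numpy as np
from scipy.optimize import linprog

def in_hull(point, vertices):
    """LP feasibility: is `point` a convex combination of the rows of `vertices`?"""
    k = vertices.shape[0]
    res = linprog(c=np.zeros(k),
                  A_eq=np.vstack([vertices.T, np.ones((1, k))]),
                  b_eq=np.append(point, 1.0),
                  bounds=[(0, None)] * k)
    return res.status == 0

def is_intermediate_simplex(P_vertices, K_vertices, H, h):
    """Check P subset K subset Q, where P and K are given by their vertices
    and Q by facet inequalities H x <= h (the encodings in the problem input)."""
    p_in_K = all(in_hull(p, K_vertices) for p in P_vertices)
    K_in_Q = bool(np.all(H @ K_vertices.T <= h[:, None] + 1e-9))
    return p_in_K and K_in_Q
```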

Hardness Results: Geometric Gadgets

Vavasis's gadget: place three points! [Figure: the gadget region.]

Each gadget represents a variable and forces it to be true or false. But this approach requires r to be large (at least the number of variables in the SAT formula).

Question: Are there lower-dimensional gadgets that can be used to encode hard problems?

d-SUM Problem:

Input: n numbers s_1, s_2, ..., s_n in R. Is there a set U ⊆ [n] of d distinct indices such that sum_{i in U} s_i = 0?

Hardness for d-SUM

The best known algorithms for d-SUM run in time roughly n^{d/2}.

Theorem (Patrascu, Williams): If d-SUM can be solved in time n^{o(d)}, then there are sub-exponential time algorithms for 3-SAT.

Conjecture (Impagliazzo, Paturi): 3-SAT on n variables cannot be solved in time 2^{o(n)}.

Can we use d-SUM as our source of hardness?
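For reference, a naive decision procedure for d-SUM (illustration; the n^{d/2} algorithms use meet-in-the-middle over half-size subsets instead of trying all d-subsets):

```python
import itertools

def d_sum(nums, d, tol=1e-12):
    """Naive O(n^d) decision procedure for d-SUM: do some d distinct indices
    carry values summing to zero? Meet-in-the-middle over half-size subsets
    brings this down to roughly n^(d/2), matching the bound quoted above."""
    return any(abs(sum(combo)) <= tol
               for combo in itertools.combinations(nums, d))

assert d_sum([3.0, -1.5, 7.0, -1.5, 2.0], 3)   # 3 - 1.5 - 1.5 = 0
assert not d_sum([1.0, 2.0, 4.0], 2)
```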

[Figure: Vavasis's "place three points" gadget next to the d-SUM gadget: a region of width eps in which points encoding the numbers s_1, s_2, s_3 must be placed.]

Summary of Hardness Results

Theorem (Vavasis): NMF is NP-hard.

Theorem (Arora, Ge, Kannan, Moitra): An algorithm for NMF that runs in time (nm)^{o(r)} would yield a sub-exponential time algorithm for 3-SAT.

Open Question: Is it hard to solve NMF in time (nm)^{o(r^2)}?

Beyond Worst-Case Analysis

We (essentially) resolved the worst-case complexity of NMF, but there are still paths to obtaining better algorithms.

Question: What distinguishes a realistic instance of NMF from an artificial one?

Question: Are there practical algorithms with provable guarantees?

Recall that in some applications the columns of A represent distributions on words (topics) and the columns of W represent distributions on topics.

[Donoho, Stodden]: A is separable if for each column i there is an (unknown) row π(i) whose only nonzero entry is in column i.

[Figure: the matrix A, with one such row per column highlighted.]

Interpretation: We call such a row an anchor word; any document that contains this word is very likely to be (at least partially) about the corresponding topic. E.g. personal finance → 401k, baseball → bunt, ...

Question: Separability was introduced to understand when NMF is unique. Is it enough to make NMF easy?

[Figure: the factorization M = AW.]

Separable Instances

Recall: for each topic, there is some (anchor) word that appears only in that topic.

Observation: The rows of W appear as (scaled) rows of M.

Question: How can we identify the anchor words?

[Figure: M = AW; brute-force search over which rows of M are the (scaled) rows of W takes about n^r time.]

Removing a row from M strictly changes the convex hull of the rows iff it is an anchor word. Hence we can identify all the anchor words via linear programming (and this can be made robust to noise).
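A minimal sketch of this test (my own illustration; the noise-robust version from the paper is more involved): row i is an anchor row iff it is not a convex combination of the other normalized rows.

```python
import numpy as np
from scipy.optimize import linprog

def anchor_rows(M):
    """Return the rows of M that are vertices of the convex hull of the
    (normalized) rows: exactly the anchor-word rows in the separable case."""
    X = M / M.sum(axis=1, keepdims=True)   # normalize rows to distributions
    anchors = []
    for i in range(X.shape[0]):
        others = np.delete(X, i, axis=0)
        k = others.shape[0]
        # LP feasibility: is X[i] a convex combination of the other rows?
        res = linprog(c=np.zeros(k),
                      A_eq=np.vstack([others.T, np.ones((1, k))]),
                      b_eq=np.append(X[i], 1.0),
                      bounds=[(0, None)] * k)
        if res.status != 0:                # infeasible: removing row i shrinks the hull
            anchors.append(i)
    return anchors

# Demo on a small separable instance: rows 0, 1, 2 of A are the anchor words.
rng = np.random.default_rng(0)
A = np.vstack([np.eye(3), rng.random((4, 3))])   # 7 words, 3 topics
W = rng.random((3, 5))                           # 5 documents
print(anchor_rows(A @ W))                        # expected: [0, 1, 2]
```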

Separable NMF: Applications to Topic Modeling

Topic Models

Topic matrix A; generate W stochastically. For each document (a column of M = AW), sample N words.

E.g. Latent Dirichlet Allocation (LDA), the Correlated Topic Model (CTM), the Pachinko Allocation Model (PAM), ...

Question: Can we estimate A given random samples from M?

Yes! [Arora, Ge, Moitra]: we give a provable algorithm based on (noise-tolerant) NMF; see also [Anandkumar et al].

Danger: I am about to show you experimental results.

Running on Real Data (joint work with Arora, Ge, Halpern, Mimno, Sontag, Wu and Zhu)

We ran our algorithm on a database of 300,000 New York Times articles (from the UCI database) with 30,000 distinct words.

Run time: 12 minutes, compared to 10 hours for MALLET and other state-of-the-art topic modeling tools.

The topics are high quality (because we can handle correlations in the topic model?).

Let's check out the results!

Concluding Remarks

Our algorithms are based on answering a purely algebraic question: how many variables do we need in a semi-algebraic set to encode the nonnegative rank?

Question: Are there other examples where a better understanding of the expressive power of semi-algebraic sets leads to new algorithms?

Observation: The number of variables plays a role analogous to that of VC-dimension.

Is there an elementary proof of the Milnor-Warren bound?

Any Questions?

Thanks!