Tackling Big Data with Tensor Methods



Similar documents
Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Machine Learning for Data Science (CS4786) Lecture 1

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Social Media Mining. Data Mining Essentials

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

Data Mining Techniques

: Introduction to Machine Learning Dr. Rita Osadchy

Statistical Models in Data Mining

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Time series experiments

I Want To Start A Business : Getting Recommendation on Starting New Businesses Based on Yelp Data

MS1b Statistical Data Mining

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Data, Measurements, Features

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

An Introduction to Data Mining

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Using Data Mining for Mobile Communication Clustering and Characterization

Course: Model, Learning, and Inference: Lecture 5

Character Image Patterns as Big Data

Introduction to Data Mining

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015

Clustering Connectionist and Statistical Language Processing

Chapter 7. Cluster Analysis

Probabilistic Latent Semantic Analysis (plsa)

How To Cluster

Classification algorithm in Data mining: An Overview

SOCIAL NETWORK ANALYSIS EVALUATING THE CUSTOMER S INFLUENCE FACTOR OVER BUSINESS EVENTS

The Data Mining Process

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Cluster Analysis: Advanced Concepts

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Machine Learning for natural language processing

Machine Learning with MATLAB David Willingham Application Engineer

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Neural Networks Lesson 5 - Cluster Analysis

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

EXPLORING SPATIAL PATTERNS IN YOUR DATA

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

OUTLIER ANALYSIS. Data Mining 1

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Information Retrieval and Web Search Engines

Social Network Mining

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Statistics Graduate Courses

Multidimensional data analysis

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Machine Learning Introduction

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

A SURVEY OF TEXT CLASSIFICATION ALGORITHMS

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Neuroimaging module I: Modern neuroimaging methods of investigation of the human brain in health and disease

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Collective Behavior Prediction in Social Media. Lei Tang Data Mining & Machine Learning Group Arizona State University

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Categorical Data Visualization and Clustering Using Subjective Factors

Visualization Software

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

A comparison of various clustering methods and algorithms in data mining

Dissertation TOPIC MODELS FOR IMAGE RETRIEVAL ON LARGE-SCALE DATABASES. Eva Hörster

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Machine Learning Logistic Regression

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Mixtures of Robust Probabilistic Principal Component Analyzers

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Visualization of large data sets using MDS combined with LVQ.

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Information Management course

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining and Machine Learning in Bioinformatics

IJCSES Vol.7 No.4 October 2013 pp Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

A Statistical Text Mining Method for Patent Analysis

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

Using multiple models: Bagging, Boosting, Ensembles, Forests

Towards better accuracy for Spam predictions

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

A Survey on Pre-processing and Post-processing Techniques in Data Mining

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Nimble Algorithms for Cloud Computing. Ravi Kannan, Santosh Vempala and David Woodruff

ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION

Principles of Data Mining by Hand&Mannila&Smyth

Transcription:

Tackling Big Data with Tensor Methods Anima Anandkumar U.C. Irvine

Learning with Big Data

Data vs. Information

Data vs. Information

Data vs. Information Missing observations, gross corruptions, outliers.

Data vs. Information Missing observations, gross corruptions, outliers. High dimensional regime: as data grows, more variables!

Data vs. Information Missing observations, gross corruptions, outliers. High dimensional regime: as data grows, more variables! Data deluge also a data desert!

Learning in High Dimensional Regime Useful information: low-dimensional structures. Learning with big data: ill-posed problem.

Learning in High Dimensional Regime Useful information: low-dimensional structures. Learning with big data: ill-posed problem. Learning is finding needle in a haystack

Learning in High Dimensional Regime Useful information: low-dimensional structures. Learning with big data: ill-posed problem. Learning is finding needle in a haystack Learning with big data: computationally challenging! Principled approaches for finding low dimensional structures?

How to model information structures? Latent variable models Incorporate hidden or latent variables. Information structures: Relationships between latent variables and observed data.

How to model information structures? Latent variable models Incorporate hidden or latent variables. Information structures: Relationships between latent variables and observed data. Basic Approach: mixtures/clusters Hidden variable is categorical.

How to model information structures? Latent variable models Incorporate hidden or latent variables. Information structures: Relationships between latent variables and observed data. Basic Approach: mixtures/clusters Hidden variable is categorical. Advanced: Probabilistic models Hidden variables have more general distributions. Can model mixed membership/hierarchical groups. h 1 h 2 h 3 x 1 x 2 x 3 x 4 x 5

Application 1: Clustering Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group.

Application 1: Clustering Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint The groups represent different distributions. (e.g. Gaussian). Each data point is drawn from one of the given distributions. (e.g. Gaussian mixtures).

Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization.

Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of actors.

Application 4: Recommender Systems Recommender System Observed: Ratings of users for various products, e.g. yelp reviews. Goal: Predict new recommendations. Modeling: Find groups/communities of users and products.

Application 5: Feature Learning Feature Engineering Learn good features/representations for classification tasks, e.g. image and speech recognition. Sparse representations, low dimensional hidden structures.

Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups

Learning Algorithms through Tensor Factorization vs. Co-occurrence of three-words in a document, e.g. [apple, orange, banana]. Tensor Eigenvectors 1 Can learn the hidden topics by finding tensor eigenvectors. Common friends (neighbors) of triplets of nodes in a social networks. 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2

Experimental Results on Yelp Lowest error business categories & largest weight businesses Rank Category Business Stars Review Counts 1 Latin American Salvadoreno Restaurant 4.0 36 2 Gluten Free P.F. Chang s China Bistro 3.5 55 3 Hobby Shops Make Meaning 4.5 14 4 Mass Media KJZZ 91.5FM 4.0 13 5 Yoga Sutra Midtown 4.5 31

Experimental Results on Yelp Lowest error business categories & largest weight businesses Rank Category Business Stars Review Counts 1 Latin American Salvadoreno Restaurant 4.0 36 2 Gluten Free P.F. Chang s China Bistro 3.5 55 3 Hobby Shops Make Meaning 4.5 14 4 Mass Media KJZZ 91.5FM 4.0 13 5 Yoga Sutra Midtown 4.5 31 Bridgeness: Distance from vector [1/ˆk,...,1/ˆk] Top-5 bridging nodes (businesses) Business Four Peaks Brewing Pizzeria Bianco FEZ Matt s Big Breakfast Cornish Pasty Co Categories Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe Restaurants, Pizza, Phoenix Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix Restaurants, Phoenix, Breakfast& Brunch Restaurants, Bars, Nightlife, Pubs, Tempe

My Research Group and Resources Furong Huang Majid Janzamin Hanie Sedghi Niranjan UN Forough Arabshahi ML summer school lectures available at http://newport.eecs.uci.edu/anandkumar/mlss.html