Big Data, Machine Learning, Causal Models




Big Data, Machine Learning, Causal Models
Sargur N. Srihari
University at Buffalo, The State University of New York, USA
Int. Conf. on Signal and Image Processing, Bangalore, January 2014

Plan of Discussion
1. Big Data: sources, analytics
2. Machine Learning: problem types
3. Causal Models: representation, learning

Big Data
- Moore's law: the world's digital content doubles in 18 months
- 2.5 exabytes (10^18 bytes, a quintillion) of data created daily
- Large and complex data: social media text (10M tweets/day) and images; remote sensing; wireless networks; Internet search; meteorology; astronomy
- Limitations due to big data: energy (fracking with images, sound, data)

Dimensions of Big Data (IBM)
- Volume: convert 1.2 terabytes (1 TB = 1,000 GB) each day to sentiment analysis
- Velocity: analyze 5M trade events/day to detect fraud
- Variety: monitor hundreds of live video feeds

Big Data Analytics
- Descriptive analytics: what happened? (models)
- Predictive analytics: what will happen? (predict a class)
- Prescriptive analytics: how to improve the predicted future? (intervention)

Machine Learning and Big Data Analytics
1. A perfect marriage: machine learning provides computational/statistical methods to model data
2. Types of problems solved:
   - Predictive classification: sentiment
   - Predictive regression: LETOR (learning to rank)
   - Collective classification: speech
   - Missing-data estimation: clustering
   - Intervention: causal models

Role of Machine Learning
- Problems involving uncertainty: perceptual data (images, text, speech, video)
- Information overload: large volumes of training data; limitations of human cognitive ability; correlations hidden among many features
- Constantly changing data streams: a search engine constantly needs to adapt
- Software advances will come from ML: a principled way to build high-performance systems

Need for Probability Models
- Uncertainty is ubiquitous
  - Future: can never be predicted with certainty, e.g., weather, stock prices
  - Present and past: important aspects are not observed with certainty, e.g., rainfall, corporate data
- Tools
  - Probability theory: originating in the 17th century (Pascal), later developed by Laplace
  - PGMs (probabilistic graphical models), for effective use with large numbers of variables: late 20th century

History of ML
- First generation (1960-1980): perceptrons, nearest-neighbor, naive Bayes
  - Special hardware, limited performance
- Second generation (1980-2000): ANNs, Kalman filters, HMMs, SVMs
  - Handwritten addresses, speech recognition, postal words (USPS-MLOCR, USPS-RCR)
  - Difficult to include domain knowledge; black-box models fitted to large data sets
- Third generation (2000-present): PGMs, fully Bayesian methods, Gaussian processes
  - Image segmentation, text analytics (named-entity tagging)
  - Expert prior knowledge combined with statistical models

Role of PGMs in ML
- For large data sets, PGMs provide an understanding of model relationships (theory) and problem structure (practice)
- Nature of PGMs:
  1. Represent joint distributions
     - Many variables without full independence
     - Expressive: trees, graphs
     - Declarative representation: knowledge of feature relationships
  2. Inference: separates model errors from algorithm errors
  3. Learning

Directed vs Undirected
- Directed graphical models: Bayesian networks
  - Useful in many real-world domains
- Undirected: Markov networks, also called Markov random fields
  - When there is no natural directionality between variables
  - Simpler independence/inference (no D-separation)
- Partially directed models, e.g., some forms of conditional random fields

BN REPRESENTATION

Complexity of Joint Distributions
- For a multinomial with k states for each variable, the full distribution requires k^n - 1 parameters
- With n = 6, k = 5: 5^6 - 1 = 15,624 parameters
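A quick sanity check of the count above, as a minimal sketch:

```python
# Number of free parameters in a full joint distribution over n
# discrete variables with k states each: the table has k**n cells,
# and the probabilities sum to 1, leaving k**n - 1 free parameters.
def full_joint_params(n: int, k: int) -> int:
    return k ** n - 1

print(full_joint_params(6, 5))  # 15624, as on the slide
```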

Bayesian Network
- Nodes: random variables; edges: influences
- Local models are combined by multiplying them, providing a factorization of the joint distribution:
  P(x) = ∏_{i=1}^{n} P(x_i | pa(x_i))
- Example (slide figure: a DAG G = {V, E, θ} over X1..X6):
  P(x) = P(x4) P(x6|x4) P(x2|x6) P(x3|x2) P(x1|x2,x6) P(x5|x1,x2)
- θ: 4 + (3 × 24) + (2 × 125) = 326 parameters, organized as six CPTs, e.g., P(X5 | X1, X2)
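A minimal sketch of this factorization, using the slide's DAG but with binary variables and made-up CPT values (all numbers below are illustrative, not from the slide):

```python
import itertools

# The slide's structure (x4 -> x6 -> x2 -> x3; x2,x6 -> x1; x1,x2 -> x5),
# each node with a local model P(node = 1 | parents).
cpts = {
    # node: (parents, {parent_values: P(node=1 | parents)})
    "x4": ((), {(): 0.6}),
    "x6": (("x4",), {(0,): 0.3, (1,): 0.8}),
    "x2": (("x6",), {(0,): 0.5, (1,): 0.1}),
    "x3": (("x2",), {(0,): 0.7, (1,): 0.2}),
    "x1": (("x2", "x6"), {(0, 0): 0.4, (0, 1): 0.9, (1, 0): 0.2, (1, 1): 0.6}),
    "x5": (("x1", "x2"), {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.8, (1, 1): 0.1}),
}

def joint(assign):
    """P(x) as the product of the local conditionals."""
    p = 1.0
    for node, (parents, table) in cpts.items():
        p1 = table[tuple(assign[q] for q in parents)]
        p *= p1 if assign[node] == 1 else 1.0 - p1
    return p

# The factorized product is a valid distribution: it sums to 1.
names = list(cpts)
total = sum(joint(dict(zip(names, vals)))
            for vals in itertools.product([0, 1], repeat=len(names)))
print(round(total, 10))  # 1.0
```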

Complexity of Inference in a BN
- Probability of evidence:
  P(E = e) = Σ_{X\E} ∏_{i=1}^{n} P(X_i | pa(X_i))
  summed over all settings of values for the non-evidence variables X\E
- An intractable problem: the number of possible assignments for X is k^n, and they have to be counted (#P-complete); tractable if tree-width < 25
  - P: solvable in polynomial time; NP: verifiable in polynomial time; #P-complete: counting how many solutions
- Approximations are usually sufficient (hence sampling): when P(Y=y | E=e) = 0.29292, approximation yields 0.3
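The sum above can be sketched as brute-force enumeration over the hidden variables, which makes the k^n blow-up concrete (the uniform three-variable joint is a made-up toy, not from the slide):

```python
import itertools

# Brute-force P(E = e): sum the joint over every assignment to the
# non-evidence variables X \ E.  The loop visits k**|X\E| assignments,
# which is why exact inference is intractable for large networks.
def prob_evidence(joint, variables, k, evidence):
    total = 0.0
    hidden = [v for v in variables if v not in evidence]
    for vals in itertools.product(range(k), repeat=len(hidden)):
        assign = dict(evidence)
        assign.update(zip(hidden, vals))
        total += joint(assign)
    return total

# Toy example: three binary variables with a uniform joint (1/8 each).
def uniform(assign):
    return 1.0 / 8.0

p = prob_evidence(uniform, ["a", "b", "c"], 2, {"a": 1})
print(p)  # 0.5
```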

BN PARAMETER LEARNING

Learning Parameters of BN
- Parameters define local interactions
- Straightforward, since the CPDs are local, e.g., P(x5 | x1, x2) in the graph G = {V, E, θ}
- Maximum likelihood estimate
- Bayesian estimate with a Dirichlet prior: posterior ∝ prior × likelihood
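The two estimators can be sketched for a single CPT entry (a multinomial); the data and the symmetric Dirichlet prior below are illustrative:

```python
from collections import Counter
from fractions import Fraction

# MLE: relative frequency of each state in the data.
def mle(data, states):
    counts = Counter(data)
    n = len(data)
    return {s: Fraction(counts[s], n) for s in states}

# Bayesian estimate: posterior mean under a Dirichlet(alpha, ..., alpha)
# prior, which amounts to add-alpha smoothing of the counts.
def dirichlet_mean(data, states, alpha=1):
    counts = Counter(data)
    n = len(data)
    return {s: Fraction(counts[s] + alpha, n + alpha * len(states))
            for s in states}

data = ["a", "a", "b"]
print(mle(data, ["a", "b"]))             # a: 2/3, b: 1/3
print(dirichlet_mean(data, ["a", "b"]))  # a: 3/5, b: 2/5
```

Note how the prior pulls the estimate toward uniform, which also keeps unseen states from getting probability zero.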

BN STRUCTURE LEARNING

Need for BN Structure Learning
- Structure
  - Causality cannot easily be determined by experts
  - Variables and structure may change with new data sets
- Parameters
  - Even when the structure is specified by an expert, experts cannot usually specify the parameters
  - Data sets can change over time
  - Parameters need to be learnt when the structure is learnt

Elements of BN Structure Learning
1. Local: independence tests
   - Measures of deviance-from-independence between variables
   - A rule for accepting/rejecting the hypothesis of independence
2. Global: structure scoring — goodness of the network

Independence Tests
For variables x_i, x_j in a data set D of M samples:
1. Pearson's chi-squared (χ²) statistic:
   d_χ²(D) = Σ_{x_i,x_j} (M[x_i,x_j] − M·P̂(x_i)·P̂(x_j))² / (M·P̂(x_i)·P̂(x_j))
   summed over all values of x_i and x_j. Under independence d_χ²(D) = 0; the value grows as the joint counts M[x_i,x_j] and the expected counts (under the independence assumption) differ.
2. Mutual information (K-L divergence between the joint and the product of marginals):
   d_I(D) = (1/M) Σ_{x_i,x_j} M[x_i,x_j] · log( M·M[x_i,x_j] / (M[x_i]·M[x_j]) )
   Under independence d_I(D) = 0; otherwise it is positive.
Decision rule:
   R_{d,t}(D): accept if d(D) ≤ t, reject if d(D) > t
The false-rejection probability due to the choice of t is its p-value.
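Both deviance measures can be sketched directly from a contingency table of counts M[x, y] (the 2×2 tables below are made up for illustration):

```python
import math

# Pearson's chi-squared statistic: compare observed counts with the
# expected counts M * P(x) * P(y) under the independence assumption.
def chi_squared(table):
    M = sum(sum(row) for row in table)
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    stat = 0.0
    for i, r in enumerate(table):
        for j, m_ij in enumerate(r):
            expected = row[i] * col[j] / M
            stat += (m_ij - expected) ** 2 / expected
    return stat

# Empirical mutual information: K-L divergence between the joint
# and the product of the marginals, estimated from counts.
def mutual_information(table):
    M = sum(sum(row) for row in table)
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    mi = 0.0
    for i, r in enumerate(table):
        for j, m_ij in enumerate(r):
            if m_ij > 0:
                mi += m_ij / M * math.log(M * m_ij / (row[i] * col[j]))
    return mi

independent = [[10, 10], [10, 10]]  # counts consistent with independence
print(chi_squared(independent), mutual_information(independent))  # 0.0 0.0
```

A dependent table such as [[15, 5], [5, 15]] yields strictly positive values for both statistics.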

Structure Scoring
1. Log-likelihood score for a graph G with n variables:
   score_L(G : D) = Σ_D Σ_{i=1}^{n} log P̂(x_i | pa x_i)
   summed over all data points and all variables x_i
2. Bayesian score, with a Dirichlet prior over graphs:
   score_B(G : D) = log p(D | G) + log p(G)
3. Bayesian information criterion:
   score_BIC(G : D) = l(θ̂_G : D) − (log M / 2) · Dim(G)
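The BIC formula can be sketched for the simplest case, a single multinomial (one CPT column): the maximized log-likelihood minus (log M / 2) times the number of free parameters.

```python
import math

# BIC for a multinomial estimated from counts: the maximized
# log-likelihood is sum_c c * log(c / M), and the dimension is the
# number of free parameters, len(counts) - 1.
def bic(counts):
    M = sum(counts)
    loglik = sum(c * math.log(c / M) for c in counts if c > 0)
    dim = len(counts) - 1
    return loglik - math.log(M) / 2 * dim

print(round(bic([2, 2]), 4))  # -3.4657
```

The penalty term grows with log M, so richer structures must buy their extra parameters with real likelihood gains as data accumulates.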

BN Structure Learning Algorithms
- Constraint-based
  - Find the best structure to explain the determined dependencies
  - Sensitive to errors in testing individual dependencies
- Score-based
  - Search the space of networks for a high-scoring structure
  - Since the space is super-exponential, heuristics are needed
  - Optimized branch and bound (de Campos, Zheng and Ji, 2009)
- Bayesian model averaging
  - Prediction averaged over all structures
  - May not have a closed form; limitation of χ² (Peters, Janzing and Schölkopf, 2011)

Greedy BN Structure Learning
- Consider pairs of variables ordered by χ² value
- Add the next edge if the score is increased
- (Slide figure: starting graph G* = {V, E*, θ*} with score s_{k-1}; candidate G_c1 = {V, E_c1, θ_c1} adds x4 → x5 with score s_c1; candidate G_c2 = {V, E_c2, θ_c2} adds x5 → x4 with score s_c2)
- Choose G_c1 or G_c2 depending on which one increases the score s(D, G), evaluated using cross-validation on a validation set
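The greedy loop above can be sketched with the scoring function abstracted away; the candidate list, edge names, and the toy score below are all hypothetical stand-ins for the slide's χ²-ordered candidates and s(D, G):

```python
# Walk candidate edges (assumed pre-sorted by decreasing chi-squared
# value) and keep an edge only if it improves the current score.
def greedy_structure(candidates, score):
    edges = set()
    best = score(edges)
    for edge in candidates:
        trial = edges | {edge}
        s = score(trial)
        if s > best:
            edges, best = trial, s
    return edges

# Toy score: reward edges in a hypothetical "true" set, penalize others.
true_edges = {("x4", "x5"), ("x2", "x3")}
def score(es):
    return sum(1 if e in true_edges else -1 for e in es)

found = greedy_structure(
    [("x4", "x5"), ("x2", "x3"), ("x1", "x6")], score)
print(sorted(found))  # [('x2', 'x3'), ('x4', 'x5')]
```

In practice the score would be s(D, G) on held-out data, so the loop stops adding edges once extra structure no longer generalizes.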

Branch and Bound Algorithm
- Score-based: minimum description length (log-loss)
- Complexity O(n · 2^n)
- An any-time algorithm: can stop at the current best solution

Causal Models
- Causality: a relation between an event (the cause) and a second event (the effect), where the second is understood to be a consequence of the first
- Examples: rain causes mud; smoking causes cancer; altitude lowers temperature
- A Bayesian network need not be causal; it is only an efficient representation in terms of conditional distributions

Causality in Philosophy
- The dream of philosophers
  - Democritus (460-370 BC), father of modern science: "I would rather discover one causal law than gain the kingdom of Persia"
- Indian philosophy
  - Karma in Sanatana Dharma: a person's actions cause certain effects in the current and future life, either positively or negatively

Causality in Medicine
- Medical treatment: possible effects of a medicine; the right treatment saves lives
- Example: vitamin D and arthritis
- Correlation versus causation: need for a randomized controlled trial

Relationships between Events A and B
- A causes B
- B causes A
- Common causes for A and B, which do not cause each other
- Correlation is a broader concept than causation

Example of a Causal Model
(Slide figure: a DAG with nodes Smoking, Lung Cancer, and Other Causes of Lung Cancer)
- The statement "smoking causes cancer" implies an asymmetric relationship: smoking leads to lung cancer, but lung cancer will not cause smoking; the arrow indicates this causal relationship
- No arrow between Smoking and Other Causes of Lung Cancer means there is no direct causal relationship between them

STATISTICAL MODELING OF CAUSE-EFFECT

Additive Noise Model
- Test whether variables x and y are independent
- If not, test whether y = f(x) + e is consistent with the data, where f is obtained by nonlinear regression
  - If the residuals e = y − f(x) are independent of x, accept y = f(x) + e; otherwise reject it
- Similarly, test x = g(y) + e
- If both are accepted or both rejected, a more complex relationship is needed
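A minimal sketch of the test procedure, with a plain linear fit standing in for the slide's nonlinear regression and plain correlation standing in for a real independence test (the logic is the same; the data are made up):

```python
# Fit y = f(x) + e by least squares, then check whether the residuals
# e depend on x.  A full additive-noise test would use a nonlinear
# regressor for f and a proper independence test (e.g. HSIC) here.
def fit_linear(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def residuals(x, y):
    slope, intercept = fit_linear(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv) if su * sv else 0.0

x = [0, 1, 2, 3, 4, 5]
noise = [0.1, -0.2, 0.0, 0.2, -0.1, 0.0]
y = [2 * a + e for a, e in zip(x, noise)]
e = residuals(x, y)
# Least squares makes residuals uncorrelated with x by construction,
# so a crude correlation check accepts this direction of the model.
print(abs(corr(x, e)) < 1e-9)  # True
```

The asymmetry the slide exploits only shows up with a genuinely nonlinear f and an independence measure stronger than correlation; this sketch just illustrates the fit-then-test-residuals structure.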

Regression and Residuals
(Slide figure: regression fits for altitude and temperature; residuals are more dependent upon temperature)
- p-values: forward model = 0.0026; backward model = 5 × 10^-12
- Conclusion: admit altitude → temperature

Directed (Causal) Graphs
(Slide figure: a DAG over nodes A, B, C, D, E, F)
- A and B are causally independent
- C, D, E, and F are causally dependent on A and B
- A and B are direct causes of C, and indirect causes of D, E and F
- If C is prevented from changing with A and B, then A and B will no longer cause changes in D, E and F

Causal BN Structure Learning
1. Construct a PDAG by removing edges from the complete undirected graph
2. Use the χ² test to sort dependencies
3. Orient the most dependent edge using the additive noise model
4. Apply causal forward propagation to orient other undirected edges
5. Repeat until all edges are oriented

Comparison of Algorithms
(Slide figure: results compared for the greedy, branch-and-bound, and causal structure learning algorithms)

Intervention
(Slide figure: the intervention of turning a sprinkler on)

Conclusion
1. Big Data: scientific discovery; most advances in technology
2. Machine Learning: methods suited to descriptive and predictive analytics
3. Causal Models: scientific discovery; prescriptive analytics