Correlation mining in massive data
|
|
|
- Wilfrid Carter
- 10 years ago
- Views:
Transcription
1 Correlation mining in massive data Alfred Hero University of Michigan - Ann Arbor April 11, / 58
2 1 Corrrelation mining 2 Caveats: why we need to be careful! 3 Dependency models 4 Correlation Mining Theory 5 Experiments 6 Summary and perspectives 2 / 58
3 Outline 1 Corrrelation mining 2 Caveats: why we need to be careful! 3 Dependency models 4 Correlation Mining Theory 5 Experiments 6 Summary and perspectives 3 / 58
4 Correlation mining The objective of correlation mining is to discover interesting or unusual patterns of dependency among a large number of variables (sequences, signals, images, videos). Related to: Pattern mining, anomaly detection, cluster analysis Graph analytics, community detection, node/edge analysis Gaussian Graphical models (GGM) - Lauritzen 1996 Big Data aspects: Large numbers of signals, images, videos Observed correlations between signals are incomplete and noisy Number of samples number of objects of interest 4 / 58
5 Correlation mining for Internetwork anomaly detection Patwari, H and Pacholski, Manifold learning visualization of network traffic data, SIGCOMM / 58
6 Correlation mining for SPAM community detection p = 100, 000, n = 30 6 / 58
7 Correlation mining for materials science p = 1, 000, 000, 000, n = 1000 to 100, 000 Park, Wei, DeGraef, Shah, Simmons and H, EBSD image segmentation using a physics-based model, submitted 7 / 58
8 Correlation mining for neuroscience p = 100, n 1 = 50, n 2 = 50 Xu, Syed and H, EEG spatial decoding with shrinkage optimized directed information assessment, ICASSP / 58
9 Correlation mining for finance p = 40, 000, n 1 = 60, n 2 = 80 Source: What is behind the fall in cross assets correlation? J-J Ohana, 30 mars 2011, Riskelia s blog. Left: Average correlation: 0.42, percent of strong relations 33% Right: Average correlation: 0.3, percent of strong relations 20% Hubs of high correlation influence the market. What hubs changed or persisted in Q4-10 and Q1-11? 9 / 58
10 Correlation mining for biology: gene-gene network p = 24, 000, n = 270 Source: Huang,..., and H, PLoS Genetics, 2011 Gene expression correlation graph Q: What genes are hubs in this correlation graph? 10 / 58
11 Correlation mining for biology: gene-gene network 11 / 58
12 Correlation mining for predictive medicine: bipartite graph Q: What genes are predictive of certain symptom combinations? Firouzi, Rajaratnam and H, Predictive correlation screening, AISTATS / 58
13 Outline 1 Corrrelation mining 2 Caveats: why we need to be careful! 3 Dependency models 4 Correlation Mining Theory 5 Experiments 6 Summary and perspectives 13 / 58
14 Sample correlation: p = 2 variables n = 50 samples Sample correlation: ĉorr X,Y = n i=1 (X i X )(Y i Y ) n i=1 (X i X ) 2 [ 1, 1] n i=1 (Y i Y ) 2, Positive correlation =1 Negative correlation =-1 14 / 58
15 Sample correlation for random sequences: p = 2, n = 50 Q: Are the two time sequences X i and Y j correlated, e.g. ĉorr XY > 0.5? 15 / 58
16 Sample correlation for random sequences: p = 2, n = 50 Q: Are the two time sequences X i and Y j correlated? A: No. Computed over range i = 1,... 50: ĉorr XY = / 58
17 Sample correlation for random sequences: p = 2, n < 15 Q: Are the two time sequences X i and Y j correlated? A: Yes. ĉorr XY > 0.5 over range i = 3, and ĉorr XY < 0.5 over range i = 29,..., / 58
18 Real-world example: reported correlation divergence Source: FuturesMag.com 18 / 58
19 Correlating a set of p = 20 sequences Q: Are any pairs of sequences correlated? Are there patterns of correlation? 19 / 58
20 Thresholded (0.5) sample correlation matrix Apparent patterns emerge after thresholding each pairwise correlation at ± / 58
21 Associated sample correlation graph Graph has an edge between node (variable) i and j if ij-th entry of thresholded correlation is non-zero. Sequences are actually uncorrelated Gaussian. 21 / 58
22 The problem of false discoveries: phase transitions Number of discoveries exhibit phase transition phenomenon This phenomenon gets worse as p/n increases. Example: false discoveries of high correlation for uncorrelated Gaussian variables 22 / 58
23 Outline 1 Corrrelation mining 2 Caveats: why we need to be careful! 3 Dependency models 4 Correlation Mining Theory 5 Experiments 6 Summary and perspectives 23 / 58
24 Random matrix measurement model Variable 1 Variable 2... Variable p Sample 1 X 11 X X 1p Sample 2 X 21 X X 2p Sample n X n1 X n2... X np n p measurement matrix X has i.i.d. elliptically distributed rows X 11 X 1p X 1 X = =. = [X 1,..., X p ] X n1 X np X n Columns of X index variables while rows index i.i.d. samples p p covariance (dispersion) matrix associated with each row is cov(x i ) = Σ 24 / 58
25 Sparse multivariate dependency models Two types of sparse (ensemble) correlation models: Sparse correlation (Σ) graphical models: Most correlation are zero, few marginal dependencies Examples: M-dependent processes, moving average (MA) processes Sparse inverse-correlation (K = Σ 1 ) graphical models Most inverse covariance entries are zero, few conditional dependencies Examples: Markov random fields, autoregressive (AR) processes, global latent variables Sometimes correlation matrix and its inverse are both sparse Often only one of them is sparse Refs: Meinshausen-Bühlmann (2006), Friedman (2007), Bannerjee (2008), Wiesel-Eldar-H (2010). 25 / 58
26 Gaussian graphical models - GGM - (Lauritzen 1996) Multivariate Gaussian model p(x) = K 1/2 (2π) exp 1 p/2 2 i,j=1 where K = [cov(x)] 1 : p p precision matrix G has an edge e ij iff [K] ij 0 p x i x j [K] ij Adjacency matrix B of G obtained by thresholding K B = h(k), h(u) = 1 2 (sgn( u ρ) + 1) ρ is arbitrary positive threshold 26 / 58
27 Banded Gaussian graphical model G Figure: Left: inverse covariance matrix K. Right: associated graphical model Example: Autoregressive (AR) process: X n+1 = ax n + W n for which X = [X 1,..., X p ] satisfies [I A]X = W and K = cov 1 (X) = σ 2 W [I A][I A]T. 27 / 58
28 Block diagonal Gaussian graphical model G Figure: Left: inverse covariance matrix K. Right: associated graphical model Example: X n = [Y n, Z n ], Y n, Z n independent processes. 28 / 58
29 Two coupled block Gaussian graphical model G Example: X n = [Y n + U n, U n, Z n + U n ], Y n, Z n, U n independent AR processes. 29 / 58
30 Multiscale Gaussian graphical model G 30 / 58
31 Spatial graphical model: Poisson random field Let p t (x, y) be a space-time process satisfying Poisson equation 2 p t x p t y 2 = W t where W t = W t (x, y) is driving process. For small x, y, p satisfies the difference equation: Xi,j t = (X i+1,j t + X i 1,j t ) 2 y + (Xi,j+1 t + X i,j 1 t ) 2 x Wi,j t 2 x 2 y 2( 2 x + 2 y) In matrix form, as before: [I A]X t = W t and K = cov 1 (X t ) = σw 2 [I A][I A]T A is sparse pentadiagonal matrix. 31 / 58
32 Random field generated from Poisson equation Figure: Poisson random field. W t = N iso + sin(ω 1 t)e 1 + sin(ω 2 t)e 2 (ω 1 = 0.025, ω 2 = , SNR=0dB). 32 / 58
33 Empirical partial correlation map for spatial random field Figure: Empirical parcorr at various threshold levels. p=600, n= / 58
34 Empirical correlation map of spatial random field Figure: Empirical corr at various threshold levels. p=600, n= / 58
35 Outline 1 Corrrelation mining 2 Caveats: why we need to be careful! 3 Dependency models 4 Correlation Mining Theory 5 Experiments 6 Summary and perspectives 35 / 58
36 Correlation mining: theory Given Number of nodes = p Number of samples =n Correlation threshold = ρ p p matrix of sample correlations Sparse graph assumption: # true edges p 2 Questions Can we predict critical phase transition threshold ρ c? What level of confidence/significance can one have on discoveries for ρ > ρ c? Are there ways to predict the number of required samples for given threshold level and level of statistical significance? 36 / 58
37 Relevant work Regularized l 2 or l F covariance estimation Banded covariance model: Bickel-Levina (2008) Sparse eigendecomposition model: Johnstone-Lu (2007) Stein shrinkage estimator: Ledoit-Wolf (2005), Chen-Weisel-Eldar-H (2010) Gaussian graphical model selection l 1 regularized GGM: Meinshausen-Bühlmann (2006), Wiesel-Eldar-H (2010). Bayesian estimation: Rajaratnam-Massam-Carvalho (2008) Independence testing Sphericity test for multivariate Gaussian: Wilks (1935) Maximal correlation test: Moran (1980), Eagleson (1983), Jiang (2004), Zhou (2007), Cai and Jiang (2011) Correlation hub screening (H, Rajaratnam 2011, 2012) Fixed n, asymptotic in p, covers partial correlation too. Discover degree k hubs test maximal k-nn correlation. 37 / 58
38 Hub screening theory (H and Rajaratnam 2012) Empirical hub discoveries: For threshold ρ and degree parameter δ define number N δ,ρ of vertices in sample partial-correlation graph with degree d i δ N δ,ρ = p i=1 φ δ,i { 1, card{j : j i, Ωij ρ} δ φ δ,i = 0, o.w. Ω = diag(r ) 1/2 R diag(r ) 1/2 is sample partial correlation matrix and R is sample correlation matrix R = diag( ˆΣ) 1/2 ˆΣdiag( ˆΣ) 1/2 38 / 58
39 Asymptotic familywise false-discovery rate Asymptotic limit on false discoveries: (H and Rajaratnam 2012): Assume that rows of X are i.i.d. with bounded elliptically contoured density and block sparse covariance (null hypothesis). Theorem Let p and ρ = ρ p satisfy lim p p 1/δ (p 1)(1 ρ 2 p) (n 2)/2 = e n,δ. Then { 1 exp( λδ,ρ,n /2), δ = 1 P(N δ,ρ > 0) 1 exp( λ δ,ρ,n ), δ > 1. ( ) p 1 λ δ,ρ,n = p (P 0 (ρ, n)) δ δ P 0 (ρ, n) = 2B((n 2)/2, 1/2) 1 ρ (1 u 2 ) n 4 2 du 39 / 58
40 Elements of proof Z-score representations for sample (partial) correlation (P) R R = U T U, P = U T [UU T ] 2 U, (U = n 1 p) P 0 (ρ, n): probability that a uniformly distributed vector Z S n 2 falls in cap(r, U) cap(r, U) with r = 2(1 ρ). As p, N δ,ρ behaves like a Poisson random variable: P(N δ,ρ = 0) e λ δ,ρ,n 40 / 58
41 Poisson-like convergence rate Under assumption that p 1/δ (p 1)(1 ρ 2 p) (n 2)/2 = O(1) can apply Chen-Stein to obtain bound ( }) P(N δ,ρ = 0) e λ δ,ρ,n O max {p 1/δ, p 1/(n 2), p,n,k,δ p,n,k,δ is dependency coefficient between δ-nearest-neighbors of Y i and its p k furthest neighbors 41 / 58
42 Predicted phase transition for false hub discoveries False discovery probability: P(N δ,ρ > 0) 1 exp( λ δ,ρ,n ) p=10 (δ = 1) p= / 58
43 Predicted phase transition for false hub discoveries False discovery probability: P(N δ,ρ > 0) 1 exp( λ δ,ρ,n ) Critical threshold: p=10 (δ = 1) p=10000 ρ c = 1 c δ,n (p 1) 2δ/δ(n 2) 2 43 / 58
44 Phase transitions as function of δ, p 44 / 58
45 Experimental Design Table (EDT): mining connected nodes n α \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \0.75 Table: Design table for spike-in model: p = 1000, detection power β = 0.8. Achievable limits in FPR (α) as function of n, minimum detectable correlation ρ 1, and level α correlation threshold (shown as ρ 1 \ρ in table). 45 / 58
46 Experimental validation Figure: Targeted ROC operating points (α, β) (diamonds) and observed operating points (number pairs) of correlation screen designed from Experimental Design Table. Each observed operating point determined by the sample size n ranging over n = 10, 15, 20, 25, 30, / 58
47 From false positive rate for fixed ρ to p-values Recall asymptotic false positive probability for fixed δ, n, ρ P(N δ,ρ > 0) = 1 exp( λ δ,ρ,n ) Can relate false postive probability to maximal correlation: P(N δ,ρ > 0) = P(max ρ i (δ) > ρ) i with ρ i (k) the (partial) correlation between i and its k-nn. p-value associated with vertex i having observed k-nn (partial) correlation = ˆρ i (k). pv k (i) = 1 exp( λ k,ˆρi (k),n) 47 / 58
48 Outline 1 Corrrelation mining 2 Caveats: why we need to be careful! 3 Dependency models 4 Correlation Mining Theory 5 Experiments 6 Summary and perspectives 48 / 58
49 Example: 4-node-dependent Graphical Model Figure: Graphical model with 4 nodes. Vertex degree distribution: 1 degree 1 node, 2 degree 2 nodes, 1 degree 3 node. 49 / 58
50 Example: First 10 nodes of 1000-node Graphical Model 4 node Gaussian graphical model embedded into 1000 node network with 996 i.i.d. nuisance nodes Simulate 40 observations from these 1000 variables. Critical threshold is ρ c,1 = % level threshold is ρ = / 58
51 Example: 1000-node Graphical Model Note: log(λ) = 2 is equivalent to pv= 1 e elogλ = / 58
52 Example: NKI gene expression dataset Netherlands Cancer Institute (NKI) early stage breast cancer p = 24, 481 gene probes on Affymetrix HU133 GeneChip 295 samples (subjects) Peng et al used 266 of these samples to perform covariance selection They preprocessed (Cox regression) to reduce number of variables to 1, 217 genes They applied sparse partial correlation estimation (SPACE) Here we apply hub screening directly to all 24, 481 gene probes Theory predicts phase transition threshold ρ c,1 = / 58
53 NKI p-value waterfall plot for partial correlation hubs: selected discoveries shown 53 / 58
54 NKI p-value waterfall plot for partial correlation hubs: Peng et al discoveries shown 54 / 58
55 NKI p-value waterfall plot for correlation hubs 55 / 58
56 Outline 1 Corrrelation mining 2 Caveats: why we need to be careful! 3 Dependency models 4 Correlation Mining Theory 5 Experiments 6 Summary and perspectives 56 / 58
57 Summary and perspectives Conclusions For large p correlation mining hypersensitive to false positives Theory of false positive phase transitions and significance has been developed in context of hub screening on R, R, and R xr xy. Other problems of interest Higher order measures of dependence (information flow) Time dependent samples of correlated multivariates Missing data some components of multivariate are intermittent Screening for other non-isomorphic sub-graphs Vector valued node attributes: canonical correlations. Misaligned signals: account for registration errors. 57 / 58
58 Y. Chen, A. Wiesel, Y.C. Eldar, and A.O. Hero. Shrinkage algorithms for mmse covariance estimation. Signal Processing, IEEE Transactions on, 58(10): , A. Hero and B. Rajaratnam. Hub discovery in partial correlation models. IEEE Trans. on Inform. Theory, 58(9): , available as Arxiv preprint arxiv: A.O. Hero and B. Rajaratnam. Large scale correlation screening. Journ. of American Statistical Association, 106(496): , Dec Available as Arxiv preprint arxiv: A. Wiesel, Y.C. Eldar, and A.O. Hero. Covariance estimation in decomposable gaussian graphical models. Signal Processing, IEEE Transactions on, 58(3): , K.S. Xu, M. Kliger, Y. Chen, P. Woolf, and A.O. Hero. Revealing social networks of spammers through spectral clustering. In IEEE International COnference on Communications (ICC), june / 58
Extracting correlation structure from large random matrices
Extracting correlation structure from large random matrices Alfred Hero University of Michigan - Ann Arbor Feb. 17, 2012 1 / 46 1 Background 2 Graphical models 3 Screening for hubs in graphical model 4
Statistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA [email protected]
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
DATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.
Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
Complex Networks Analysis: Clustering Methods
Complex Networks Analysis: Clustering Methods Nikolai Nefedov Spring 2013 ISI ETH Zurich [email protected] 1 Outline Purpose to give an overview of modern graph-clustering methods and their applications
Sections 2.11 and 5.8
Sections 211 and 58 Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I 1/25 Gesell data Let X be the age in in months a child speaks his/her first word and
BayesX - Software for Bayesian Inference in Structured Additive Regression
BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
Part 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -
Lecture 8: Signal Detection and Noise Assumption
ECE 83 Fall Statistical Signal Processing instructor: R. Nowak, scribe: Feng Ju Lecture 8: Signal Detection and Noise Assumption Signal Detection : X = W H : X = S + W where W N(, σ I n n and S = [s, s,...,
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
Uncertainty quantification for the family-wise error rate in multivariate copula models
Uncertainty quantification for the family-wise error rate in multivariate copula models Thorsten Dickhaus (joint work with Taras Bodnar, Jakob Gierl and Jens Stange) University of Bremen Institute for
(67902) Topics in Theory and Complexity Nov 2, 2006. Lecture 7
(67902) Topics in Theory and Complexity Nov 2, 2006 Lecturer: Irit Dinur Lecture 7 Scribe: Rani Lekach 1 Lecture overview This Lecture consists of two parts In the first part we will refresh the definition
Statistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network
, pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
Graphical Modeling for Genomic Data
Graphical Modeling for Genomic Data Carel F.W. Peeters [email protected] Joint work with: Wessel N. van Wieringen Mark A. van de Wiel Molecular Biostatistics Unit Dept. of Epidemiology & Biostatistics
Performance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Network (Tree) Topology Inference Based on Prüfer Sequence
Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 [email protected],
Tutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Lecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Traffic Driven Analysis of Cellular Data Networks
Traffic Driven Analysis of Cellular Data Networks Samir R. Das Computer Science Department Stony Brook University Joint work with Utpal Paul, Luis Ortiz (Stony Brook U), Milind Buddhikot, Anand Prabhu
NETZCOPE - a tool to analyze and display complex R&D collaboration networks
The Task Concepts from Spectral Graph Theory EU R&D Network Analysis Netzcope Screenshots NETZCOPE - a tool to analyze and display complex R&D collaboration networks L. Streit & O. Strogan BiBoS, Univ.
Multivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
Department of Economics
Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional
Cluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Analysis of Internet Topologies
Analysis of Internet Topologies Ljiljana Trajković [email protected] Communication Networks Laboratory http://www.ensc.sfu.ca/cnl School of Engineering Science Simon Fraser University, Vancouver, British
CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
A Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 [email protected] Abstract The main objective of this project is the study of a learning based method
10.2 ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS. The Jacobi Method
578 CHAPTER 1 NUMERICAL METHODS 1. ITERATIVE METHODS FOR SOLVING LINEAR SYSTEMS As a numerical technique, Gaussian elimination is rather unusual because it is direct. That is, a solution is obtained after
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Tracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object
Message-passing sequential detection of multiple change points in networks
Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph
Linear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
Mining Social-Network Graphs
342 Chapter 10 Mining Social-Network Graphs There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is
Methods of Data Analysis Working with probability distributions
Methods of Data Analysis Working with probability distributions Week 4 1 Motivation One of the key problems in non-parametric data analysis is to create a good model of a generating probability distribution,
Anomaly detection. Problem motivation. Machine Learning
Anomaly detection Problem motivation Machine Learning Anomaly detection example Aircraft engine features: = heat generated = vibration intensity Dataset: New engine: (vibration) (heat) Density estimation
A Statistical Text Mining Method for Patent Analysis
A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, [email protected] Abstract Most text data from diverse document databases are unsuitable for analytical
Statistics and Probability Letters. Goodness-of-fit test for tail copulas modeled by elliptical copulas
Statistics and Probability Letters 79 (2009) 1097 1104 Contents lists available at ScienceDirect Statistics and Probability Letters journal homepage: www.elsevier.com/locate/stapro Goodness-of-fit test
THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok
THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan
Network Monitoring in Multicast Networks Using Network Coding
Network Monitoring in Multicast Networks Using Network Coding Tracey Ho Coordinated Science Laboratory University of Illinois Urbana, IL 6181 Email: [email protected] Ben Leong, Yu-Han Chang, Yonggang Wen
Random graphs with a given degree sequence
Sourav Chatterjee (NYU) Persi Diaconis (Stanford) Allan Sly (Microsoft) Let G be an undirected simple graph on n vertices. Let d 1,..., d n be the degrees of the vertices of G arranged in descending order.
Direct Methods for Solving Linear Systems. Matrix Factorization
Direct Methods for Solving Linear Systems Matrix Factorization Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin City University c 2011
Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 2, FEBRUARY 2002 359 Communication on the Grassmann Manifold: A Geometric Approach to the Noncoherent Multiple-Antenna Channel Lizhong Zheng, Student
A Statistical Framework for Operational Infrasound Monitoring
A Statistical Framework for Operational Infrasound Monitoring Stephen J. Arrowsmith Rod W. Whitaker LA-UR 11-03040 The views expressed here do not necessarily reflect the views of the United States Government,
Probability and Random Variables. Generation of random variables (r.v.)
Probability and Random Variables Method for generating random variables with a specified probability distribution function. Gaussian And Markov Processes Characterization of Stationary Random Process Linearly
1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
203.4770: Introduction to Machine Learning Dr. Rita Osadchy
203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:
Forecasting Geographic Data Michael Leonard and Renee Samy, SAS Institute Inc. Cary, NC, USA
Forecasting Geographic Data Michael Leonard and Renee Samy, SAS Institute Inc. Cary, NC, USA Abstract Virtually all businesses collect and use data that are associated with geographic locations, whether
Practical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
QUALITY ENGINEERING PROGRAM
QUALITY ENGINEERING PROGRAM Production engineering deals with the practical engineering problems that occur in manufacturing planning, manufacturing processes and in the integration of the facilities and
Design of LDPC codes
Design of LDPC codes Codes from finite geometries Random codes: Determine the connections of the bipartite Tanner graph by using a (pseudo)random algorithm observing the degree distribution of the code
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs
CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like
LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as
LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values
CONDITIONAL, PARTIAL AND RANK CORRELATION FOR THE ELLIPTICAL COPULA; DEPENDENCE MODELLING IN UNCERTAINTY ANALYSIS
CONDITIONAL, PARTIAL AND RANK CORRELATION FOR THE ELLIPTICAL COPULA; DEPENDENCE MODELLING IN UNCERTAINTY ANALYSIS D. Kurowicka, R.M. Cooke Delft University of Technology, Mekelweg 4, 68CD Delft, Netherlands
Bayesian Statistics: Indian Buffet Process
Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note
. (3.3) n Note that supremum (3.2) must occur at one of the observed values x i or to the left of x i.
Chapter 3 Kolmogorov-Smirnov Tests There are many situations where experimenters need to know what is the distribution of the population of their interest. For example, if they want to use a parametric
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Model-Based Cluster Analysis for Web Users Sessions
Model-Based Cluster Analysis for Web Users Sessions George Pallis, Lefteris Angelis, and Athena Vakali Department of Informatics, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece [email protected]
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
The Open University s repository of research publications and other research outputs
Open Research Online The Open University s repository of research publications and other research outputs The degree-diameter problem for circulant graphs of degree 8 and 9 Journal Article How to cite:
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Degrees of Freedom and Model Search
Degrees of Freedom and Model Search Ryan J. Tibshirani Abstract Degrees of freedom is a fundamental concept in statistical modeling, as it provides a quantitative description of the amount of fitting performed
Behavior Analysis in Crowded Environments. XiaogangWang Department of Electronic Engineering The Chinese University of Hong Kong June 25, 2011
Behavior Analysis in Crowded Environments XiaogangWang Department of Electronic Engineering The Chinese University of Hong Kong June 25, 2011 Behavior Analysis in Sparse Scenes Zelnik-Manor & Irani CVPR
HT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
Multivariate normal distribution and testing for means (see MKB Ch 3)
Multivariate normal distribution and testing for means (see MKB Ch 3) Where are we going? 2 One-sample t-test (univariate).................................................. 3 Two-sample t-test (univariate).................................................
Link Prediction in Social Networks
CS378 Data Mining Final Project Report Dustin Ho : dsh544 Eric Shrewsberry : eas2389 Link Prediction in Social Networks 1. Introduction Social networks are becoming increasingly more prevalent in the daily
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
Correlation in Random Variables
Correlation in Random Variables Lecture 11 Spring 2002 Correlation in Random Variables Suppose that an experiment produces two random variables, X and Y. What can we say about the relationship between
Learning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
Chapter 19. General Matrices. An n m matrix is an array. a 11 a 12 a 1m a 21 a 22 a 2m A = a n1 a n2 a nm. The matrix A has n row vectors
Chapter 9. General Matrices An n m matrix is an array a a a m a a a m... = [a ij]. a n a n a nm The matrix A has n row vectors and m column vectors row i (A) = [a i, a i,..., a im ] R m a j a j a nj col
The Characteristic Polynomial
Physics 116A Winter 2011 The Characteristic Polynomial 1 Coefficients of the characteristic polynomial Consider the eigenvalue problem for an n n matrix A, A v = λ v, v 0 (1) The solution to this problem
Graph Theory and Complex Networks: An Introduction. Chapter 06: Network analysis
Graph Theory and Complex Networks: An Introduction Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.0, [email protected] Chapter 06: Network analysis Version: April 8, 04 / 3 Contents Chapter
Classification by Pairwise Coupling
Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating
Network Algorithms for Homeland Security
Network Algorithms for Homeland Security Mark Goldberg and Malik Magdon-Ismail Rensselaer Polytechnic Institute September 27, 2004. Collaborators J. Baumes, M. Krishmamoorthy, N. Preston, W. Wallace. Partially
