STATISTICAL MODELS AND ISSUES IN THE ANALYSIS OF NETWORK DATA


1 STATISTICAL MODELS AND ISSUES IN THE ANALYSIS OF NETWORK DATA George Michailidis Department of Statistics, University of Michigan gmichail Plenary Talk UP-STAT 13 Rochester Institute of Technology, April 2013

2 WHY NETWORKS? Relatively small field of study until the late 1990s. Explosive growth of interest and work on networks from 2000 onward. Main factors: development of high-throughput technologies; a systems-level perspective in science; new modeling techniques and computational advances.

3 SOME RECENT DEVELOPMENTS I Rapid increase in publications

4 SOME RECENT DEVELOPMENTS II New books, courses and journals

5 SOME RECENT DEVELOPMENTS III Dedicated workshops

6 WHAT IS A NETWORK? A collection of interconnected entities Mathematically, it is convenient to represent it as a graph G = (V,E), where V denotes the set of nodes (vertices) and E the set of edges

7 EXAMPLES OF NETWORKS Networks have become an integral tool for addressing diverse problems in a number of scientific fields. For example: Technological (e.g. communications, transportation, energy, sensor) Biological (e.g. gene regulation, protein interactions, predator-prey relations) Social (e.g. friendship, , trade flows) Informational (e.g. Web, Twitter, peer-to-peer)

8 STATISTICS AND NETWORK ANALYSIS - Network analysis has attracted participants from diverse scientific fields, including information scientists, statisticians, mathematicians, applied physicists, complex systems theorists, ... - Uneven developments - Some topics rediscover and employ established techniques, others require new models and tools - Nevertheless, some key topics have emerged that are truly statistical in nature

9 STATISTICS AND NETWORK ANALYSIS Network characterization (e.g. importance of nodes, identification of network communities, properties of the degree distribution) Network sampling - novel sampling schemes to construct a network; e.g. induced subgraph sampling, incident subgraph sampling, snowball sampling, link tracing Network inference - identify the topology from data; e.g. link prediction, graphical modeling Network dynamics: (i) stochastic processes (flows) on graphs, (ii) evolution of graphs over time Network visualization Time-varying networks Incorporation of network information in statistical inference

10 NETWORK INFERENCE BASED ON GRAPHICAL MODELS Some background on graphical models: they represent conditional independence relationships between a set of random variables. No edge between $X_j$ and $X_{j'}$ $\Leftrightarrow$ $X_j$ is independent of $X_{j'}$ conditional on all other variables. Typically estimated from a set of $n$ iid observations on $p$ variables.

11 EXAMPLE 1: TEXT MINING [Figure: network whose nodes are word stems, e.g. address, mail, phone, offic, inform, project, student, research, comput, web, scienc, univers, depart.]

12 EXAMPLE 2: ROLL CALL DATA [Figure: network whose nodes are U.S. senators, labelled alphabetically from Akaka to Wyden.]

13 EXAMPLE 3: GENE NETWORKS

14 GAUSSIAN GRAPHICAL MODELS $X_1,\dots,X_p$ jointly follow $N(0,\Sigma)$. Dependence structure fully characterized by the covariance structure. Let $\rho_{j,j'} = \mathrm{cor}(X_j, X_{j'} \mid \text{others})$ denote the partial correlation. PARTIAL CORRELATION: Nodes $j$ and $j'$ are connected $\Leftrightarrow$ $\rho_{j,j'} \neq 0$.

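15 GAUSSIAN GRAPHICAL MODELS (CTD) INVERSE COVARIANCE MATRIX Let $\Omega = \Sigma^{-1}$ denote the inverse covariance matrix. We have $\rho_{j,j'} = -\omega_{j,j'} / \sqrt{\omega_{j,j}\,\omega_{j',j'}}$, so $\rho_{j,j'} = 0$ exactly when $\omega_{j,j'} = 0$.
[Figure: five-node graph with edges 1-3, 1-4, 2-4, 3-4 and 3-5, labelled by partial correlations such as $\rho_{13}$ and $\rho_{35}$.]
$$\Omega = \begin{pmatrix} \omega_{1,1} & 0 & \omega_{1,3} & \omega_{1,4} & 0 \\ 0 & \omega_{2,2} & 0 & \omega_{2,4} & 0 \\ \omega_{3,1} & 0 & \omega_{3,3} & \omega_{3,4} & \omega_{3,5} \\ \omega_{4,1} & \omega_{4,2} & \omega_{4,3} & \omega_{4,4} & 0 \\ 0 & 0 & \omega_{5,3} & 0 & \omega_{5,5} \end{pmatrix}$$
Hence, estimating a Gaussian graphical model $\Leftrightarrow$ estimating $\Omega$. Also, estimating the graph corresponds to identifying the zeros in $\Omega$.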
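The link between the zeros of $\Omega$ and the edges of the graph can be illustrated with a short NumPy sketch. The function name and the numeric entries below are hypothetical; only the zero pattern is taken from the five-node example on the slide.

```python
import numpy as np

def partial_correlations(omega):
    """Map a precision matrix Omega to partial correlations:
    rho_jk = -omega_jk / sqrt(omega_jj * omega_kk)."""
    d = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(d, d)
    np.fill_diagonal(rho, 1.0)
    return rho

# Made-up positive definite precision matrix with the slide's zero pattern
# (edges 1-3, 1-4, 2-4, 3-4, 3-5).
omega = np.array([
    [2.0, 0.0, 0.5, 0.4, 0.0],
    [0.0, 2.0, 0.0, 0.6, 0.0],
    [0.5, 0.0, 2.0, 0.3, 0.7],
    [0.4, 0.6, 0.3, 2.0, 0.0],
    [0.0, 0.0, 0.7, 0.0, 2.0],
])
rho = partial_correlations(omega)

# Edges are the nonzero off-diagonal partial correlations (1-based labels).
edges = [(j + 1, k + 1)
         for j in range(5) for k in range(j + 1, 5)
         if rho[j, k] != 0]
print(edges)
```

Reading the edge list off `rho` or off `omega` gives the same graph, which is why the estimation problem can be stated entirely in terms of $\Omega$.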

16 THE CASE OF HIGH-DIMENSIONAL DATA What happens if we have few samples and many more variables? Some examples: Biological networks: samples in the hundreds (at best), molecular entities in the thousands Text mining: both documents and corpus size in the thousands, but one needs to estimate all pairwise relationships between words! Solution: impose sparsity

17 ESTIMATION OF A SPARSE INVERSE COVARIANCE MATRIX This issue was addressed in a paper by Dempster (1972) and then remained dormant for 35 years, until Meinshausen and Bühlmann (2006) developed a penalized (lasso) regression approach to solve it. Since then, there have been over 100 papers looking at various modeling, computational and inference aspects of the problem.

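18 MAXIMUM LIKELIHOOD ESTIMATION OF A SPARSE INVERSE COVARIANCE MATRIX This goal can be accomplished by optimizing the following objective function, where $\hat{\Sigma}$ is the sample covariance matrix and $\Omega \succ 0$ requires $\Omega$ to be positive definite:
$$\max_{\Omega \succ 0} \; \log\det(\Omega) - \mathrm{trace}(\hat{\Sigma}\,\Omega) - \lambda \sum_{j \neq j'} |\omega_{j,j'}|$$
Note that when $\lambda = 0$, $\hat{\Omega} = \hat{\Sigma}^{-1}$.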
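As a rough illustration of the penalized objective, the sketch below only evaluates it; it is not a graphical lasso solver. The function name, the simulated data and the perturbation are assumptions made for the example. It checks numerically that with $\lambda = 0$ the inverse sample covariance scores higher than a nearby positive definite matrix.

```python
import numpy as np

def penalized_loglik(omega, sigma_hat, lam):
    """log det(Omega) - trace(Sigma_hat Omega) - lam * sum_{j != j'} |omega_jj'|.
    Returns -inf when Omega is not positive definite."""
    if np.linalg.eigvalsh(omega).min() <= 0:
        return -np.inf
    _, logdet = np.linalg.slogdet(omega)
    off_diag_l1 = np.abs(omega).sum() - np.abs(np.diag(omega)).sum()
    return logdet - np.trace(sigma_hat @ omega) - lam * off_diag_l1

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 4))        # n = 200 observations, p = 4 variables
sigma_hat = np.cov(x, rowvar=False)      # sample covariance matrix

omega_mle = np.linalg.inv(sigma_hat)     # unpenalized maximizer (lambda = 0)
best = penalized_loglik(omega_mle, sigma_hat, lam=0.0)
perturbed = penalized_loglik(omega_mle + 0.05 * np.eye(4), sigma_hat, lam=0.0)
print(best > perturbed)
```

For $\lambda > 0$ the penalty only touches off-diagonal entries, so a diagonal $\Omega$ receives the same score for every $\lambda$; larger $\lambda$ therefore favors sparser off-diagonal structure.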

19 ILLUSTRATION OF SPARSITY AS A FUNCTION OF λ [Figure from: Sparse inverse covariance estimation with the graphical lasso]

20 ILLUSTRATION: CS WEBPAGES AT CMU [Figure labels: Faculty, Project, Computer Science Department, Student, Course]

21 CS WEBPAGES AT CMU Used about 1400 webpages and focused on the 100 most frequent words.
[Figure: "Common Structure" network over the 100 word stems, with four highlighted groups:]
(A) Webpage: site, web, page, link, home
(B) Research area/lab: current, research, lab, laboratori, area, member, group
(C) Parallel programming: distribut, parallel, system, algorithm, perform, problem, high
(D) Software development: softwar, develop, structur, data, program, algorithm, languag

22 ESTIMATED NETWORKS FOR FACULTY AND STUDENT WEBPAGES [Figure: estimated networks over the same 100 word stems. Panel (A): Student. Panel (B): Faculty.]

23 INCORPORATING NETWORK INFORMATION IN STATISTICAL TESTING PROBLEMS Rationale: High-throughput techniques (sequencing, profiling) have enabled comprehensive monitoring of biological systems Analysis of high-throughput data typically yields a list of differentially expressed genes (proteins, metabolites, etc.), obtained by statistical testing for differences between two groups, for example, normal and disease or treatment and control This list has the potential to provide insight into a given biological phenomenon or phenotype, but in many cases it is hard to extract meaning from it

24 INCORPORATING NETWORK INFORMATION IN STATISTICAL TESTING PROBLEMS (CTD) To reduce the complexity of the data, biomedical researchers have resorted to grouping genes into smaller sets (pathways) of related genes, e.g. according to their function. The number of knowledge databases that can be used for such grouping, and their content, is increasing at an accelerating pace (e.g. KEGG, GO, TRANSFAC, DIP, ...).

25 PROBLEM FORMULATION Given $n_1$ samples for the control condition and $n_2$ samples for the treatment condition of expression data for $p$ genes, together with the network of gene interactions (shown below), test for activation of selected subgraphs.

26 A LATENT VARIABLE MODEL FORMULATION
$$X_1 = \gamma_1$$
$$X_2 = \rho_{12} X_1 + \gamma_2 = \rho_{12}\gamma_1 + \gamma_2$$
$$X_3 = \rho_{23} X_2 + \gamma_3 = \rho_{23}\rho_{12}\gamma_1 + \rho_{23}\gamma_2 + \gamma_3$$

27 A LATENT VARIABLE MODEL FORMULATION Stacking the three equations for $X_1$, $X_2$ and $X_3$ gives $X = \Lambda\gamma$, where
$$\Lambda = \begin{pmatrix} 1 & 0 & 0 \\ \rho_{12} & 1 & 0 \\ \rho_{12}\rho_{23} & \rho_{23} & 1 \end{pmatrix}$$
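One way to see where $\Lambda$ comes from is to write the chain $1 \to 2 \to 3$ as $X = AX + \gamma$, with $A$ holding the edge weights, so that $X = (I - A)^{-1}\gamma$. This restatement and the numeric weights below are illustrative assumptions, not part of the slides; the sketch checks that $(I - A)^{-1}$ matches the matrix $\Lambda$ given for the three-node example.

```python
import numpy as np

rho12, rho23 = 0.8, 0.5            # hypothetical edge weights

# Weighted adjacency matrix: A[i, j] is the direct effect of node j on node i.
A = np.zeros((3, 3))
A[1, 0] = rho12                    # X1 -> X2
A[2, 1] = rho23                    # X2 -> X3

# X = A X + gamma  implies  X = (I - A)^{-1} gamma
Lam = np.linalg.inv(np.eye(3) - A)

# Influence matrix as written for the three-node chain.
Lam_slide = np.array([
    [1.0,           0.0,   0.0],
    [rho12,         1.0,   0.0],
    [rho12 * rho23, rho23, 1.0],
])
print(np.allclose(Lam, Lam_slide))
```

The entry in row 3, column 1 is the product of the weights along the path $1 \to 2 \to 3$, which is how indirect influence enters $\Lambda$.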

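28 THE LATENT VARIABLE MODEL Let $Y_i$ be the $i$th sample in the expression data. Let $Y = X + \varepsilon$, with $X$ the signal and $\varepsilon \sim N_p(0, \sigma_\varepsilon^2 I_p)$ the noise. Define latent variables $\gamma \sim N_p(\mu, \sigma_\gamma^2 I_p)$. Let the influence of the $j$th gene on the $i$th gene be $\Lambda_{ij}$; $\Lambda = [\Lambda_{ij}]$ is called the Influence Matrix of the network. Then
$$Y = \Lambda\gamma + \varepsilon, \qquad Y \sim N_p(\Lambda\mu,\; \sigma_\gamma^2 \Lambda\Lambda^T + \sigma_\varepsilon^2 I_p)$$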

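29 MIXED LINEAR MODEL REPRESENTATION Let $(Y_i^C, \mu_C, \Lambda_C)$ and $(Y_i^T, \mu_T, \Lambda_T)$ represent the data under control and treatment. Then
$$Y = \Psi\beta + \Pi\gamma + \varepsilon$$
where $\beta = (\mu_C, \mu_T)$,
$$\Psi = \begin{pmatrix} \Lambda_C & 0 \\ \vdots & \vdots \\ \Lambda_C & 0 \\ 0 & \Lambda_T \\ \vdots & \vdots \\ 0 & \Lambda_T \end{pmatrix}, \qquad \Pi = \mathrm{diag}(\Lambda_C, \dots, \Lambda_C, \Lambda_T, \dots, \Lambda_T)$$
$$E\begin{bmatrix} \gamma \\ \varepsilon \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \mathrm{Var}\begin{bmatrix} \gamma \\ \varepsilon \end{bmatrix} = \begin{bmatrix} \sigma_\gamma^2 I & 0 \\ 0 & \sigma_\varepsilon^2 I \end{bmatrix}$$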

30 INFERENCE USING MLM Let $l$ be an estimable linear combination of fixed effects (we call $l$ a contrast vector) and consider the test
$$H_0: l\beta = 0 \quad \text{vs.} \quad H_1: l\beta \neq 0$$
Consider the Wald test statistic
$$T = \frac{l\hat{\beta}}{\sqrt{l\hat{Q}l'}}$$
Under the null hypothesis, $T$ has approximately a $t$ distribution with degrees of freedom estimated using Satterthwaite's approximation method:
$$\nu = \frac{2\,(l\hat{Q}l')^2}{\tau' K \tau}$$
where $\tau$ is the gradient of $lQl'$ with respect to $(\sigma_\gamma^2, \sigma_\varepsilon^2)$ and $K$ is the empirical covariance matrix of $(\hat{\sigma}_\gamma^2, \hat{\sigma}_\varepsilon^2)$.
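The two formulas can be traced through with toy numbers. In the sketch below every input is made up, the function names are hypothetical, and $\hat{Q}$ is treated as the estimated covariance matrix of $\hat{\beta}$, which is an assumption since the slide does not define it. In practice $\hat{\beta}$, $\hat{Q}$, $\tau$ and $K$ would all come from fitting the mixed linear model.

```python
import numpy as np

def wald_statistic(l, beta_hat, Q_hat):
    """T = l beta_hat / sqrt(l Q_hat l')."""
    return float(l @ beta_hat / np.sqrt(l @ Q_hat @ l))

def satterthwaite_df(l, Q_hat, tau, K):
    """nu = 2 (l Q_hat l')^2 / (tau' K tau)."""
    return float(2.0 * (l @ Q_hat @ l) ** 2 / (tau @ K @ tau))

# Toy inputs (all hypothetical): a two-parameter contrast "control minus treatment".
l = np.array([1.0, -1.0])
beta_hat = np.array([1.0, 0.4])
Q_hat = np.diag([0.04, 0.05])      # assumed covariance of beta_hat
tau = np.array([0.1, 0.2])         # gradient of l Q l' w.r.t. the variance components
K = np.diag([0.01, 0.005])         # covariance of the variance-component estimates

T = wald_statistic(l, beta_hat, Q_hat)
nu = satterthwaite_df(l, Q_hat, tau, K)
print(T, nu)
```

The resulting `T` would then be compared against a $t$ distribution with `nu` degrees of freedom.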

31 ANALYSIS OF YEAST GALACTOSE UTILIZATION DATA

32 EXTRACTING INTERESTING PATTERNS FROM TIME-EVOLVING NETWORKS Time-evolving network data consist of ordered sequences of graphs, e.g., network time-series

33 POPULAR APPROACH: TIME SERIES ANALYSIS OF NETWORK STATISTICS Extracting time series of network statistics (e.g. centrality parameter) allows direct application of time-series methods
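As a small sketch of this approach, the code below reduces a sequence of graphs to a time series of one network statistic. Edge density is used here purely as an example statistic, and the three toy graphs and helper names are hypothetical.

```python
import numpy as np

def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on n nodes."""
    a = np.zeros((n, n))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a

def density(a):
    """Fraction of possible undirected edges that are present."""
    n = a.shape[0]
    return np.triu(a, k=1).sum() / (n * (n - 1) / 2)

# Toy network time series on 4 nodes: edges accumulate over three time points.
graphs = [
    adjacency(4, [(0, 1)]),
    adjacency(4, [(0, 1), (1, 2), (2, 3)]),
    adjacency(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]),
]

series = np.array([density(a) for a in graphs])   # one number per time point
changes = np.diff(series)                         # ready for standard time-series tools
print(series, changes)
```

Once the graphs are collapsed to `series`, any univariate time-series method applies, at the cost of discarding everything the chosen statistic does not capture.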

34 DRAWBACKS OF NETWORK STATISTICS ANALYSIS Which network statistics? Heavily context dependent Often unknown and the easiest statistics to compute may not be informative. Which are the important nodes and how did they evolve over time? Usually requires additional, ad-hoc analysis

35 DECOMPOSITION OF THE NETWORK ADJACENCY MATRICES Matrix decompositions achieve dimension reduction Preserve essential features Large amount of existing work that can be leveraged for the problem at hand Which matrix decomposition? Non-negative Matrix Factorization

36 NON-NEGATIVE MATRIX FACTORIZATION Let $Y$ be an observed $n \times p$ matrix that is non-negative. NMF expresses $Y \approx UV^T$, where $U \in \mathbb{R}_+^{n \times K}$, $V \in \mathbb{R}_+^{p \times K}$, and $K \ll \min\{n, p\}$.

37 WHY NMF? Better interpretability: $Y_{ij} \approx \sum_{k=1}^{K} U_{ik} V_{jk}$, where $U_{ik} V_{jk}$ measures the contribution of cluster $k$ to $Y_{ij}$. Adjacency matrices are typically non-negative.
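The slides do not say which algorithm computes the factorization. The sketch below uses the multiplicative updates of Lee and Seung for the Frobenius loss, a standard choice that is swapped in here as an assumption; the function name, defaults and toy matrix are likewise illustrative.

```python
import numpy as np

def nmf(Y, K, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative Y (n x p) by U @ V.T with U (n x K), V (p x K) >= 0,
    using multiplicative updates for the squared Frobenius error."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    U = rng.random((n, K)) + eps
    V = rng.random((p, K)) + eps
    for _ in range(n_iter):
        U *= (Y @ V) / (U @ (V.T @ V) + eps)     # update U with V fixed
        V *= (Y.T @ U) / (V @ (U.T @ U) + eps)   # update V with U fixed
    return U, V

# Toy adjacency-like matrix: two groups of three nodes, fully connected within group.
block = np.ones((3, 3))
Y = np.block([[block, np.zeros((3, 3))],
              [np.zeros((3, 3)), block]])

U, V = nmf(Y, K=2)
err = np.linalg.norm(Y - U @ V.T)

U1, V1 = nmf(Y, K=2, n_iter=1)                   # same start, a single iteration
err_first = np.linalg.norm(Y - U1 @ V1.T)
print(err_first, err)
```

Because all factors stay non-negative, each product $U_{ik} V_{jk}$ is an additive piece of $Y_{ij}$, which is the interpretability argument made above.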

38 MODELING NETWORK TIME SERIES Decompose spatio-temporal (network time-series) data as Space Time Basis Factors Smoothness Conditions. Intuition: Networks have short term fluctuations, but latent factors are smooth and exhibit long term trends.

39 EVOLVING FACTORIZATIONS We observe $\{Y_t, t = 1, \dots, T\}$ (network time-series), and posit $Y_t \approx U V_t^T$ or $Y_t \approx U_t V_t^T$. The choice depends on context (different network types) and goal (clustering, heaviest element search, visual exploration).

40 OBTAINING ESTIMATES Based on optimizing the following objective function:
$$O = \min_{U \geq 0,\, V_t \geq 0} \; \sum_{t=1}^{T} \| Y_t - U V_t^T \|_F^2 + \sum_{t, \tilde{t} = 1}^{T} W(t, \tilde{t})\, \| V_t - V_{\tilde{t}} \|_F^2 + \lambda_g \sum_{t=1}^{T} \mathrm{Tr}(V_t^T L_t V_t)$$
$W(t, \tilde{t})$ is a weight function that is proportional to some kernel and controls sensitivity to short-term fluctuations, similar to a Hodrick-Prescott filter. $\lambda_g$ and $L_t$ form a group penalty that controls the importance of a priori clustering knowledge.
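To make the three terms concrete, the sketch below only evaluates the objective for given factors; it does not perform the minimization. The function name, the Gaussian form of the weights $W$ and the toy dimensions are assumptions for illustration, and the matrices $L_t$ are passed in as given.

```python
import numpy as np

def evolving_objective(Ys, U, Vs, W, Ls, lam_g):
    """Fit term + temporal smoothness term + group (Laplacian) penalty."""
    T = len(Ys)
    fit = sum(np.linalg.norm(Ys[t] - U @ Vs[t].T) ** 2 for t in range(T))
    smooth = sum(W[t, s] * np.linalg.norm(Vs[t] - Vs[s]) ** 2
                 for t in range(T) for s in range(T))
    group = lam_g * sum(np.trace(Vs[t].T @ Ls[t] @ Vs[t]) for t in range(T))
    return fit + smooth + group

T, n, K = 3, 4, 2
rng = np.random.default_rng(0)
U = rng.random((n, K))
V = rng.random((n, K))

Ys = [U @ V.T for _ in range(T)]                 # networks generated by constant factors
times = np.arange(T)
W = np.exp(-0.5 * (times[:, None] - times[None, :]) ** 2)   # assumed Gaussian kernel
Ls = [np.zeros((n, n)) for _ in range(T)]        # no group information in this toy case

flat = evolving_objective(Ys, U, [V, V, V], W, Ls, lam_g=1.0)
bumpy = evolving_objective(Ys, U, [V, V + 0.1, V], W, Ls, lam_g=1.0)
print(flat, bumpy)
```

Constant factors that reproduce the data exactly incur no cost, while a factor sequence that jumps at one time point is penalized by both the fit and the smoothness terms.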

41 GROUP PENALTY Main Idea: If nodes $i$ and $j$ belong to the same group, then they should have similar coordinates given by $V_t$. Define the Laplacian as $L_t = D_t - G_t$, where
$$(G_t)_{ij} = \begin{cases} 1, & \text{if nodes } i \text{ and } j \text{ belong to the same group} \\ 0, & \text{otherwise} \end{cases}$$
$$D_t = \mathrm{diag}\Big( \sum_i (G_t)_{ij},\; j = 1, \dots, n \Big)$$

42 LAPLACIAN SMOOTHING Fact: For every $n \times K$ matrix $V_t$, we have
$$\lambda_g \mathrm{Tr}(V_t^T L_t V_t) = \frac{\lambda_g}{2} \sum_{k} \sum_{i,j} (G_t)_{ij} \big( (V_t)_{ik} - (V_t)_{jk} \big)^2$$
The group penalty $\{\lambda_g, L_t\}$ creates an abstract manifold at time $t$, and the weight function $W(t, \tilde{t})$ creates an abstract manifold between times $t$ and $\tilde{t}$. The penalties utilize external information to create a topology in which we embed and view the data.
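The trace identity can be checked numerically. The sketch below builds $G$, $D$ and $L = D - G$ from hypothetical group labels and compares $\mathrm{Tr}(V^T L V)$ with the pairwise form; the factor $1/2$ appears because the double sum over $i, j$ counts each unordered pair twice. The helper name and the labels are assumptions.

```python
import numpy as np

def group_laplacian(labels):
    """Return (L, G) with G_ij = 1 if i and j share a group, and L = D - G."""
    labels = np.asarray(labels)
    G = (labels[:, None] == labels[None, :]).astype(float)
    D = np.diag(G.sum(axis=0))
    return D - G, G

labels = [0, 0, 1, 1, 1]                       # hypothetical grouping of 5 nodes
L, G = group_laplacian(labels)

rng = np.random.default_rng(1)
V = rng.random((5, 2))                         # arbitrary n x K coordinates

lhs = np.trace(V.T @ L @ V)
diff = V[:, None, :] - V[None, :, :]           # diff[i, j, k] = V_ik - V_jk
rhs = 0.5 * (G[:, :, None] * diff ** 2).sum()
print(np.isclose(lhs, rhs))

# Coordinates that are identical within each group incur no penalty.
V_grouped = np.array([[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 3)
penalty_grouped = np.trace(V_grouped.T @ L @ V_grouped)
```

The second check shows the intended effect of the penalty: it vanishes exactly when nodes in the same group share the same coordinates.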

43 ARXIV CITATION NETWORK Citation network sequence from the e-print service arXiv for the high energy physics theory section, covering papers from October 1993 to December. There are papers (nodes) with edges (references) over 112 months. Since citations never die, we posit $Y_t = U V_t^T$.

44 ESTIMATES OF $V_t$ (TIME-VARYING PAPER IMPACT SCORES) [Figure: panels for the 1st component, the 2nd component and the sum of components, with citation network layouts I to V.]

45 HIGHEST IMPACT PAPERS BY $V_t$
Columns: Title | Authors | In-Degree | Out-Degree | # citations (Google)

Heterotic and Type I String Dynamics from Eleven Dimensions | Horava and Witten
Five-branes And M-Theory On An Orbifold | Witten
D-Branes and Topological Field Theories | Bershadsky et al.
Lectures on Superstring and M Theory Dualities | Schwarz
Type IIB Superstrings, BPS Monopoles, And Three-Dimensional Gauge Dynamics | Hanany and Witten

2000 onwards
Columns: Title | Authors | In-Degree | Out-Degree | # citations (Google)

The Large N Limit of Superconformal Field Theories and Supergravity | Maldacena
Anti De Sitter Space And Holography | Witten
Gauge Theory Correlators from Non-Critical String Theory | Klebanov and Polyakov
Large N Field Theories, String Theory and Gravity | Aharony et al.
String Theory and Noncommutative Geometry | Seiberg and Witten

46 STATIC CLUSTERING The degree (number of connections) of each paper over all time points, colored by a top community detection algorithm (Newman PNAS, 2006). The groupings are not interpretable in terms of the time-profile of each paper.

47 EIGENVECTOR CENTRALITY [Figure: average age (vertical axis) against year (horizontal axis), with one curve each for the top 5, 10, 50, 100 and 500 authorities.] The average age in months of the top authority papers over time (Kleinberg, J. ACM 1999). We see evidence for a change point around year 2000, but what about paper growth and grouping structure? More ad-hoc analysis is needed.

48 GENERAL REFERENCES
Kolaczyk, E.D. (2009), Statistical Analysis of Network Data: Methods and Models, Springer.
Fienberg, S. (2012), A Brief History of Statistical Models for Network Analysis and Open Challenges, Journal of Computational and Graphical Statistics, 20,
Michailidis, G. (2012), Statistical Challenges in Biological Networks, Journal of Computational and Graphical Statistics, 20,
Hunter, D., Krivitsky, P. and Schweinberger, M. (2012), Computational Statistical Methods for Social Network Models, Journal of Computational and Graphical Statistics, 20,

49 SPECIFIC TO THIS PRESENTATION
Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011), Joint estimation of multiple graphical models, Biometrika, 98, 1-15.
Shojaie, A. and Michailidis, G. (2009), Analysis of Gene Sets Based on The Underlying Regulatory Network, Journal of Computational Biology, 16(3):
Mankad, S. and Michailidis, G. (2012), Structural and functional discovery in dynamic networks with non-negative matrix factorization, Physical Review E, forthcoming.


More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Applications to Data Smoothing and Image Processing I

Applications to Data Smoothing and Image Processing I Applications to Data Smoothing and Image Processing I MA 348 Kurt Bryan Signals and Images Let t denote time and consider a signal a(t) on some time interval, say t. We ll assume that the signal a(t) is

More information

Least-Squares Intersection of Lines

Least-Squares Intersection of Lines Least-Squares Intersection of Lines Johannes Traa - UIUC 2013 This write-up derives the least-squares solution for the intersection of lines. In the general case, a set of lines will not intersect at a

More information

How To Understand The Network Of A Network

How To Understand The Network Of A Network Roles in Networks Roles in Networks Motivation for work: Let topology define network roles. Work by Kleinberg on directed graphs, used topology to define two types of roles: authorities and hubs. (Each

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

More information

Extracting correlation structure from large random matrices

Extracting correlation structure from large random matrices Extracting correlation structure from large random matrices Alfred Hero University of Michigan - Ann Arbor Feb. 17, 2012 1 / 46 1 Background 2 Graphical models 3 Screening for hubs in graphical model 4

More information

The Method of Least Squares

The Method of Least Squares Hervé Abdi 1 1 Introduction The least square methods (LSM) is probably the most popular technique in statistics. This is due to several factors. First, most common estimators can be casted within this

More information

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data n Introduction to the Use of ayesian Network to nalyze Gene Expression Data Cristina Manfredotti Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co. Università degli Studi Milano-icocca

More information

Numerical methods for American options

Numerical methods for American options Lecture 9 Numerical methods for American options Lecture Notes by Andrzej Palczewski Computational Finance p. 1 American options The holder of an American option has the right to exercise it at any moment

More information

An explicit link between Gaussian fields and Gaussian Markov random fields; the stochastic partial differential equation approach

An explicit link between Gaussian fields and Gaussian Markov random fields; the stochastic partial differential equation approach Intro B, W, M, & R SPDE/GMRF Example End An explicit link between Gaussian fields and Gaussian Markov random fields; the stochastic partial differential equation approach Finn Lindgren 1 Håvard Rue 1 Johan

More information

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

More information

Lecture 2 Linear functions and examples

Lecture 2 Linear functions and examples EE263 Autumn 2007-08 Stephen Boyd Lecture 2 Linear functions and examples linear equations and functions engineering examples interpretations 2 1 Linear equations consider system of linear equations y

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

More information

Social and Technological Network Analysis. Lecture 3: Centrality Measures. Dr. Cecilia Mascolo (some material from Lada Adamic s lectures)

Social and Technological Network Analysis. Lecture 3: Centrality Measures. Dr. Cecilia Mascolo (some material from Lada Adamic s lectures) Social and Technological Network Analysis Lecture 3: Centrality Measures Dr. Cecilia Mascolo (some material from Lada Adamic s lectures) In This Lecture We will introduce the concept of centrality and

More information

jorge s. marques image processing

jorge s. marques image processing image processing images images: what are they? what is shown in this image? What is this? what is an image images describe the evolution of physical variables (intensity, color, reflectance, condutivity)

More information

Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis

Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis Yu Liu School of Software [email protected] Zhen Huang School of Software [email protected] Yufeng Chen School of Computer

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

MSCA 31000 Introduction to Statistical Concepts

MSCA 31000 Introduction to Statistical Concepts MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu [email protected] Modern machine learning is rooted in statistics. You will find many familiar

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

Estimating an ARMA Process

Estimating an ARMA Process Statistics 910, #12 1 Overview Estimating an ARMA Process 1. Main ideas 2. Fitting autoregressions 3. Fitting with moving average components 4. Standard errors 5. Examples 6. Appendix: Simple estimators

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

MIMO CHANNEL CAPACITY

MIMO CHANNEL CAPACITY MIMO CHANNEL CAPACITY Ochi Laboratory Nguyen Dang Khoa (D1) 1 Contents Introduction Review of information theory Fixed MIMO channel Fading MIMO channel Summary and Conclusions 2 1. Introduction The use

More information

A mixture model for random graphs

A mixture model for random graphs A mixture model for random graphs J-J Daudin, F. Picard, S. Robin [email protected] UMR INA-PG / ENGREF / INRA, Paris Mathématique et Informatique Appliquées Examples of networks. Social: Biological:

More information

STANDING COMMITTEES OF THE SENATE

STANDING COMMITTEES OF THE SENATE STANDING COMMITTEES OF THE SENATE [Democrats in roman; Republicans in italic; Independent in SMALL CAPS; Independent Democrat in SMALL CAPS ITALIC] [Room numbers beginning with SD are in the Dirksen Building,

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott

More information

So which is the best?

So which is the best? Manifold Learning Techniques: So which is the best? Todd Wittman Math 8600: Geometric Data Analysis Instructor: Gilad Lerman Spring 2005 Note: This presentation does not contain information on LTSA, which

More information

NodeXL for Network analysis Demo/hands-on at NICAR 2012, St Louis, Feb 24. Peter Aldhous, San Francisco Bureau Chief. peter@peteraldhous.

NodeXL for Network analysis Demo/hands-on at NICAR 2012, St Louis, Feb 24. Peter Aldhous, San Francisco Bureau Chief. peter@peteraldhous. NodeXL for Network analysis Demo/hands-on at NICAR 2012, St Louis, Feb 24 Peter Aldhous, San Francisco Bureau Chief [email protected] NodeXL is a template for Microsoft Excel 2007 and 2010, which

More information

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology http://tinyurl.com/bioinf525-w16 Course Director: Dr. Barry Grant (DCM&B, [email protected]) Description: This is a three module course covering (1) Foundations of Bioinformatics, (2) Statistics in Bioinformatics, and (3) Systems

More information

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

More information

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network , pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and

More information

Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 [email protected] Abstract Probability distributions on structured representation.

More information

Lasso on Categorical Data

Lasso on Categorical Data Lasso on Categorical Data Yunjin Choi, Rina Park, Michael Seo December 14, 2012 1 Introduction In social science studies, the variables of interest are often categorical, such as race, gender, and nationality.

More information

Fast Multipole Method for particle interactions: an open source parallel library component

Fast Multipole Method for particle interactions: an open source parallel library component Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,

More information

Digital Imaging and Multimedia. Filters. Ahmed Elgammal Dept. of Computer Science Rutgers University

Digital Imaging and Multimedia. Filters. Ahmed Elgammal Dept. of Computer Science Rutgers University Digital Imaging and Multimedia Filters Ahmed Elgammal Dept. of Computer Science Rutgers University Outlines What are Filters Linear Filters Convolution operation Properties of Linear Filters Application

More information

Bioinformatics: Network Analysis

Bioinformatics: Network Analysis Bioinformatics: Network Analysis Graph-theoretic Properties of Biological Networks COMP 572 (BIOS 572 / BIOE 564) - Fall 2013 Luay Nakhleh, Rice University 1 Outline Architectural features Motifs, modules,

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information

Sanjeev Kumar. contribute

Sanjeev Kumar. contribute RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a

More information

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions SMA 50: Statistical Learning and Data Mining in Bioinformatics (also listed as 5.077: Statistical Learning and Data Mining ()) Spring Term (Feb May 200) Faculty: Professor Roy Welsch Wed 0 Feb 7:00-8:0

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information