Statistical Analysis of Network Data
|
|
|
- Anna Sullivan
- 10 years ago
- Views:
Transcription
1 Statistical Analysis of Network Data A Brief Overview Eric D. Kolaczyk Dept of Mathematics and Statistics, Boston University [email protected]
2 Introduction Focus of this Talk In this talk I will present a brief overview of the foundations common to the statistical analysis of network data across the disciplines, from a statistical perspective. Approach will be that of a high-level, whirlwind overview of the topics of network summary and visualization network sampling network modeling and inference, and network processes. Concepts will be illustrated drawing on examples from bioinformatics, computer network traffic analysis, neuroscience, and social networks.
3 ISBN Introduction Resources Organization and presentation of material in this quick talk will largely parallel that in USE R! UseR! Use R! Eric Kolaczyk Gábor Csárdi Statistical Analysis of Network Data with R Th is book is the first of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to use the base code for many tasks. igraph is the central package and has created a standard for developing and manipulating network graphs in R. Measurement and analysis are integral components of network research. As a result, there is a critical need for all sorts of statistics for network analysis, ranging from applications to methodology and theory. Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing, and as such, network analysis is an important growth area in the quantitative sciences. Their roots are in social network analysis going back to the 1930s and graph theory going back centuries. This text also builds on Eric Kolaczyk s book Statistical Analysis of Network Data (Springer, 2009). Eric Kolaczyk is a professor of statistics, and Director of the Program in Statistics, in the Department of Mathematics and Statistics at Boston University, where he also is an affiliated faculty member in the Bioinformatics Program, the Division of Systems Engineering, and the Program in Computational Neuroscience. His publications on network-based topics, beyond the development of statistical methodology and theory, include work on applications ranging from the detection of anomalous traffic patterns in computer networks to the prediction of biological function in networks of interacting proteins to the characterization of influence of groups of actors in social networks. He is an elected fellow of the American Statistical Association (ASA) and an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE). Gábor Csárdi is a research associate at the Department of Statistics at Harvard University, Cambridge, Mass. He holds a PhD in Computer Science from Eötvös University, Hungary. His research includes applications of network analysis in biology and social sciences, bioinformatics and computational biology, and graph algorithms. He created the igraph software package in 2005 and has been one of the lead developers since then. Statistics Kolaczyk Csárdi 1 Statistical Analysis of Network Data with R Eric Kolaczyk Gábor Csárdi Statistical Analysis of Network Data with R
4 Introduction Our Focus... The statistical analysis of network data i.e., analysis of measurements either of or from a system conceptualized as a network. Challenges: relational aspect to the data; complex statistical dependencies (often the focus!); high-dimensional and often massive in quantity.
5 Network Mapping Outline 1 Introduction 2 Network Mapping 3 Network Characterization 4 Network Sampling 5 Network Modeling 6 Network Inference 7 Wrap-Up
6 Network Mapping Descriptive Statistics for Networks First two topics go together naturally, i.e., network mapping characterization of network graphs May seem soft... but it s important! This is basically descriptive statistics for networks. Probably constitutes at least 2/3 of the work done in this area. Note: It s sufficiently different from standard descriptive statistics that it s something unto itself.
7 Network Mapping Network Mapping What is network mapping? Production of a network-based visualization of a complex system. What is the network? Network as a system of interest; Network as a graph representing the system; Network as a visual object. Analogue: Geography and the production of cartographic maps.
8 Network Mapping Example: Mapping Belgium Which of these is the Belgium?
9 Network Mapping Three Stages of Network Mapping Question: What is the impact on network visualization of differential privacy applied at these various stages?
10 Network Characterization Outline 1 Introduction 2 Network Mapping 3 Network Characterization 4 Network Sampling 5 Network Modeling 6 Network Inference 7 Wrap-Up
11 Network Characterization Characterization of Network Graphs: Intro Given a network graph representation of a system (i.e., perhaps a result of network mapping), often questions of interest can be phrased in terms of structural properties of the graph. social dynamics can be connected to patterns of edges among vertex triples; routes for movement of information can be approximated by shortest paths between vertices; importance of vertices can be captured through so-called centrality measures; natural groups/communities of vertices can be approached through graph partitioning.
12 Network Characterization Characterization Intro (cont.) Structural analysis of network graphs descriptive analysis; this is a standard first (and sometimes only!) step in statistical analysis of networks. Main contributors of tools are social network analysis, mathematics & computer science, statistical physics Many tools out there... two rough classes include characterization of vertices/edges, and characterization of network cohesion.
13 Network Characterization Characterization of Vertices/Edges Examples include Degree distribution Vertex/edge centrality Role/positional analysis We ll look at the vertex centrality as an example.
14 Network Characterization Centrality: Motivation Many questions related to importance of vertices. Which actors hold the reins of power? How authoritative is a WWW page considered by peers? The deletions of which genes is more likely to be lethal? How critical to traffic flow is a given Internet router? Researchers have sought to capture the notion of vertex importance through so-called centrality measures.
15 Network Characterization Centrality: An Illustration Clockwise from top left: (i) toy graph, with (ii) closeness, (iii) betweenness, and (iv) eigenvector centralities. Example and figures courtesy of Ulrik Brandes.
16 Network Characterization Network Cohesion: Motivation Many questions involve scales coarser than just individual vertices/edges. More properly considered questions regarding cohesion of network. Do friends of actors tend to be friends themselves? Which proteins are most similar to each other? Does the WWW tend to separate according to page content? What proportion of the Internet is constituted by the backbone? These questions go beyond individual vertices/edges.
17 Network Characterization Network Cohesion: Various Notions! Various notions of cohesion. density clustering connectivity flow partitioning... and more...
18 Network Characterization Illustration: Detecting Malicious Internet Sources Source/Destination Network Ding et al. a use the idea of cutvertices to detect Internet IP addresses associated with malicious behavior. Projected Source Network Corresponds to a type of (anti)social behavior. a Ding, Q., Katenka, N., Barford, P., Kolaczyk, E.D., and Crovella, M. (2012). Intrusion as (Anti)social Communication: Characterization and Detection. Proceedings of the 2012 ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
19 Network Sampling Outline 1 Introduction 2 Network Mapping 3 Network Characterization 4 Network Sampling 5 Network Modeling 6 Network Inference 7 Wrap-Up
20 Network Sampling Network Sampling: Point of Departure... Common modus operandi in network analysis: System of elements and their interactions is of interest. Collect elements and relations among elements. Represent the collected data via a network. Characterize properties of the network. Sounds good... right?
21 Network Sampling Interpretation: Two Scenarios With respect to what frame of reference are the network characteristics interpreted? 1 The collected network data are themselves the primary object of interest. 2 The collected network data are interesting primarily as representative of an underlying true network. The distinction is important! Under Scenario 2, statistical sampling theory becomes relevant... but is not trivial.
22 Network Sampling Common Network Sampling Designs Viewed from the perspective of classical statistical sampling theory, the network sampling design is important. Examples include Induced Subgraph Sampling Incident Subgraph Sampling Snowball Sampling Link Tracing
23 Network Sampling Common Network Sampling Designs (cont.) Induced Subgraph Sampling Incident Subgraph Sampling s1 s2 t1 Snowball Sampling Traceroute Sampling t2
24 Network Sampling Caveat emptor... Completely ignoring sampling issues is equivalent to using plug-in estimators. The resulting bias(es) can be both substantial and unpredictable! BA PPI AS arxiv Degree Exponent = = = Average Path Length = Betweenness = = = Assortativity = = = = = = = = Clustering Coefficient = = Lee et al (2006): Entries indicate direction of bias for induced subgraph (red), incident subgraph (green), and snowball (blue) sampling.
25 Network Sampling Accounting for Sampling Design Accounting for sampling design can be non-trivial. Classical work goes back to the 1970 s (at least), with contributions of Frank and colleagues, based mainly on Horvitz-Thompson theory. More recent resurgence of interest, across communities, has led to additional studies using both classical and modern tools. See Kolaczyk (2009), Chapter 5.
26 Network Sampling Illustration: Estimation of Degree Distribution Under a variety of sampling designs, the following holds: where E[N ] = PN, (1) N = (N 0, N 1,..., N M ): the true degree vector, for N i : the number of vertices with degree i in the original graph N = (N0, N 1,..., N M ): the observed degree vector, for : the number of vertices with degree i in the sampled graph N i P is an M + 1 by M + 1 matrix operator, where M = maximum degree in the original graph
27 Network Sampling Estimating Degree Distribution: An Inverse Problem Ove Frank (1978) proposed solving for the degree distribution by an unbiased estimator of N, defined as ˆN naive = P 1 N. (2) There are two problems with this simple solution: 1 The matrix P is typically not invertible in practice. 2 The non-negativity of the solution is not guaranteed.
28 Network Sampling Does It Really Matter? Yes! True Degree Sequence Naive Estimated Degree Sequence Number of vertices Number of vertices Degree Degree Figure : Left: ER graph with 100 vertices and 500 edges. Right: Naive estimate of degree distribution, according to equation (2). Data drawn according to induced subgraph sampling with sampling rate p = 60%.
29 Network Sampling A Modern Variant: Constrained, Penalized WLS We have recently proposed 1 a penalized weighted least squares with additional constraints. where minimize N subject to C = Cov(N ), (PN N ) T C 1 (PN N ) + λ pen(n) N i 0, i = 0, 1,... M M N i = n v, i=0 pen(n) is a penalty on the complexity of N, λ is a smoothing parameter, and n v is the total number of vertices of the true graph. 1 Zhang, Y., Kolaczyk, E.D., and Spencer, B.D. (2013). Estimating network degree distributions under sampling: an inverse problem, with applications to monitoring social media networks. Under review by Annals of Applied Statistics. arxiv (3)
30 Network Sampling Application to Online Social Networks Friendster cmty 1 5 Friendster cmty 6 15 Friendster cmty log2(freq) log2(freq) log2(freq) log2(degree) Orkut cmty log2(degree) LiveJournal cmty log2(degree) log2(freq) log2(freq) log2(freq) log2(degree) Orkut cmty log2(degree) LiveJournal cmty log2(degree) log2(freq) log2(freq) log2(freq) log2(degree) Orkut cmty log2(degree) LiveJournal cmty log2(degree) Figure : Estimating degree distributions of communities from Friendster, Orkut and Livejournal. Blue dots represent the true degree distributions, black dots represent the sample degree distributions, red dots represent the estimated degree distributions. Sampling rate=30%. Dots which correspond to a density < 10 4 are eliminated from the plot.
31 Network Sampling Estimating Approximate Epidemic Thresholds: Friendster Moments of degree distributions can be used to obtain bounds of the network s epidemic threshold τ c. Friendster cmty Friendster cmty 6 15 Friendster cmty An approximate threshold is give by the inverse of the largest eigenvalue λ 1 of the adjacency matrix (Mieghem, Omic, & Kooij 09) Simple bounds for λ 1 are M 1 M 2 λ 1 (2N e ) 1/2 (4) 0 1 / M 1 1/ M 2 1 / E 0 1 / M 1 1/ M 2 1 / E 0 1 / M 1 1/ M 2 Figure : Estimated bounds for epidemic threshold in Friendster, based on 20 samples. Four horizontal lines are the true values for 1 M2, 1 λ 1 and 1 2Ne from top to bottom. 1 / E 1 M 1,
32 Network Modeling Outline 1 Introduction 2 Network Mapping 3 Network Characterization 4 Network Sampling 5 Network Modeling 6 Network Inference 7 Wrap-Up
33 Network Modeling 2 A third option, that we observe attributes X, but lack some or all of the network G, and we wish to infer G, is usually called network topology inference, which we ll talk about last. Two Scenarios We will look at two complementary scenarios 2 : 1 we observe a network G (and possibly attributes X) and we wish to model G (and X); 2 we observe the network G, but lack some or all of the attributes X, and we wish to infer X. These are, of course, caricatures. Reality can be more complex!
34 Network Modeling High Standards Statisticians demand a great deal of their modeling: 1 theoretically plausible 2 estimable from data 3 computationally feasible estimation strategies 4 quantification of uncertainty in estimates (e.g., confidence intervals) 5 assessment of goodness-of-fit 6 understanding of the statistical properties of the overall procedure Still have a long way to go in the context of networks!
35 Network Modeling Classes of Statistical Network Models Roughly speaking, there are network-based versions of three canonical classes of statistical models: 1 regression models (i.e., ERGMs) 2 latent variable models (i.e., latent network models) 3 mixture models 3 (i.e., stochastic blocks models) 3 These may be viewed as a special case of latent variable models.
36 Network Modeling Statistical Network Models: Progress and Challenges This is one of the most active areas of research in statistics and networks. Most work in ERGMs and SBMs. A few high-level comments: ERGMS have the largest body of work associated with them but they also arguably have the greatest number of problems (i.e., degeneracy, instability, problems under sampling, as well as the least supporting formal theory). SBMs arguably have the most extensive theoretical development (i.e., fully general formulation, consistency and asymptotic normality of parameter estimates, etc.) but they can be still too simple for many modeling situations, and lack the link to regression possessed by ERGMs.
37 Network Modeling Processes on Network Graphs So far we have focused on network graphs, as representations of network systems of elements and their interactions. But often it is some quantity associated with the elements that is of most interest, rather than the network per se. Nevertheless, such quantities may be influenced by the interactions among elements. Examples: Behaviors and beliefs influenced by social interactions. Functional role of proteins influenced by their sequence similarity. Computer infections by viruses may be affected by proximity to infected computers.
38 Network Modeling Illustration: Predicting Signaling in Yeast Baker s yeast (i.e., S. cerevisiae) All proteins known to participate in cell communication and their interactions Question: Is knowledge of the function of a protein s neighbors predictive of that protein s function? In fact... yes!
39 Network Modeling A Simple Approach: Nearest-Neighbor Prediction Egos w/ ICSC A simple predictive algorithm uses nearest neighbor principles. Frequency Proportion Neighbors w/ ICSC Egos w/out ICSC Let X i = { 1, if corporate 0, if litigation Frequency Proportion Neighbors w/ ICSC Compare j N i x j N i to a threshold.
40 Network Modeling Modeling Static Network-Indexed Processes The nearest-neighbor algorithm (also sometimes called guilt-by-association ), although seemingly informal, can be quite competitive with more formal, model-based methods. Various models have been proposed for static network-indexed processes. Two commonly used classes/paradigms: Markov random field (MRF) models = Extends ideas from spatial/lattice modeling. Kernel-learning regression models = Key innovation is construction of graph kernels
41 Network Inference Outline 1 Introduction 2 Network Mapping 3 Network Characterization 4 Network Sampling 5 Network Modeling 6 Network Inference 7 Wrap-Up
42 Network Inference Network Topology Inference Recall our characterization of network mapping, as a three-stage process involving 1 Collecting relational data 2 Constructing a network graph representation 3 Producing a visualization of that graph Network topology inference is the formalization of Step 2 as a task in statistical inference. Note: Casting the task this way also allows us to formalize the question of validation.
43 Network Inference Network Topology Inference (cont.) There are many variants of this problem! Three general, and fairly broadly applicable, versions are Link prediction Association network inference Tomographic network inference
44 Network Inference Schematic Comparison of Inference Problems Original Network Link Prediction Assoc. Network Inf. Tomography
45 Network Inference Link Prediction: Examples Examples of link prediction include predicting new hyperlinks in the WWW assessing the reliability of declared protein interactions predicting international relations between countries
46 Network Inference Link Prediction: Problem & Solutions Goal is to predict the edge status Y miss for all potential edges with missing (i.e., unknown) status, based on observed status Y obs, and any other auxilliary information. Two main classes of methods proposed in the literature to date. Scoring methods Classification methods See Kolaczyk (2009), Chapter 7.2.
47 Network Inference Link Prediction: Illustration Number of Common Neighbors No Edge Edge Scoring methods come in all shapes and sizes. Number of common neighbors a basic example. Sufficient here (in a network of blogs) to obtain substantial discrimination (e.g., AUC of 0.9 in leave-one-out prediction).
48 Wrap-Up Outline 1 Introduction 2 Network Mapping 3 Network Characterization 4 Network Sampling 5 Network Modeling 6 Network Inference 7 Wrap-Up
49 Wrap-Up Wrapping Up Lots of additional topics we have not touched upon: Dynamic networks Weighted networks Community detection Etc.
50 Wrap-Up Wrapping Up (cont.) Some of the things my group is working on currently include: Uncertainty in graph summary statistics under noisy conditions. Asymptotics for parameter estimation in network models. Estimation of degree distributions from sampled data. Bayesian latent-factor network perturbation models. Multi-attribute networks.
Statistical and computational challenges in networks and cybersecurity
Statistical and computational challenges in networks and cybersecurity Hugh Chipman Acadia University June 12, 2015 Statistical and computational challenges in networks and cybersecurity May 4-8, 2015,
Protein Protein Interaction Networks
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
Practical Graph Mining with R. 5. Link Analysis
Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities
Network Analysis. BCH 5101: Analysis of -Omics Data 1/34
Network Analysis BCH 5101: Analysis of -Omics Data 1/34 Network Analysis Graphs as a representation of networks Examples of genome-scale graphs Statistical properties of genome-scale graphs The search
Statistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
Multivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
Analysis of Internet Topologies
Analysis of Internet Topologies Ljiljana Trajković [email protected] Communication Networks Laboratory http://www.ensc.sfu.ca/cnl School of Engineering Science Simon Fraser University, Vancouver, British
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
Sampling Biases in IP Topology Measurements
Sampling Biases in IP Topology Measurements Anukool Lakhina with John Byers, Mark Crovella and Peng Xie Department of Boston University Discovering the Internet topology Goal: Discover the Internet Router
Linear Programming I
Linear Programming I November 30, 2003 1 Introduction In the VCR/guns/nuclear bombs/napkins/star wars/professors/butter/mice problem, the benevolent dictator, Bigus Piguinus, of south Antarctica penguins
Distance Degree Sequences for Network Analysis
Universität Konstanz Computer & Information Science Algorithmics Group 15 Mar 2005 based on Palmer, Gibbons, and Faloutsos: ANF A Fast and Scalable Tool for Data Mining in Massive Graphs, SIGKDD 02. Motivation
IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS
IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract:- Collective behavior refers to the behaviors of individuals
Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network
, pp.273-284 http://dx.doi.org/10.14257/ijdta.2015.8.5.24 Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network Gengxin Sun 1, Sheng Bin 2 and
Part 2: Community Detection
Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
Walk-Based Centrality and Communicability Measures for Network Analysis
Walk-Based Centrality and Communicability Measures for Network Analysis Michele Benzi Department of Mathematics and Computer Science Emory University Atlanta, Georgia, USA Workshop on Innovative Clustering
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro
Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,
Introduction to nonparametric regression: Least squares vs. Nearest neighbors
Introduction to nonparametric regression: Least squares vs. Nearest neighbors Patrick Breheny October 30 Patrick Breheny STA 621: Nonparametric Statistics 1/16 Introduction For the remainder of the course,
Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009
Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths
Graph Mining and Social Network Analysis
Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
Graph theoretic approach to analyze amino acid network
Int. J. Adv. Appl. Math. and Mech. 2(3) (2015) 31-37 (ISSN: 2347-2529) Journal homepage: www.ijaamm.com International Journal of Advances in Applied Mathematics and Mechanics Graph theoretic approach to
How To Run Statistical Tests in Excel
How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting
Performance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
USE OF EIGENVALUES AND EIGENVECTORS TO ANALYZE BIPARTIVITY OF NETWORK GRAPHS
USE OF EIGENVALUES AND EIGENVECTORS TO ANALYZE BIPARTIVITY OF NETWORK GRAPHS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA [email protected] ABSTRACT This
Network (Tree) Topology Inference Based on Prüfer Sequence
Network (Tree) Topology Inference Based on Prüfer Sequence C. Vanniarajan and Kamala Krithivasan Department of Computer Science and Engineering Indian Institute of Technology Madras Chennai 600036 [email protected],
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
1. Introduction Gene regulation Genomics and genome analyses Hidden markov model (HMM)
1. Introduction Gene regulation Genomics and genome analyses Hidden markov model (HMM) 2. Gene regulation tools and methods Regulatory sequences and motif discovery TF binding sites, microrna target prediction
DATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]
An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,
http://www.aleks.com Access Code: RVAE4-EGKVN Financial Aid Code: 6A9DB-DEE3B-74F51-57304
MATH 1340.04 College Algebra Location: MAGC 2.202 Meeting day(s): TR 7:45a 9:00a, Instructor Information Name: Virgil Pierce Email: [email protected] Phone: 665.3535 Teaching Assistant Name: Indalecio
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph
Introduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
HT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
Data Science Center Eindhoven. Big Data: Challenges and Opportunities for Mathematicians. Alessandro Di Bucchianico
Data Science Center Eindhoven Big Data: Challenges and Opportunities for Mathematicians Alessandro Di Bucchianico Dutch Mathematical Congress April 15, 2015 Contents 1. Big Data terminology 2. Various
Bioinformatics: Network Analysis
Bioinformatics: Network Analysis Graph-theoretic Properties of Biological Networks COMP 572 (BIOS 572 / BIOE 564) - Fall 2013 Luay Nakhleh, Rice University 1 Outline Architectural features Motifs, modules,
The Basics of Graphical Models
The Basics of Graphical Models David M. Blei Columbia University October 3, 2015 Introduction These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. Many figures
South Carolina College- and Career-Ready (SCCCR) Probability and Statistics
South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)
Social Media Mining. Network Measures
Klout Measures and Metrics 22 Why Do We Need Measures? Who are the central figures (influential individuals) in the network? What interaction patterns are common in friends? Who are the like-minded users
How To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
Learning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
Social Media Mining. Graph Essentials
Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures
How To Get A Computer Science Degree At Appalachian State
118 Master of Science in Computer Science Department of Computer Science College of Arts and Sciences James T. Wilkes, Chair and Professor Ph.D., Duke University [email protected] http://www.cs.appstate.edu/
A Brief Introduction to Property Testing
A Brief Introduction to Property Testing Oded Goldreich Abstract. This short article provides a brief description of the main issues that underly the study of property testing. It is meant to serve as
Course: Model, Learning, and Inference: Lecture 5
Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 [email protected] Abstract Probability distributions on structured representation.
How To Cluster Of Complex Systems
Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Equivalence Concepts for Social Networks
Equivalence Concepts for Social Networks Tom A.B. Snijders University of Oxford March 26, 2009 c Tom A.B. Snijders (University of Oxford) Equivalences in networks March 26, 2009 1 / 40 Outline Structural
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS
BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 [email protected] Genomics A genome is an organism s
Traffic Prediction in Wireless Mesh Networks Using Process Mining Algorithms
Traffic Prediction in Wireless Mesh Networks Using Process Mining Algorithms Kirill Krinkin Open Source and Linux lab Saint Petersburg, Russia [email protected] Eugene Kalishenko Saint Petersburg
NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS
NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS TEST DESIGN AND FRAMEWORK September 2014 Authorized for Distribution by the New York State Education Department This test design and framework document
Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary
Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:
KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics
ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Complex Networks Analysis: Clustering Methods
Complex Networks Analysis: Clustering Methods Nikolai Nefedov Spring 2013 ISI ETH Zurich [email protected] 1 Outline Purpose to give an overview of modern graph-clustering methods and their applications
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION Exploration is a process of discovery. In the database exploration process, an analyst executes a sequence of transformations over a collection of data structures to discover useful
. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns
Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
Exploratory Spatial Data Analysis
Exploratory Spatial Data Analysis Part II Dynamically Linked Views 1 Contents Introduction: why to use non-cartographic data displays Display linking by object highlighting Dynamic Query Object classification
Practical statistical network analysis (with R and igraph)
Practical statistical network analysis (with R and igraph) Gábor Csárdi [email protected] Department of Biophysics, KFKI Research Institute for Nuclear and Particle Physics of the Hungarian Academy of
A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION
A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS Kyoungjin Park Alper Yilmaz Photogrammetric and Computer Vision Lab Ohio State University [email protected] [email protected] ABSTRACT Depending
Solving Simultaneous Equations and Matrices
Solving Simultaneous Equations and Matrices The following represents a systematic investigation for the steps used to solve two simultaneous linear equations in two unknowns. The motivation for considering
Unsupervised Outlier Detection in Time Series Data
Unsupervised Outlier Detection in Time Series Data Zakia Ferdousi and Akira Maeda Graduate School of Science and Engineering, Ritsumeikan University Department of Media Technology, College of Information
High-dimensional labeled data analysis with Gabriel graphs
High-dimensional labeled data analysis with Gabriel graphs Michaël Aupetit CEA - DAM Département Analyse Surveillance Environnement BP 12-91680 - Bruyères-Le-Châtel, France Abstract. We propose the use
Faculty of Science School of Mathematics and Statistics
Faculty of Science School of Mathematics and Statistics MATH5836 Data Mining and its Business Applications Semester 1, 2014 CRICOS Provider No: 00098G MATH5836 Course Outline Information about the course
The Open University s repository of research publications and other research outputs
Open Research Online The Open University s repository of research publications and other research outputs The degree-diameter problem for circulant graphs of degree 8 and 9 Journal Article How to cite:
Formulations of Model Predictive Control. Dipartimento di Elettronica e Informazione
Formulations of Model Predictive Control Riccardo Scattolini Riccardo Scattolini Dipartimento di Elettronica e Informazione Impulse and step response models 2 At the beginning of the 80, the early formulations
General Network Analysis: Graph-theoretic. COMP572 Fall 2009
General Network Analysis: Graph-theoretic Techniques COMP572 Fall 2009 Networks (aka Graphs) A network is a set of vertices, or nodes, and edges that connect pairs of vertices Example: a network with 5
Common Core Unit Summary Grades 6 to 8
Common Core Unit Summary Grades 6 to 8 Grade 8: Unit 1: Congruence and Similarity- 8G1-8G5 rotations reflections and translations,( RRT=congruence) understand congruence of 2 d figures after RRT Dilations
SGL: Stata graph library for network analysis
SGL: Stata graph library for network analysis Hirotaka Miura Federal Reserve Bank of San Francisco Stata Conference Chicago 2011 The views presented here are my own and do not necessarily represent the
A discussion of Statistical Mechanics of Complex Networks P. Part I
A discussion of Statistical Mechanics of Complex Networks Part I Review of Modern Physics, Vol. 74, 2002 Small Word Networks Clustering Coefficient Scale-Free Networks Erdös-Rényi model cover only parts
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS
USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS Natarajan Meghanathan Jackson State University, 1400 Lynch St, Jackson, MS, USA [email protected]
Graphical Modeling for Genomic Data
Graphical Modeling for Genomic Data Carel F.W. Peeters [email protected] Joint work with: Wessel N. van Wieringen Mark A. van de Wiel Molecular Biostatistics Unit Dept. of Epidemiology & Biostatistics
Evaluation & Validation: Credibility: Evaluating what has been learned
Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Big Data: a new era for Statistics
Big Data: a new era for Statistics Richard J. Samworth Abstract Richard Samworth (1996) is a Professor of Statistics in the University s Statistical Laboratory, and has been a Fellow of St John s since
How To Understand Multivariate Models
Neil H. Timm Applied Multivariate Analysis With 42 Figures Springer Contents Preface Acknowledgments List of Tables List of Figures vii ix xix xxiii 1 Introduction 1 1.1 Overview 1 1.2 Multivariate Models
2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering
2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering Compulsory Courses IENG540 Optimization Models and Algorithms In the course important deterministic optimization
Analysis of Internet Topologies: A Historical View
Analysis of Internet Topologies: A Historical View Mohamadreza Najiminaini, Laxmi Subedi, and Ljiljana Trajković Communication Networks Laboratory http://www.ensc.sfu.ca/cnl Simon Fraser University Vancouver,
Graph Mining Techniques for Social Media Analysis
Graph Mining Techniques for Social Media Analysis Mary McGlohon Christos Faloutsos 1 1-1 What is graph mining? Extracting useful knowledge (patterns, outliers, etc.) from structured data that can be represented
Cluster Analysis: Advanced Concepts
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
B490 Mining the Big Data. 2 Clustering
B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern
How To Write A Data Analysis
Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction
Mining Social-Network Graphs
342 Chapter 10 Mining Social-Network Graphs There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Reputation Management Algorithms & Testing. Andrew G. West November 3, 2008
Reputation Management Algorithms & Testing Andrew G. West November 3, 2008 EigenTrust EigenTrust (Hector Garcia-molina, et. al) A normalized vector-matrix multiply based method to aggregate trust such
Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis
Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis Hulin Wu, PhD, Professor (with Dr. Shuang Wu) Department of Biostatistics &
QUALITY ENGINEERING PROGRAM
QUALITY ENGINEERING PROGRAM Production engineering deals with the practical engineering problems that occur in manufacturing planning, manufacturing processes and in the integration of the facilities and
Drawing a histogram using Excel
Drawing a histogram using Excel STEP 1: Examine the data to decide how many class intervals you need and what the class boundaries should be. (In an assignment you may be told what class boundaries to
